[
  {
    "path": ".github/pull_request_template.md",
    "content": "## Description\n\n<!-- Briefly describe your changes and add links to the relevant resources -->\n\nReferences:\n\n<!-- Add links to the relevant resources -->\n\n## Type of Change\n\n<!-- Mark the appropriate option with an [x] -->\n\n- [ ] Model Update/Addition\n- [ ] Qualitative Metrics (Benchmark Results) Update/Addition\n- [ ] Provider Update/Addition\n- [ ] Other (please specify)\n\n## Checklist\n\n- [ ] I've read the [CONTRIBUTING.md](../CONTRIBUTING.md) guidelines\n- [ ] My changes are accurate and properly referenced\n"
  },
  {
    "path": ".github/workflows/schema-validation.yml",
    "content": "name: Schema Validation\n\non:\n  pull_request:\n    branches: [main]\n\njobs:\n  validate:\n    name: Validate Schema\n    runs-on: ubuntu-latest\n\n    steps:\n      - name: Checkout code\n        uses: actions/checkout@v3\n\n      - name: Setup Node.js\n        uses: actions/setup-node@v3\n        with:\n          node-version: \"16\"\n          cache: \"npm\"\n\n      - name: Install dependencies\n        run: npm ci\n\n      - name: Run schema validation\n        run: node schemas/validator.js\n"
  },
  {
    "path": ".gitignore",
    "content": "/node_modules\n"
  },
  {
    "path": ".vscode/settings.json",
    "content": "{\n  \"json.schemas\": [\n    {\n      \"fileMatch\": [\"/models/*/model.json\"],\n      \"url\": \"../schemas/models-schema.json\"\n    },\n    {\n      \"fileMatch\": [\"/models/*/qualitativemetrics.json\"],\n      \"url\": \"../schemas/qualitativemetrics-schema.json\"\n    },\n    {\n      \"fileMatch\": [\"/providers/*/provider.json\"],\n      \"url\": \"../schemas/providers-schema.json\"\n    },\n    {\n      \"fileMatch\": [\"/providers/*/providermodels.json\"],\n      \"url\": \"../schemas/providermodels-schema.json\"\n    }\n  ]\n}"
  },
  {
    "path": "CONTRIBUTING.md",
    "content": "# Contributing to LLM Stats\n\nThank you for your interest in contributing. This guide outlines the process for updating and adding information to the LLM Stats database.\n\n## Table of Contents\n\n- [Overview](#overview)\n- [Data Structure](#data-structure)\n- [General Guidelines](#general-guidelines)\n- [Organizations](#organizations)\n- [Models](#models)\n- [Benchmark Results](#benchmark-results)\n- [Benchmarks](#benchmarks)\n- [Providers](#providers)\n- [Licenses](#licenses)\n- [Validation](#validation)\n- [Submitting Your Contribution](#submitting-your-contribution)\n\n## Overview\n\nAll data is organized in the `data/data/` directory with a hierarchical structure. Each entity type has its own JSON schema definition in `schemas/` that validates the data structure.\n\n## Data Structure\n\n```\ndata/\n├── data/\n│   ├── organizations/\n│   │   └── [organization_id]/\n│   │       ├── organization.json\n│   │       └── models/\n│   │           └── [model_id]/\n│   │               ├── model.json\n│   │               └── benchmarks.json\n│   ├── providers/\n│   │   └── [provider_id]/\n│   │       ├── provider.json\n│   │       └── models.json\n│   ├── licenses/\n│   │   └── [license_id].json\n│   └── benchmarks/\n│       └── [benchmark_id].json\n└── schemas/\n    ├── organization.schema.json\n    ├── model.schema.json\n    ├── benchmark-results.schema.json\n    ├── benchmark.schema.json\n    ├── provider.schema.json\n    ├── provider-models.schema.json\n    └── license.schema.json\n```\n\n## General Guidelines\n\n1. **Accuracy First**: Ensure all data is accurate and sourced from authoritative references\n2. **Follow Structure**: Adhere to the existing file structure and naming conventions\n3. **Consistent Formatting**: Use consistent JSON formatting with 2-space indentation\n4. **One Change per PR**: Submit one pull request per logical change (e.g., one model, one provider)\n5. 
**Schema Validation**: All data files must validate against their respective JSON schemas\n6. **Required Fields**: Pay attention to required vs optional fields in schemas\n7. **Timestamps**: Use ISO 8601 format for dates (YYYY-MM-DD or full timestamp)\n\n## Organizations\n\nOrganizations represent the entities that create and release models (e.g., OpenAI, Anthropic, Meta).\n\n### Location\n\n`data/data/organizations/[organization_id]/organization.json`\n\n### Adding a New Organization\n\n1. Create a new directory: `data/data/organizations/[organization_id]/`\n2. Create `organization.json` with the following structure:\n\n```json\n{\n  \"organization_id\": \"organization-name\",\n  \"name\": \"Organization Display Name\",\n  \"website\": \"https://organization.com\",\n  \"description\": \"Brief description of the organization\",\n  \"country\": \"US\",\n  \"created_at\": \"2025-10-02T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-10-02T00:00:00.000000+00:00\"\n}\n```\n\n3. Validate against `schemas/organization.schema.json`\n4. Create a `models/` subdirectory for future models\n\n### Updating an Existing Organization\n\n1. Navigate to `data/data/organizations/[organization_id]/organization.json`\n2. Update the relevant fields\n3. Update the `updated_at` timestamp\n4. Validate against the schema\n\n## Models\n\nModels are stored within their respective organization directories.\n\n### Location\n\n`data/data/organizations/[organization_id]/models/[model_id]/`\n\n### Adding a New Model\n\n1. Ensure the organization exists in `data/data/organizations/`\n2. Ensure the license exists in `data/data/licenses/`\n3. Create a new directory: `data/data/organizations/[organization_id]/models/[model_id]/`\n4. 
Create two files in this directory:\n\n#### `model.json`\n\n```json\n{\n  \"model_id\": \"model-name-version\",\n  \"name\": \"Model Display Name\",\n  \"organization_id\": \"organization-name\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Detailed description of the model's capabilities\",\n  \"release_date\": \"2024-10-22\",\n  \"announcement_date\": \"2024-10-22\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2024-04-01\",\n  \"param_count\": 7000000000,\n  \"training_tokens\": 15000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://...\",\n  \"source_playground\": \"https://...\",\n  \"source_paper\": \"https://...\",\n  \"source_scorecard_blog_link\": \"https://...\",\n  \"source_repo_link\": \"https://github.com/...\",\n  \"source_weights_link\": \"https://huggingface.co/...\",\n  \"created_at\": \"2025-10-02T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-10-02T00:00:00.000000+00:00\",\n  \"model_family_id\": null\n}\n```\n\n**Required Fields**: `model_id`, `name`, `organization_id`, `release_date`, `license_id`, `multimodal`\n\n**Optional Fields**: Set to `null` if not applicable\n\n#### `benchmarks.json`\n\nStart with an empty array if no benchmark results are available yet:\n\n```json\n[]\n```\n\n5. Validate both files against their respective schemas\n\n### Updating an Existing Model\n\n1. Navigate to `data/data/organizations/[organization_id]/models/[model_id]/model.json`\n2. Update the relevant fields\n3. Update the `updated_at` timestamp\n4. Validate against `schemas/model.schema.json`\n\n## Benchmark Results\n\nBenchmark results are stored in the `benchmarks.json` file within each model directory.\n\n### Location\n\n`data/data/organizations/[organization_id]/models/[model_id]/benchmarks.json`\n\n### Adding Benchmark Results\n\n1. Ensure the benchmark exists in `data/data/benchmarks/`\n2. Ensure the model exists\n3. 
Add a new entry to the `benchmarks.json` array:\n\n```json\n[\n  {\n    \"benchmark_id\": \"mmlu\",\n    \"score\": 85.5,\n    \"score_unit\": \"percentage\",\n    \"source_link\": \"https://example.com/results\",\n    \"created_at\": \"2025-10-02T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-02T00:00:00.000000+00:00\"\n  }\n]\n```\n\n4. Validate against `schemas/benchmark-results.schema.json`\n\n### Updating Benchmark Results\n\n1. Locate the specific result in the array\n2. Update the `score` and/or `source_link`\n3. Update the `updated_at` timestamp\n4. Ensure the `source_link` is reliable and authoritative\n\n## Benchmarks\n\nBenchmarks define the evaluation tests used to measure model performance.\n\n### Location\n\n`data/data/benchmarks/[benchmark_id].json`\n\n### Adding a New Benchmark\n\n1. Create a new file: `data/data/benchmarks/[benchmark_id].json`\n2. Follow this structure:\n\n```json\n{\n  \"benchmark_id\": \"benchmark-name\",\n  \"name\": \"Benchmark Display Name\",\n  \"description\": \"Description of what this benchmark measures\",\n  \"category\": \"reasoning\",\n  \"source_link\": \"https://...\",\n  \"created_at\": \"2025-10-02T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-10-02T00:00:00.000000+00:00\"\n}\n```\n\n3. Validate against `schemas/benchmark.schema.json`\n\n## Providers\n\nProviders are services that offer access to models (e.g., OpenAI API, AWS Bedrock, Google Vertex AI).\n\n### Location\n\n`data/data/providers/[provider_id]/`\n\n### Adding a New Provider\n\n1. Create a new directory: `data/data/providers/[provider_id]/`\n2. Create two files:\n\n#### `provider.json`\n\n```json\n{\n  \"provider_id\": \"provider-name\",\n  \"name\": \"Provider Display Name\",\n  \"website\": \"https://provider.com\",\n  \"created_at\": \"2025-10-02T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-10-02T00:00:00.000000+00:00\"\n}\n```\n\n#### `models.json`\n\nStart with an empty array:\n\n```json\n[]\n```\n\n3. 
Validate both files against their respective schemas\n\n### Updating Provider Information\n\n1. Navigate to `data/data/providers/[provider_id]/provider.json`\n2. Update the relevant fields\n3. Update the `updated_at` timestamp\n\n### Adding Provider Models\n\nProvider models specify pricing and availability of models through specific providers.\n\n1. Open `data/data/providers/[provider_id]/models.json`\n2. Add a new entry to the array:\n\n```json\n[\n  {\n    \"provider_model_id\": \"provider-specific-id\",\n    \"model_id\": \"actual-model-id\",\n    \"provider_id\": \"provider-name\",\n    \"input_price_per_million\": 3.0,\n    \"output_price_per_million\": 15.0,\n    \"context_window\": 200000,\n    \"max_output_tokens\": 4096,\n    \"available\": true,\n    \"created_at\": \"2025-10-02T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-02T00:00:00.000000+00:00\"\n  }\n]\n```\n\n3. Ensure the model exists in `data/data/organizations/[org]/models/[model_id]/`\n4. Validate against `schemas/provider-models.schema.json`\n\n## Licenses\n\nLicenses define the terms under which models can be used.\n\n### Location\n\n`data/data/licenses/[license_id].json`\n\n### Adding a New License\n\n1. Create a new file: `data/data/licenses/[license_id].json`\n2. Follow this structure:\n\n```json\n{\n  \"license_id\": \"license-name\",\n  \"name\": \"License Display Name\",\n  \"url\": \"https://...\",\n  \"commercial_use\": true,\n  \"created_at\": \"2025-10-02T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-10-02T00:00:00.000000+00:00\"\n}\n```\n\n3. 
Validate against `schemas/license.schema.json`\n\n## Validation\n\nBefore submitting your contribution:\n\n### Manual Validation\n\nRun the validator script from the `data/` directory:\n\n```bash\ncd data\nnode schemas/validator.js\n```\n\nThis will check all JSON files against their respective schemas.\n\n### What the Validator Checks\n\n- JSON syntax correctness\n- Required fields are present\n- Field types match schema definitions\n- ID references exist (e.g., organization_id, license_id)\n- Date formats are valid\n- URLs are properly formatted\n\n### Common Validation Errors\n\n1. **Missing Required Fields**: Ensure all required fields are present\n2. **Invalid Date Format**: Use ISO 8601 format (YYYY-MM-DD or full timestamp)\n3. **Invalid References**: Ensure referenced IDs exist (organization_id, license_id, etc.)\n4. **Type Mismatch**: Ensure numbers are numbers, strings are strings, etc.\n5. **Trailing Commas**: Remove trailing commas in JSON arrays/objects\n\n## Submitting Your Contribution\n\n1. **Fork the Repository**: Create your own fork of the project\n2. **Create a Branch**: Use a descriptive branch name (e.g., `add-gpt5-model`, `update-claude-pricing`)\n3. **Make Changes**: Follow the guidelines above\n4. **Validate Locally**: Run `node schemas/validator.js` to ensure your changes are valid\n5. **Commit Changes**: Write clear, descriptive commit messages\n6. 
**Submit a Pull Request**:\n   - Provide a clear title and description\n   - List what was added or changed\n   - Include links to authoritative sources\n   - Reference any related issues\n\n### Pull Request Template\n\n```markdown\n## Description\n\nBrief description of what this PR adds or changes\n\n## Changes\n\n- Added/Updated model: [Model Name]\n- Added/Updated organization: [Organization Name]\n- Added benchmark results for: [Benchmark Name]\n\n## Sources\n\n- [Source 1]: https://...\n- [Source 2]: https://...\n\n## Validation\n\n- [ ] Ran `node schemas/validator.js` successfully\n- [ ] All files follow the correct structure\n- [ ] All references (organization_id, license_id) are valid\n```\n\n### Example Pull Request\n\nFor reference, see this [example pull request](https://github.com/JonathanChavezTamales/llm-leaderboard/pull/1).\n\n## Questions?\n\nIf you have questions or need clarification, please:\n\n1. Check the schema files in `schemas/` for detailed field definitions\n2. Look at existing data files as examples\n3. Open an issue on GitHub\n\nThank you for contributing to LLM Stats!\n"
  },
  {
    "path": "LICENSE.md",
    "content": "Creative Commons Attribution 4.0 International License\n\nCopyright (c) 2024 jc\n\nThis work is licensed under the Creative Commons Attribution 4.0 International License.\nTo view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/\nor send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.\n\nYou are free to:\n\n- Share — copy and redistribute the material in any medium or format\n- Adapt — remix, transform, and build upon the material for any purpose, even commercially\n\nUnder the following terms:\n\n- Attribution — You must give appropriate credit, provide a link to the license, and indicate\n  if changes were made. You may do so in any reasonable manner, but not in any way that\n  suggests the licensor endorses you or your use.\n\nNo additional restrictions — You may not apply legal terms or technological measures that\nlegally restrict others from doing anything the license permits.\n\nNotices:\n\n- You do not have to comply with the license for elements of the material in the public domain\n  or where your use is permitted by an applicable exception or limitation.\n- No warranties are given. The license may not give you all of the permissions necessary for\n  your intended use. For example, other rights such as publicity, privacy, or moral rights\n  may limit how you use the material.\n"
  },
  {
    "path": "README.md",
    "content": "# DEPRECATED - Updates and contributions\n\nThis repository is now deprecated and won't be getting any new updates. For contributions and corrections of the data seen in [LLM Stats](https://llm-stats.com/), please create a post with the tag \"Issue\" in the [official community section](https://llm-stats.com/posts) of the website.\n\nFor model and/or benchmark specific corrections, please create an Issue under the \"Discussion\" tab of the model/benchmark, as seen in the example below.\n\n<img width=\"1156\" height=\"575\" alt=\"Screenshot 2025-10-24 at 1 43 52 p m\" src=\"https://github.com/user-attachments/assets/b78f2cf3-f3ff-4a51-bba4-d8643865d16b\" />\n\n---\n\n<img width=\"1208\" alt=\"image\" src=\"https://github.com/user-attachments/assets/835f1e1b-73e6-405a-b7ad-096d5f5f567a\" />\n\n# LLM-Stats.com\n\n[![GitHub stars](https://img.shields.io/github/stars/JonathanChavezTamales/llm-leaderboard?style=social)](https://github.com/JonathanChavezTamales/llm-leaderboard/stargazers)\n[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)\n[![Discord](https://img.shields.io/badge/Discord-Join%20Us-7289da?logo=discord&logoColor=white)](https://discord.com/invite/RxGUBvE42d)\n[![Issues](https://img.shields.io/github/issues/JonathanChavezTamales/llm-leaderboard)](https://github.com/JonathanChavezTamales/llm-leaderboard/issues)\n\nA community-driven repository of LLM data and benchmarks. Compare and explore language models through our interactive dashboard at [llm-stats.com](https://llm-stats.com).\n\n## Found an issue or have a feature request?\n\n[Open an issue here](https://github.com/JonathanChavezTamales/llm-leaderboard/issues). 
Thank you!\n\n# Data\n\n## 🔍 What's Inside\n\nOur repository contains detailed information on hundreds of LLMs:\n\n- Model parameters, context window sizes, licensing details, capabilities, and more\n- Provider pricing and configurations\n- Performance metrics (throughput, latency)\n- Standardized benchmark results\n- Organization and license information\n\n## 📁 Data Structure\n\nAll data is organized in the `data/` directory:\n\n- `data/models/` - Model metadata and configurations\n- `data/providers/` - Provider information\n- `data/provider_models/` - Provider-specific model pricing and features\n- `data/benchmarks/` - Benchmark definitions\n- `data/model_benchmarks/` - Model benchmark scores\n- `data/organizations/` - Organization information\n- `data/licenses/` - License definitions\n\n## 🤝 How to Contribute\n\nWe welcome community contributions to keep our data accurate and up-to-date:\n\n1. **Update Model Data**\n\n   - Browse the [`data/`](data/) directory structure\n   - Submit a PR following our [contribution guidelines](CONTRIBUTING.md)\n   - Check [`schemas/`](schemas/) for JSON Schema validation\n\n## 📈 Data Quality\n\nAccuracy is our priority. 
To ensure reliable information:\n\n- All benchmark data requires verifiable source links\n- Community review process for all changes\n- Multiple source citations encouraged\n- Regular validation of submitted data\n\nThere's no guarantee that the data is 100% accurate, but we do our best to ensure it's as accurate as possible.\n\n## 🌟 Community\n\n- Join our [Discord](https://discord.gg/RxGUBvE42d) for discussions\n\n## Leaderboard\n\n| Name                                     | Release Date | Input Context | Output Context | GPQA  | MMLU  | MMLU-Pro | MATH  | HumanEval | MMMU  | LiveCodeBench |\n| ---------------------------------------- | ------------ | ------------- | -------------- | ----- | ----- | -------- | ----- | --------- | ----- | ------------- |\n| GPT-5                                    | 2025-08-07   | N/A           | N/A            | 0.857 | 0.925 | N/A      | 0.847 | 0.934     | 0.842 | N/A           |\n| o1                                       | 2024-12-17   | N/A           | N/A            | 0.780 | 0.918 | N/A      | 0.964 | 0.881     | 0.776 | N/A           |\n| GPT-4.5                                  | 2025-02-27   | N/A           | N/A            | 0.695 | 0.908 | N/A      | N/A   | 0.880     | 0.752 | N/A           |\n| o1-preview                               | 2024-09-12   | N/A           | N/A            | 0.733 | 0.908 | N/A      | 0.855 | N/A       | N/A   | N/A           |\n| Claude 3.5 Sonnet                        | 2024-10-22   | N/A           | N/A            | 0.672 | 0.904 | 0.776    | 0.783 | 0.937     | 0.683 | N/A           |\n| Claude 3.5 Sonnet                        | 2024-06-21   | N/A           | N/A            | 0.594 | 0.904 | 0.761    | 0.711 | 0.920     | N/A   | N/A           |\n| Kimi K2 0905                             | 2025-09-05   | N/A           | N/A            | 0.758 | 0.902 | 0.825    | 0.891 | 0.945     | N/A   | N/A           |\n| GPT-4.1                                  | 2025-04-14   | N/A           | 
N/A            | 0.663 | 0.902 | N/A      | N/A   | N/A       | 0.748 | N/A           |\n| Kimi K2 Instruct                         | 2025-07-11   | N/A           | N/A            | 0.751 | 0.895 | 0.811    | N/A   | 0.933     | N/A   | N/A           |\n| GPT-4o                                   | 2024-05-13   | N/A           | N/A            | 0.536 | 0.887 | 0.726    | 0.766 | 0.902     | N/A   | N/A           |\n| DeepSeek-V3                              | 2024-12-25   | N/A           | N/A            | 0.591 | 0.885 | 0.759    | N/A   | N/A       | N/A   | 0.376         |\n| Qwen3 235B A22B                          | 2025-04-29   | N/A           | N/A            | 0.475 | 0.878 | 0.682    | 0.718 | N/A       | N/A   | 0.707         |\n| Kimi K2 Base                             | 2025-07-11   | N/A           | N/A            | 0.481 | 0.878 | 0.692    | 0.702 | N/A       | N/A   | N/A           |\n| Grok-2                                   | 2024-08-13   | N/A           | N/A            | 0.560 | 0.875 | 0.755    | 0.761 | 0.884     | 0.661 | N/A           |\n| GPT-4.1 mini                             | 2025-04-14   | N/A           | N/A            | 0.650 | 0.875 | N/A      | N/A   | N/A       | 0.727 | N/A           |\n| Kimi-k1.5                                | 2025-01-20   | N/A           | N/A            | N/A   | 0.874 | N/A      | N/A   | N/A       | 0.700 | N/A           |\n| Llama 3.1 405B Instruct                  | 2024-07-23   | N/A           | N/A            | 0.507 | 0.873 | 0.733    | 0.738 | 0.890     | N/A   | N/A           |\n| o3-mini                                  | 2025-01-30   | N/A           | N/A            | 0.772 | 0.869 | N/A      | 0.979 | N/A       | N/A   | N/A           |\n| Claude 3 Opus                            | 2024-02-29   | N/A           | N/A            | 0.504 | 0.868 | 0.685    | 0.601 | 0.849     | N/A   | N/A           |\n| GPT-4 Turbo                              | 2024-04-09   | N/A           | N/A            | 
0.480 | 0.865 | N/A      | 0.726 | 0.871     | N/A   | N/A           |\n| GPT-4                                    | 2023-06-13   | N/A           | N/A            | 0.357 | 0.864 | N/A      | 0.420 | 0.670     | N/A   | N/A           |\n| Grok-2 mini                              | 2024-08-13   | N/A           | N/A            | 0.510 | 0.862 | 0.720    | 0.730 | 0.857     | 0.632 | N/A           |\n| Llama 3.2 90B Instruct                   | 2024-09-25   | N/A           | N/A            | 0.467 | 0.860 | N/A      | 0.680 | N/A       | 0.603 | N/A           |\n| Llama 3.3 70B Instruct                   | 2024-12-06   | N/A           | N/A            | 0.505 | 0.860 | 0.689    | 0.770 | 0.884     | N/A   | N/A           |\n| Nova Pro                                 | 2024-11-20   | N/A           | N/A            | 0.469 | 0.859 | N/A      | 0.766 | 0.890     | 0.617 | N/A           |\n| Gemini 1.5 Pro                           | 2024-05-01   | N/A           | N/A            | 0.591 | 0.859 | 0.758    | 0.865 | 0.841     | 0.659 | N/A           |\n| GPT-4o                                   | 2024-08-06   | N/A           | N/A            | 0.701 | 0.857 | 0.747    | N/A   | N/A       | 0.722 | N/A           |\n| Llama 4 Maverick                         | 2025-04-05   | N/A           | N/A            | 0.698 | 0.855 | 0.805    | 0.612 | N/A       | 0.734 | 0.434         |\n| o1-mini                                  | 2024-09-12   | N/A           | N/A            | 0.600 | 0.852 | N/A      | N/A   | 0.924     | N/A   | N/A           |\n| Phi 4                                    | 2024-12-12   | N/A           | N/A            | 0.561 | 0.848 | 0.704    | 0.804 | 0.826     | N/A   | N/A           |\n| Mistral Large 2                          | 2024-07-24   | N/A           | N/A            | N/A   | 0.840 | N/A      | N/A   | 0.920     | N/A   | N/A           |\n| Llama 3.1 70B Instruct                   | 2024-07-23   | N/A           | N/A            | 0.417 | 0.836 | 
0.664    | N/A   | 0.805     | N/A   | N/A           |\n| Qwen2.5 32B Instruct                     | 2024-09-19   | N/A           | N/A            | 0.495 | 0.833 | 0.690    | 0.831 | 0.884     | N/A   | N/A           |\n| Qwen2 72B Instruct                       | 2024-07-23   | N/A           | N/A            | 0.424 | 0.823 | 0.644    | 0.597 | 0.860     | N/A   | N/A           |\n| GPT-4o mini                              | 2024-07-18   | N/A           | N/A            | 0.402 | 0.820 | N/A      | 0.702 | 0.872     | 0.594 | N/A           |\n| Grok-1.5                                 | 2024-03-28   | N/A           | N/A            | 0.359 | 0.813 | 0.510    | 0.506 | 0.741     | 0.536 | N/A           |\n| Jamba 1.5 Large                          | 2024-08-22   | N/A           | N/A            | 0.369 | 0.812 | 0.535    | N/A   | N/A       | N/A   | N/A           |\n| Mistral Small 3.1 24B Base               | 2025-03-17   | N/A           | N/A            | 0.375 | 0.810 | 0.560    | N/A   | N/A       | 0.593 | N/A           |\n| Mistral Small 3 24B Base                 | 2025-01-30   | N/A           | N/A            | 0.344 | 0.807 | 0.544    | 0.460 | N/A       | N/A   | N/A           |\n| Mistral Small 3.1 24B Instruct           | 2025-03-17   | N/A           | N/A            | 0.460 | 0.806 | 0.668    | 0.693 | 0.884     | 0.593 | N/A           |\n| Nova Lite                                | 2024-11-20   | N/A           | N/A            | 0.420 | 0.805 | N/A      | 0.733 | 0.854     | 0.562 | N/A           |\n| Mistral Small 3.2 24B Instruct           | 2025-06-20   | N/A           | N/A            | 0.461 | 0.805 | 0.691    | 0.694 | N/A       | 0.625 | N/A           |\n| DeepSeek-V2.5                            | 2024-05-08   | N/A           | N/A            | N/A   | 0.804 | N/A      | 0.747 | 0.890     | N/A   | N/A           |\n| Llama 3.1 Nemotron 70B Instruct          | 2024-10-01   | N/A           | N/A            | N/A   | 0.802 | N/A      | N/A   | 
N/A       | N/A   | N/A           |\n| GPT-4.1 nano                             | 2025-04-14   | N/A           | N/A            | 0.503 | 0.801 | N/A      | N/A   | N/A       | 0.554 | N/A           |\n| Qwen2.5 14B Instruct                     | 2024-09-19   | N/A           | N/A            | 0.455 | 0.797 | 0.637    | 0.800 | 0.835     | N/A   | N/A           |\n| Llama 4 Scout                            | 2025-04-05   | N/A           | N/A            | 0.572 | 0.796 | 0.743    | 0.503 | N/A       | 0.694 | 0.328         |\n| Claude 3 Sonnet                          | 2024-02-29   | N/A           | N/A            | 0.404 | 0.790 | 0.568    | 0.431 | 0.730     | N/A   | N/A           |\n| Gemini 1.5 Flash                         | 2024-05-01   | N/A           | N/A            | 0.510 | 0.789 | 0.673    | 0.779 | 0.743     | 0.623 | N/A           |\n| Phi-3.5-MoE-instruct                     | 2024-08-23   | N/A           | N/A            | 0.368 | 0.789 | 0.453    | 0.595 | 0.707     | N/A   | N/A           |\n| Qwen2.5 VL 32B Instruct                  | 2025-02-28   | N/A           | N/A            | 0.460 | 0.784 | 0.688    | 0.822 | 0.915     | 0.700 | N/A           |\n| Nova Micro                               | 2024-11-20   | N/A           | N/A            | 0.400 | 0.776 | N/A      | 0.693 | 0.811     | N/A   | N/A           |\n| Command R+                               | 2024-08-30   | N/A           | N/A            | N/A   | 0.757 | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Gemma 2 27B                              | 2024-06-27   | N/A           | N/A            | N/A   | 0.752 | N/A      | 0.423 | 0.518     | N/A   | N/A           |\n| Claude 3 Haiku                           | 2024-03-13   | N/A           | N/A            | 0.333 | 0.752 | N/A      | 0.389 | 0.759     | N/A   | N/A           |\n| Qwen2.5-Coder 32B Instruct               | 2024-09-19   | N/A           | N/A            | N/A   | 0.751 | 0.504    | 0.572 | 0.927     | N/A   | 
0.314         |\n| Llama 3.2 11B Instruct                   | 2024-09-25   | N/A           | N/A            | 0.328 | 0.730 | N/A      | 0.519 | N/A       | 0.507 | N/A           |\n| Gemini 1.0 Pro                           | 2024-02-15   | N/A           | N/A            | 0.279 | 0.718 | N/A      | 0.326 | N/A       | 0.479 | N/A           |\n| Gemma 2 9B                               | 2024-06-27   | N/A           | N/A            | N/A   | 0.713 | N/A      | 0.366 | 0.402     | N/A   | N/A           |\n| Qwen2 7B Instruct                        | 2024-07-23   | N/A           | N/A            | 0.253 | 0.705 | 0.441    | 0.496 | 0.799     | N/A   | 0.266         |\n| GPT-3.5 Turbo                            | 2023-03-21   | N/A           | N/A            | 0.308 | 0.698 | N/A      | 0.431 | 0.680     | 0.000 | N/A           |\n| Jamba 1.5 Mini                           | 2024-08-22   | N/A           | N/A            | 0.323 | 0.697 | 0.425    | N/A   | N/A       | N/A   | N/A           |\n| Llama 3.1 8B Instruct                    | 2024-07-23   | N/A           | N/A            | 0.304 | 0.694 | 0.483    | N/A   | 0.726     | N/A   | N/A           |\n| Pixtral-12B                              | 2024-09-17   | N/A           | N/A            | N/A   | 0.692 | N/A      | 0.481 | 0.720     | 0.525 | N/A           |\n| Phi-3.5-mini-instruct                    | 2024-08-23   | N/A           | N/A            | 0.304 | 0.690 | 0.474    | 0.485 | 0.628     | N/A   | N/A           |\n| Mistral NeMo Instruct                    | 2024-07-18   | N/A           | N/A            | N/A   | 0.680 | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Qwen2.5-Coder 7B Instruct                | 2024-09-19   | N/A           | N/A            | N/A   | 0.676 | 0.401    | 0.466 | 0.884     | N/A   | 0.182         |\n| Phi 4 Mini                               | 2025-02-01   | N/A           | N/A            | 0.252 | 0.673 | 0.528    | 0.640 | N/A       | N/A   | N/A           |\n| 
Granite 3.3 8B Instruct                  | 2025-04-16   | N/A           | N/A            | N/A   | 0.655 | N/A      | N/A   | 0.897     | N/A   | N/A           |\n| Ministral 8B Instruct                    | 2024-10-16   | N/A           | N/A            | N/A   | 0.650 | N/A      | 0.545 | 0.348     | N/A   | N/A           |\n| Gemma 3n E4B Instructed LiteRT Preview   | 2025-05-20   | N/A           | N/A            | 0.237 | 0.649 | 0.506    | N/A   | 0.750     | N/A   | 0.132         |\n| Gemma 3n E4B Instructed                  | 2025-06-26   | N/A           | N/A            | 0.237 | 0.649 | 0.506    | N/A   | 0.750     | N/A   | 0.132         |\n| Granite 3.3 8B Base                      | 2025-04-16   | N/A           | N/A            | N/A   | 0.639 | N/A      | N/A   | 0.897     | N/A   | N/A           |\n| Llama 3.2 3B Instruct                    | 2024-09-25   | N/A           | N/A            | 0.328 | 0.634 | N/A      | 0.480 | N/A       | N/A   | N/A           |\n| IBM Granite 4.0 Tiny Preview             | 2025-05-02   | N/A           | N/A            | N/A   | 0.604 | N/A      | N/A   | 0.824     | N/A   | N/A           |\n| Gemma 3n E2B Instructed LiteRT (Preview) | 2025-05-20   | N/A           | N/A            | 0.248 | 0.601 | 0.405    | N/A   | 0.665     | N/A   | 0.132         |\n| Gemma 3n E2B Instructed                  | 2025-06-26   | N/A           | N/A            | 0.248 | 0.601 | 0.405    | N/A   | 0.665     | N/A   | 0.132         |\n| Kimi K2-Instruct-0905                    | 2025-09-05   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Gemma 3n E4B                             | 2025-06-26   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Gemma 3 12B                              | 2025-03-12   | N/A           | N/A            | 0.409 | N/A   | 0.606    | 0.838 | 0.854     | N/A   | 0.246         |\n| Gemini 2.5 Pro      
                     | 2025-05-20   | N/A           | N/A            | 0.830 | N/A   | N/A      | N/A   | N/A       | 0.796 | N/A           |\n| Gemini 2.0 Flash-Lite                    | 2025-02-05   | N/A           | N/A            | 0.515 | N/A   | 0.716    | 0.868 | N/A       | 0.680 | N/A           |\n| Gemini 2.5 Flash-Lite                    | 2025-06-17   | N/A           | N/A            | 0.646 | N/A   | N/A      | N/A   | N/A       | 0.729 | 0.337         |\n| Gemini 2.5 Pro Preview 06-05             | 2025-06-05   | N/A           | N/A            | 0.864 | N/A   | N/A      | N/A   | N/A       | 0.820 | 0.690         |\n| Gemini 2.5 Flash                         | 2025-05-20   | N/A           | N/A            | 0.828 | N/A   | N/A      | N/A   | N/A       | 0.797 | N/A           |\n| Gemini 2.0 Flash Thinking                | 2025-01-21   | N/A           | N/A            | 0.742 | N/A   | N/A      | N/A   | N/A       | 0.754 | N/A           |\n| Gemma 3n E2B                             | 2025-06-26   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| MedGemma 4B IT                           | 2025-05-20   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Gemma 3 4B                               | 2025-03-12   | N/A           | N/A            | 0.308 | N/A   | 0.436    | 0.756 | 0.713     | N/A   | 0.126         |\n| Gemma 3 27B                              | 2025-03-12   | N/A           | N/A            | 0.424 | N/A   | 0.675    | 0.890 | 0.878     | N/A   | 0.297         |\n| Gemma 3 1B                               | 2025-03-12   | N/A           | N/A            | 0.192 | N/A   | 0.147    | 0.480 | 0.415     | N/A   | 0.019         |\n| Gemini 1.5 Flash 8B                      | 2024-03-15   | N/A           | N/A            | 0.384 | N/A   | 0.587    | 0.587 | N/A       | 0.537 | N/A           |\n| Gemini Diffusion                        
 | 2025-05-20   | N/A           | N/A            | 0.404 | N/A   | N/A      | N/A   | 0.896     | N/A   | 0.309         |\n| Gemini 2.0 Flash                         | 2024-12-01   | N/A           | N/A            | 0.621 | N/A   | 0.764    | 0.897 | N/A       | 0.707 | 0.351         |\n| Phi 4 Mini Reasoning                     | 2025-04-30   | N/A           | N/A            | 0.520 | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Phi-3.5-vision-instruct                  | 2024-08-23   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | 0.430 | N/A           |\n| Phi 4 Reasoning Plus                     | 2025-04-30   | N/A           | N/A            | 0.689 | N/A   | 0.760    | N/A   | N/A       | N/A   | 0.531         |\n| Phi-4-multimodal-instruct                | 2025-02-01   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | 0.551 | N/A           |\n| Phi 4 Reasoning                          | 2025-04-30   | N/A           | N/A            | 0.658 | N/A   | 0.743    | N/A   | N/A       | N/A   | 0.538         |\n| Qwen3-235B-A22B-Instruct-2507            | 2025-07-22   | N/A           | N/A            | 0.775 | N/A   | 0.830    | N/A   | N/A       | N/A   | N/A           |\n| QwQ-32B                                  | 2025-03-05   | N/A           | N/A            | 0.652 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.634         |\n| Qwen3-235B-A22B-Thinking-2507            | 2025-07-25   | N/A           | N/A            | 0.811 | N/A   | 0.844    | N/A   | N/A       | N/A   | N/A           |\n| QwQ-32B-Preview                          | 2024-11-28   | N/A           | N/A            | 0.652 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.500         |\n| Qwen3-Next-80B-A3B-Thinking              | 2025-09-10   | N/A           | N/A            | 0.772 | N/A   | 0.827    | N/A   | N/A       | N/A   | N/A           |\n| Qwen2-VL-72B-Instruct                    | 2024-08-29   | 
N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Qwen3 32B                                | 2025-04-29   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | 0.657         |\n| Qwen2.5 72B Instruct                     | 2024-09-19   | N/A           | N/A            | 0.490 | N/A   | 0.711    | 0.831 | 0.866     | N/A   | 0.555         |\n| Qwen3 30B A3B                            | 2025-04-29   | N/A           | N/A            | 0.658 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.626         |\n| Qwen2.5 VL 7B Instruct                   | 2025-01-26   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | 0.586 | N/A           |\n| Qwen3-Next-80B-A3B-Base                  | 2025-09-10   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| QvQ-72B-Preview                          | 2024-12-25   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | 0.703 | N/A           |\n| Qwen2.5-Omni-7B                          | 2025-03-27   | N/A           | N/A            | 0.308 | N/A   | 0.470    | 0.715 | 0.787     | 0.592 | N/A           |\n| Qwen2.5 7B Instruct                      | 2024-09-19   | N/A           | N/A            | 0.364 | N/A   | 0.563    | 0.755 | 0.848     | N/A   | 0.287         |\n| Qwen3-Next-80B-A3B-Instruct              | 2025-09-10   | N/A           | N/A            | 0.729 | N/A   | 0.806    | N/A   | N/A       | N/A   | N/A           |\n| Qwen2.5 VL 72B Instruct                  | 2025-01-26   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | 0.702 | N/A           |\n| DeepSeek-R1-0528                         | 2025-05-28   | N/A           | N/A            | N/A   | N/A   | 0.850    | N/A   | N/A       | N/A   | 0.733         |\n| DeepSeek VL2                             | 2024-12-13   | N/A           | N/A 
           | N/A   | N/A   | N/A      | N/A   | N/A       | 0.511 | N/A           |\n| DeepSeek VL2 Tiny                        | 2024-12-13   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | 0.407 | N/A           |\n| DeepSeek R1 Zero                         | 2025-01-20   | N/A           | N/A            | 0.733 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.500         |\n| DeepSeek VL2 Small                       | 2024-12-13   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | 0.480 | N/A           |\n| DeepSeek R1 Distill Qwen 7B              | 2025-01-20   | N/A           | N/A            | 0.491 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.376         |\n| DeepSeek R1 Distill Qwen 1.5B            | 2025-01-20   | N/A           | N/A            | 0.338 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.169         |\n| DeepSeek-R1                              | 2025-01-20   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| DeepSeek R1 Distill Llama 8B             | 2025-01-20   | N/A           | N/A            | 0.490 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.396         |\n| DeepSeek R1 Distill Llama 70B            | 2025-01-20   | N/A           | N/A            | 0.652 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.575         |\n| DeepSeek R1 Distill Qwen 14B             | 2025-01-20   | N/A           | N/A            | 0.591 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.531         |\n| DeepSeek R1 Distill Qwen 32B             | 2025-01-20   | N/A           | N/A            | 0.621 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.572         |\n| DeepSeek-V3.1                            | 2025-01-10   | N/A           | N/A            | N/A   | N/A   | 0.837    | N/A   | N/A       | N/A   | 0.564         |\n| DeepSeek-V3.2-Exp                        | 2025-09-29   | N/A           | N/A            | N/A   
| N/A   | 0.850    | N/A   | N/A       | N/A   | 0.741         |\n| DeepSeek-V3 0324                         | 2025-03-25   | N/A           | N/A            | 0.684 | N/A   | 0.812    | N/A   | N/A       | N/A   | 0.492         |\n| Grok-3 Mini                              | 2025-02-17   | N/A           | N/A            | 0.840 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.804         |\n| Grok-4 Heavy                             | 2025-07-09   | N/A           | N/A            | 0.884 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.794         |\n| Grok-4                                   | 2025-07-09   | N/A           | N/A            | 0.875 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.790         |\n| Grok-3                                   | 2025-02-17   | N/A           | N/A            | 0.846 | N/A   | N/A      | N/A   | N/A       | 0.780 | 0.794         |\n| Grok-1.5V                                | 2024-04-12   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | 0.536 | N/A           |\n| GLM-4.5V                                 | 2025-08-11   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| GLM-4.5-Air                              | 2025-07-28   | N/A           | N/A            | 0.750 | N/A   | 0.814    | N/A   | N/A       | N/A   | 0.707         |\n| GLM-4.5                                  | 2025-07-28   | N/A           | N/A            | 0.791 | N/A   | 0.846    | N/A   | N/A       | N/A   | 0.729         |\n| Llama-3.3 Nemotron Super 49B v1          | 2025-03-18   | N/A           | N/A            | 0.667 | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Llama 3.1 Nemotron Nano 8B V1            | 2025-03-18   | N/A           | N/A            | 0.541 | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Llama 3.1 Nemotron Ultra 253B v1         | 2025-04-07   | N/A           | N/A            | 0.760 | N/A   | N/A      
| N/A   | N/A       | N/A   | 0.663         |\n| Claude Opus 4.1                          | 2025-08-05   | N/A           | N/A            | 0.809 | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Claude Sonnet 4.5                        | 2025-09-29   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Claude 3.5 Haiku                         | 2024-10-22   | N/A           | N/A            | 0.416 | N/A   | 0.650    | 0.694 | 0.881     | N/A   | N/A           |\n| Claude 3.7 Sonnet                        | 2025-02-24   | N/A           | N/A            | 0.848 | N/A   | N/A      | N/A   | N/A       | 0.750 | N/A           |\n| Claude Sonnet 4                          | 2025-05-22   | N/A           | N/A            | 0.754 | N/A   | N/A      | N/A   | N/A       | 0.744 | N/A           |\n| Claude Opus 4                            | 2025-05-22   | N/A           | N/A            | 0.796 | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Magistral Small 2506                     | 2025-06-10   | N/A           | N/A            | 0.682 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.513         |\n| Magistral Medium                         | 2025-06-10   | N/A           | N/A            | 0.708 | N/A   | N/A      | N/A   | N/A       | N/A   | 0.503         |\n| Devstral Medium                          | 2025-07-10   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| Pixtral Large                            | 2024-11-18   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | 0.640 | N/A           |\n| Mistral Small 3 24B Instruct             | 2025-01-30   | N/A           | N/A            | 0.453 | N/A   | 0.663    | 0.706 | 0.848     | N/A   | N/A           |\n| Devstral Small 1.1                       | 2025-07-11   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       
| N/A   | N/A           |\n| Codestral-22B                            | 2024-05-29   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | 0.811     | N/A   | N/A           |\n| Mistral Small                            | 2024-09-17   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| GPT OSS 120B                             | 2025-08-05   | N/A           | N/A            | 0.801 | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| o3                                       | 2025-04-16   | N/A           | N/A            | 0.833 | N/A   | N/A      | N/A   | N/A       | 0.829 | N/A           |\n| GPT OSS 20B                              | 2025-08-05   | N/A           | N/A            | 0.715 | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| o4-mini                                  | 2025-04-16   | N/A           | N/A            | 0.814 | N/A   | N/A      | N/A   | N/A       | 0.816 | N/A           |\n| o3-pro                                   | 2025-06-10   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| o1-pro                                   | 2024-12-17   | N/A           | N/A            | 0.790 | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| GPT-5 nano                               | 2025-08-07   | N/A           | N/A            | 0.712 | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| GPT-5 mini                               | 2025-08-07   | N/A           | N/A            | 0.823 | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n| GPT-5 Codex                              | 2025-09-15   | N/A           | N/A            | N/A   | N/A   | N/A      | N/A   | N/A       | N/A   | N/A           |\n\n<div align=\"center\">\nBuilt with 💙 by the AI community, for the AI community.<br>\nStar this repo if you find it useful!\n</div>\n"
  },
  {
    "path": "data/.github/CODEOWNERS",
    "content": "* @JonathanChavezTamales\n* @sebastiancrossa\n"
  },
  {
    "path": "data/benchmarks/aa-index.json",
    "content": "{\n  \"benchmark_id\": \"aa-index\",\n  \"name\": \"AA-Index\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"No official academic documentation found for this benchmark. Extensive research through ArXiv, IEEE/ACL/NeurIPS papers, and university research sites yielded no peer-reviewed sources for an 'aa-index' benchmark. This entry requires verification from official academic sources.\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/acebench.json",
    "content": "{\n  \"benchmark_id\": \"acebench\",\n  \"name\": \"ACEBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"ACEBench is a comprehensive benchmark for evaluating Large Language Models' tool usage capabilities across three primary evaluation types: Normal (basic tool usage scenarios), Special (tool usage with ambiguous or incomplete instructions), and Agent (multi-agent interactions simulating real-world dialogues). The benchmark covers 4,538 APIs across 8 major domains and 68 sub-domains including technology, finance, entertainment, society, health, culture, and environment, supporting both English and Chinese languages.\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.12851\",\n  \"implementation_link\": \"https://github.com/ACEBench/ACEBench\",\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-30T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/activitynet.json",
    "content": "{\n  \"benchmark_id\": \"activitynet\",\n  \"name\": \"ActivityNet\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"video\"],\n  \"modality\": \"video\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A large-scale video benchmark for human activity understanding. Provides samples from 203 activity classes with an average of 137 untrimmed videos per class and 1.41 activity instances per video, for a total of 849 video hours. The benchmark covers a wide range of complex human activities that are of interest to people in their daily living and can be used to compare algorithms for three scenarios: untrimmed video classification, trimmed activity classification, and activity detection.\",\n  \"paper_link\": \"https://openaccess.thecvf.com/content_cvpr_2015/html/Heilbron_ActivityNet_A_Large-Scale_2015_CVPR_paper.html\",\n  \"implementation_link\": \"https://github.com/activitynet/ActivityNet\",\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.378371+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.378371+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/agieval.json",
    "content": "{\n  \"benchmark_id\": \"agieval\",\n  \"name\": \"AGIEval\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\", \"math\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A human-centric benchmark for evaluating foundation models on standardized exams including college entrance exams (Gaokao, SAT), law school admission tests (LSAT), math competitions, lawyer qualification tests, and civil service exams. Contains 20 tasks (18 multiple-choice, 2 cloze) designed to assess understanding, knowledge, reasoning, and calculation abilities in real-world academic and professional contexts.\",\n  \"paper_link\": \"https://arxiv.org/abs/2304.06364\",\n  \"implementation_link\": \"https://github.com/ruixiangcui/AGIEval\",\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.970928+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.970928+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/ai2-reasoning-challenge-(arc).json",
    "content": "{\n  \"benchmark_id\": \"ai2-reasoning-challenge-(arc)\",\n  \"name\": \"AI2 Reasoning Challenge (ARC)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A dataset of 7,787 genuine grade-school level, multiple-choice science questions assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and Easy Set, where the Challenge Set contains only questions answered incorrectly by both retrieval-based and word co-occurrence algorithms. Covers multiple scientific domains including biology, physics, earth science, and chemistry, requiring scientific reasoning, causal understanding, and conceptual knowledge beyond simple fact retrieval. Includes a supporting corpus of over 14 million science sentences.\",\n  \"paper_link\": \"https://arxiv.org/abs/1803.05457\",\n  \"implementation_link\": \"https://github.com/allenai/ARC-Solvers\",\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.419158+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.419158+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/ai2d.json",
    "content": "{\n  \"benchmark_id\": \"ai2d\",\n  \"name\": \"AI2D\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"reasoning\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"AI2D is a dataset of 4,903 illustrative diagrams from grade school natural sciences (such as food webs, human physiology, and life cycles) with over 15,000 multiple choice questions and answers. The benchmark evaluates diagram understanding and visual reasoning capabilities, requiring models to interpret diagrammatic elements, relationships, and structure to answer questions about scientific concepts represented in visual form.\",\n  \"paper_link\": \"https://arxiv.org/abs/1603.07396\",\n  \"implementation_link\": \"https://allenai.org/data/diagrams\",\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.618926+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.618926+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/aider-polyglot-edit.json",
    "content": "{\n  \"benchmark_id\": \"aider-polyglot-edit\",\n  \"name\": \"Aider-Polyglot Edit\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A challenging multi-language coding benchmark that evaluates models' code editing abilities across C++, Go, Java, JavaScript, Python, and Rust. Contains 225 of Exercism's most difficult programming problems, selected as problems that were solved by 3 or fewer out of 7 top coding models. The benchmark focuses on code editing tasks and measures both correctness of solutions and proper edit format usage. Designed to re-calibrate evaluation scales so top models score between 5-50%.\",\n  \"paper_link\": null,\n  \"implementation_link\": \"https://github.com/Aider-AI/polyglot-benchmark\",\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.789839+00:00\",\n  \"updated_at\": \"2025-09-30T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/aider-polyglot.json",
    "content": "{\n  \"benchmark_id\": \"aider-polyglot\",\n  \"name\": \"Aider-Polyglot\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A coding benchmark that evaluates LLMs on 225 challenging Exercism programming exercises across C++, Go, Java, JavaScript, Python, and Rust. Models receive two attempts to solve each problem, with test error feedback provided after the first attempt if it fails. The benchmark measures both initial problem-solving ability and capacity to edit code based on error feedback, providing an end-to-end evaluation of code generation and editing capabilities across multiple programming languages.\",\n  \"paper_link\": null,\n  \"implementation_link\": \"https://github.com/Aider-AI/polyglot-benchmark\",\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-30T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/aider.json",
    "content": "{\n  \"benchmark_id\": \"aider\",\n  \"name\": \"Aider\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Aider is a comprehensive code editing benchmark based on 133 practice exercises from Exercism's Python repository, designed to evaluate AI models' ability to translate natural language coding requests into executable code that passes unit tests. The benchmark measures end-to-end code editing capabilities, including GPT's ability to edit existing code and format code changes for automated saving to local files. The Aider Polyglot variant extends this evaluation across 225 challenging exercises spanning C++, Go, Java, JavaScript, Python, and Rust, making it a standard benchmark for assessing multilingual code editing performance in AI research.\",\n  \"paper_link\": null,\n  \"implementation_link\": \"https://github.com/Aider-AI/aider\",\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.566857+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.566857+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/aime-2024.json",
    "content": "{\n  \"benchmark_id\": \"aime-2024\",\n  \"name\": \"AIME 2024\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"American Invitational Mathematics Examination 2024, consisting of 30 challenging mathematical reasoning problems from AIME I and AIME II competitions. Each problem requires an integer answer between 0-999 and tests advanced mathematical reasoning across algebra, geometry, combinatorics, and number theory. Used as a benchmark for evaluating mathematical reasoning capabilities in large language models at Olympiad-level difficulty.\",\n  \"paper_link\": \"https://arxiv.org/html/2503.21380v2\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.941652+00:00\",\n  \"updated_at\": \"2025-09-30T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/aime-2025.json",
    "content": "{\n  \"benchmark_id\": \"aime-2025\",\n  \"name\": \"AIME 2025\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"All 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II), testing olympiad-level mathematical reasoning with integer answers from 000-999. Used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical deductions and structured symbolic reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/2503.21380\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/aime.json",
    "content": "{\n  \"benchmark_id\": \"aime\",\n  \"name\": \"AIME\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"American Invitational Mathematics Examination (AIME) benchmark for evaluating mathematical reasoning capabilities of large language models. Contains 30 challenging mathematical problems from AIME 2024 competition that require multi-step reasoning and advanced mathematical insight. Each problem has an integer answer between 000-999.\",\n  \"paper_link\": \"https://arxiv.org/html/2503.21380v2\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.057279+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.057279+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/aitz-em.json",
    "content": "{\n  \"benchmark_id\": \"aitz-em\",\n  \"name\": \"AITZ_EM\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Android-In-The-Zoo (AitZ) benchmark for evaluating autonomous GUI agents on smartphones. Contains 18,643 screen-action pairs with chain-of-action-thought annotations spanning over 70 Android apps. Designed to connect perception (screen layouts and UI elements) with cognition (action decision-making) for natural language-triggered smartphone task completion.\",\n  \"paper_link\": \"https://arxiv.org/abs/2403.02713\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.785085+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.785085+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/alignbench.json",
    "content": "{\n  \"benchmark_id\": \"alignbench\",\n  \"name\": \"AlignBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"language\", \"math\", \"reasoning\", \"roleplay\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"AlignBench is a comprehensive multi-dimensional benchmark for evaluating Chinese alignment of Large Language Models. It contains 8 main categories: Fundamental Language Ability, Advanced Chinese Understanding, Open-ended Questions, Writing Ability, Logical Reasoning, Mathematics, Task-oriented Role Play, and Professional Knowledge. The benchmark includes 683 real-scenario rooted queries with human-verified references and uses a rule-calibrated multi-dimensional LLM-as-Judge approach with Chain-of-Thought for evaluation.\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.18743\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.542033+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.542033+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/alpacaeval-2.0.json",
    "content": "{\n  \"benchmark_id\": \"alpacaeval-2.0\",\n  \"name\": \"AlpacaEval 2.0\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"creativity\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"AlpacaEval 2.0 is a length-controlled automatic evaluator for instruction-following language models that uses GPT-4 Turbo to assess model responses against a baseline. It evaluates models on 805 diverse instruction-following tasks including creative writing, classification, programming, and general knowledge questions. The benchmark achieves 0.98 Spearman correlation with ChatBot Arena while being fast (< 3 minutes) and affordable (< $10 in OpenAI credits). It addresses length bias in automatic evaluation through length-controlled win-rates and uses weighted scoring based on response quality.\",\n  \"paper_link\": \"https://arxiv.org/abs/2404.04475\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.038178+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.038178+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/amc-2022-23.json",
    "content": "{\n  \"benchmark_id\": \"amc-2022-23\",\n  \"name\": \"AMC_2022_23\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"American Mathematics Competition problems from the 2022-23 academic year, consisting of multiple-choice mathematics competition problems designed for high school students. These problems require advanced mathematical reasoning, problem-solving strategies, and mathematical knowledge covering topics like algebra, geometry, number theory, and combinatorics. The benchmark is derived from the official AMC competitions sponsored by the Mathematical Association of America.\",\n  \"paper_link\": \"https://arxiv.org/abs/2103.03874\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.992903+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.992903+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/android-control-high-em.json",
    "content": "{\n  \"benchmark_id\": \"android-control-high-em\",\n  \"name\": \"Android Control High_EM\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Android device control benchmark using high exact match evaluation metric for assessing agent performance on mobile interface tasks\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.792498+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.792498+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/android-control-low-em.json",
    "content": "{\n  \"benchmark_id\": \"android-control-low-em\",\n  \"name\": \"Android Control Low_EM\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Android control benchmark evaluating autonomous agents on mobile device interaction tasks with low exact match scoring criteria\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.800337+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.800337+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/androidworld-sr.json",
    "content": "{\n  \"benchmark_id\": \"androidworld-sr\",\n  \"name\": \"AndroidWorld_SR\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"AndroidWorld Success Rate (SR) benchmark - A dynamic benchmarking environment for autonomous agents operating on Android devices. Evaluates agents on 116 programmatic tasks across 20 real-world Android apps using multimodal inputs (screen screenshots, accessibility trees, and natural language instructions). Measures success rate of agents completing tasks like sending messages, creating calendar events, and navigating mobile interfaces. Published at ICLR 2025. Best current performance: 30.6% success rate (M3A agent) vs 80.0% human performance.\",\n  \"paper_link\": \"https://arxiv.org/abs/2405.14573\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.808659+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.808659+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/api-bank.json",
    "content": "{\n  \"benchmark_id\": \"api-bank\",\n  \"name\": \"API-Bank\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive benchmark for tool-augmented LLMs that evaluates API planning, retrieval, and calling capabilities. Contains 314 tool-use dialogues with 753 API calls across 73 API tools, designed to assess how effectively LLMs can utilize external tools and overcome obstacles in tool leveraging.\",\n  \"paper_link\": \"https://arxiv.org/abs/2304.08244\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.374447+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.374447+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/arc-agi-v2.json",
    "content": "{\n  \"benchmark_id\": \"arc-agi-v2\",\n  \"name\": \"ARC-AGI v2\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"vision\", \"spatial_reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"ARC-AGI-2 is an upgraded benchmark for measuring abstract reasoning and problem-solving abilities in AI systems through visual grid transformation tasks. It evaluates fluid intelligence via input-output grid pairs (1x1 to 30x30) using colored cells (0-9), requiring models to identify underlying transformation rules from demonstration examples and apply them to test cases. Designed to be easy for humans but challenging for AI, focusing on core cognitive abilities like spatial reasoning, pattern recognition, and compositional generalization.\",\n  \"paper_link\": \"https://arxiv.org/abs/2505.11831\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.916360+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.916360+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/arc-agi.json",
    "content": "{\n  \"benchmark_id\": \"arc-agi\",\n  \"name\": \"ARC-AGI\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"vision\", \"spatial_reasoning\"],\n  \"modality\": \"image\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) is a benchmark designed to test general intelligence and abstract reasoning capabilities through visual grid-based transformation tasks. Each task consists of 2-5 demonstration pairs showing input grids transformed into output grids according to underlying rules, with test-takers required to infer these rules and apply them to novel test inputs. The benchmark uses colored grids (up to 30x30) with 10 discrete colors/symbols, designed to measure human-like general fluid intelligence and skill-acquisition efficiency with minimal prior knowledge.\",\n  \"paper_link\": \"https://arxiv.org/abs/1911.01547\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.187761+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.187761+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/arc-c.json",
    "content": "{\n  \"benchmark_id\": \"arc-c\",\n  \"name\": \"ARC-C\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"The AI2 Reasoning Challenge (ARC) Challenge Set is a multiple-choice question-answering benchmark containing grade-school level science questions that require advanced reasoning capabilities. ARC-C specifically contains questions that were answered incorrectly by both retrieval-based and word co-occurrence algorithms, making it a particularly challenging subset designed to test commonsense reasoning abilities in AI systems.\",\n  \"paper_link\": \"https://arxiv.org/abs/1803.05457\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.052939+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.052939+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/arc-e.json",
    "content": "{\n  \"benchmark_id\": \"arc-e\",\n  \"name\": \"ARC-E\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"ARC-E (AI2 Reasoning Challenge - Easy Set) is a subset of grade-school level, multiple-choice science questions that requires knowledge and reasoning capabilities. Part of the AI2 Reasoning Challenge dataset containing 5,197 questions that test scientific reasoning and factual knowledge. The Easy Set contains questions that are answerable by retrieval-based and word co-occurrence algorithms, making them more accessible than the Challenge Set.\",\n  \"paper_link\": \"https://arxiv.org/abs/1803.05457\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.192662+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.192662+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/arc.json",
    "content": "{\n  \"benchmark_id\": \"arc\",\n  \"name\": \"Arc\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"The Abstraction and Reasoning Corpus (ARC) is a benchmark designed to measure human-like general fluid intelligence through grid-based reasoning tasks. It consists of 800 tasks (400 training, 400 evaluation) where each task presents input-output grids that require understanding abstract patterns and transformations. Test-takers must produce exactly correct output grids for all test inputs in a task to solve it, with 3 trials allowed per test input. ARC aims to enable fair comparisons of general intelligence between AI systems and humans using priors designed to be as close as possible to innate human priors.\",\n  \"paper_link\": \"https://arxiv.org/abs/1911.01547\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.967150+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.967150+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/arena-hard-v2.json",
    "content": "{\n  \"benchmark_id\": \"arena-hard-v2\",\n  \"name\": \"Arena-Hard v2\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\", \"creativity\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Arena-Hard-Auto v2 is a challenging benchmark consisting of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M, designed to evaluate large language models on real-world user queries. The benchmark covers diverse domains including open-ended software engineering problems, mathematics, creative writing, and technical problem-solving. It uses LLM-as-a-Judge for automatic evaluation, achieving 98.6% correlation with human preference rankings while providing 3x higher separation of model performances compared to MT-Bench. The benchmark emphasizes prompt specificity, complexity, and domain knowledge to better distinguish between model capabilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.11939\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-08-03T22:06:11.411643+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:11.411643+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/arena-hard.json",
    "content": "{\n  \"benchmark_id\": \"arena-hard\",\n  \"name\": \"Arena Hard\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\", \"creativity\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs consisting of 500 challenging real-world prompts curated by BenchBuilder. It includes open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark uses LLM-as-a-Judge methodology with GPT-4.1 and Gemini-2.5 as automatic judges to approximate human preference. Arena-Hard achieves 98.6% correlation with human preference rankings and provides 3x higher separation of model performances compared to MT-Bench, making it highly effective for distinguishing between models of similar quality.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.11939\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.079874+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.079874+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/attaq.json",
    "content": "{\n  \"benchmark_id\": \"attaq\",\n  \"name\": \"AttaQ\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"safety\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"AttaQ is a unique dataset containing adversarial examples in the form of questions designed to provoke harmful or inappropriate responses from large language models. The benchmark evaluates safety vulnerabilities by using specialized clustering techniques that analyze both the semantic similarity of input attacks and the harmfulness of model responses, facilitating targeted improvements to model safety mechanisms.\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.04124\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.079764+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.079764+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/autologi.json",
    "content": "{\n  \"benchmark_id\": \"autologi\",\n  \"name\": \"AutoLogi\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"AutoLogi is an automated method for synthesizing open-ended logic puzzles to evaluate reasoning abilities of Large Language Models. The benchmark addresses limitations of existing multiple-choice reasoning evaluations by featuring program-based verification and controllable difficulty levels. It includes 1,575 English and 883 Chinese puzzles, enabling more reliable evaluation that better distinguishes models' reasoning capabilities across languages.\",\n  \"paper_link\": \"https://arxiv.org/abs/2502.16906\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/bbh.json",
    "content": "{\n  \"benchmark_id\": \"bbh\",\n  \"name\": \"BBH\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"math\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Big-Bench Hard (BBH) is a suite of 23 challenging tasks selected from BIG-Bench for which prior language model evaluations did not outperform the average human-rater. These tasks require multi-step reasoning across diverse domains including arithmetic, logical reasoning, reading comprehension, and commonsense reasoning. The benchmark was designed to test capabilities believed to be beyond current language models and focuses on evaluating complex reasoning skills including temporal understanding, spatial reasoning, causal understanding, and deductive logical reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/2210.09261\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.031859+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.031859+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/bfcl-v2.json",
    "content": "{\n  \"benchmark_id\": \"bfcl-v2\",\n  \"name\": \"BFCL v2\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Berkeley Function Calling Leaderboard (BFCL) v2 is a comprehensive benchmark for evaluating large language models' function calling capabilities. It features 2,251 question-function-answer pairs with enterprise and OSS-contributed functions, addressing data contamination and bias through live, user-contributed scenarios. The benchmark evaluates AST accuracy, executable accuracy, irrelevance detection, and relevance detection across multiple programming languages (Python, Java, JavaScript) and includes complex real-world function calling scenarios with multi-lingual prompts.\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.15334\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.444045+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.444045+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/bfcl-v3-multiturn.json",
    "content": "{\n  \"benchmark_id\": \"bfcl-v3-multiturn\",\n  \"name\": \"BFCL_v3_MultiTurn\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Berkeley Function Calling Leaderboard (BFCL) V3 MultiTurn benchmark that evaluates large language models' ability to handle multi-turn and multi-step function calling scenarios. The benchmark introduces complex interactions requiring models to manage sequential function calls, handle conversational context across multiple turns, and make dynamic decisions about when and how to use available functions. BFCL V3 uses state-based evaluation by verifying the actual state of API systems after function execution, providing more realistic assessment of function calling capabilities in agentic applications.\",\n  \"paper_link\": \"https://openreview.net/forum?id=2GmDdhBdDk\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.962161+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.962161+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/bfcl-v3.json",
    "content": "{\n  \"benchmark_id\": \"bfcl-v3\",\n  \"name\": \"BFCL-v3\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Berkeley Function Calling Leaderboard v3 (BFCL-v3) is an advanced benchmark that evaluates large language models' function calling capabilities through multi-turn and multi-step interactions. It introduces extended conversational exchanges where models must retain contextual information across turns and execute multiple internal function calls for complex user requests. The benchmark includes 1000 test cases across domains like vehicle control, trading bots, travel booking, and file system management, using state-based evaluation to verify both system state changes and execution path correctness.\",\n  \"paper_link\": \"https://openreview.net/forum?id=2GmDdhBdDk\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-08-03T22:06:11.216985+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:11.216985+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/bfcl.json",
    "content": "{\n  \"benchmark_id\": \"bfcl\",\n  \"name\": \"BFCL\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"The Berkeley Function Calling Leaderboard (BFCL) is the first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' ability to invoke functions. It evaluates serial and parallel function calls across multiple programming languages (Python, Java, JavaScript, REST API) using a novel Abstract Syntax Tree (AST) evaluation method. The benchmark consists of over 2,000 question-function-answer pairs covering diverse application domains and complex use cases including multiple function calls, parallel function calls, and multi-turn interactions.\",\n  \"paper_link\": \"https://openreview.net/pdf?id=2GmDdhBdDk\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.763704+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.763704+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/big-bench-extra-hard.json",
    "content": "{\n  \"benchmark_id\": \"big-bench-extra-hard\",\n  \"name\": \"BIG-Bench Extra Hard\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"BIG-Bench Extra Hard (BBEH) is a challenging benchmark that replaces each task in BIG-Bench Hard with a novel task that probes similar reasoning capabilities but exhibits significantly increased difficulty. The benchmark contains 23 tasks testing diverse reasoning skills including many-hop reasoning, causal understanding, spatial reasoning, temporal arithmetic, geometric reasoning, linguistic reasoning, logic puzzles, and humor understanding. Designed to address saturation on existing benchmarks where state-of-the-art models achieve near-perfect scores, BBEH shows substantial room for improvement with best models achieving only 9.8-44.8% average accuracy.\",\n  \"paper_link\": \"https://arxiv.org/abs/2502.19187\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.279517+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.279517+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/big-bench-hard.json",
    "content": "{\n  \"benchmark_id\": \"big-bench-hard\",\n  \"name\": \"BIG-Bench Hard\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"math\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks selected because prior language model evaluations did not outperform average human-rater performance. The benchmark contains 6,511 evaluation examples testing various forms of multi-step reasoning including arithmetic, logical reasoning (Boolean expressions, logical deduction), geometric reasoning, temporal reasoning, and language understanding. Tasks require capabilities such as causal judgment, object counting, navigation, pattern recognition, and complex problem solving.\",\n  \"paper_link\": \"https://arxiv.org/abs/2210.09261\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.222809+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.222809+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/big-bench.json",
    "content": "{\n  \"benchmark_id\": \"big-bench\",\n  \"name\": \"BIG-Bench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"math\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark consisting of 204+ tasks designed to probe large language models and extrapolate their future capabilities. It covers diverse domains including linguistics, mathematics, common-sense reasoning, biology, physics, social bias, software development, and more. The benchmark focuses on tasks believed to be beyond current language model capabilities and includes both English and non-English tasks across multiple languages.\",\n  \"paper_link\": \"https://arxiv.org/abs/2206.04615\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.926457+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.926457+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/bigcodebench-full.json",
    "content": "{\n  \"benchmark_id\": \"bigcodebench-full\",\n  \"name\": \"BigCodeBench-Full\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive benchmark that evaluates large language models' ability to solve complex, practical programming tasks via code generation. Contains 1,140 fine-grained tasks across 7 domains using function calls from 139 libraries. Challenges LLMs to invoke multiple function calls as tools and handle complex instructions for realistic software engineering and general-purpose reasoning tasks.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.15877\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.508830+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.508830+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/bigcodebench-hard.json",
    "content": "{\n  \"benchmark_id\": \"bigcodebench-hard\",\n  \"name\": \"BigCodeBench-Hard\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"BigCodeBench-Hard is a subset of 148 challenging programming tasks from BigCodeBench, designed to evaluate large language models' ability to solve complex, real-world programming problems. These tasks require diverse function calls from multiple libraries across 7 domains including computation, networking, data analysis, and visualization. The benchmark tests compositional reasoning and the ability to implement complex instructions that span 139 libraries with an average of 2.8 libraries per task.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.15877\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.512684+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.512684+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/bigcodebench.json",
    "content": "{\n  \"benchmark_id\": \"bigcodebench\",\n  \"name\": \"BigCodeBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained programming tasks. Evaluates code generation with diverse function calls and complex instructions, featuring two variants: Complete (code completion based on comprehensive docstrings) and Instruct (generating code from natural language instructions).\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.15877\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.048433+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.048433+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/bird-sql-(dev).json",
    "content": "{\n  \"benchmark_id\": \"bird-sql-(dev)\",\n  \"name\": \"Bird-SQL (dev)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQLs) is a comprehensive text-to-SQL benchmark containing 12,751 question-SQL pairs across 95 databases (33.4 GB total) spanning 37+ professional domains. It evaluates large language models' ability to convert natural language to executable SQL queries in real-world scenarios with complex database schemas and dirty data.\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.03111\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.410905+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.410905+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/blink.json",
    "content": "{\n  \"benchmark_id\": \"blink\",\n  \"name\": \"BLINK\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"BLINK is a benchmark for multimodal language models that focuses on core visual perception abilities, introduced in the paper 'BLINK: Multimodal Large Language Models Can See but Not Perceive'. It reformats 14 classic computer vision tasks into 3,807 multiple-choice questions paired with single or multiple images and visual prompting. Tasks include relative depth estimation, visual correspondence, forensics detection, multi-view reasoning, counting, object localization, and spatial reasoning that humans can solve 'within a blink'.\",\n  \"paper_link\": \"https://arxiv.org/abs/2404.12390\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.326398+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.326398+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/boolq.json",
    "content": "{\n  \"benchmark_id\": \"boolq\",\n  \"name\": \"BoolQ\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"BoolQ is a reading comprehension dataset for yes/no questions containing 15,942 naturally occurring examples. Each example consists of a question, passage, and boolean answer, where questions are generated in unprompted and unconstrained settings. The dataset challenges models with complex, non-factoid information requiring entailment-like inference to solve.\",\n  \"paper_link\": \"https://arxiv.org/abs/1905.10044\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.117325+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.117325+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/browsecomp-long-128k.json",
    "content": "{\n  \"benchmark_id\": \"browsecomp-long-128k\",\n  \"name\": \"BrowseComp Long Context 128k\",\n  \"parent_benchmark_id\": \"browsecomp\",\n  \"categories\": [\"reasoning\", \"search\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A challenging benchmark for evaluating web browsing agents' ability to persistently navigate the internet and find hard-to-locate, entangled information. Comprises 1,266 questions requiring strategic reasoning, creative search, and interpretation of retrieved content, with short and easily verifiable answers.\",\n  \"paper_link\": \"https://arxiv.org/abs/2504.12516\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/browsecomp-long-256k.json",
    "content": "{\n  \"benchmark_id\": \"browsecomp-long-256k\",\n  \"name\": \"BrowseComp Long Context 256k\",\n  \"parent_benchmark_id\": \"browsecomp\",\n  \"categories\": [\"reasoning\", \"search\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"BrowseComp is a benchmark for measuring the ability of agents to browse the web, comprising 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. The benchmark focuses on questions where answers are obscure, time-invariant, and well-supported by evidence scattered across the open web.\",\n  \"paper_link\": \"https://arxiv.org/abs/2504.12516\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/browsecomp-zh.json",
    "content": "{\n  \"benchmark_id\": \"browsecomp-zh\",\n  \"name\": \"BrowseComp-zh\",\n  \"parent_benchmark_id\": \"browsecomp\",\n  \"categories\": [\"reasoning\", \"search\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"zh\",\n  \"description\": \"A high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web, consisting of 289 multi-hop questions spanning 11 diverse domains including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, and easily verifiable answers, requiring sophisticated reasoning and information reconciliation beyond basic retrieval. The benchmark addresses linguistic, infrastructural, and censorship-related complexities in Chinese web environments.\",\n  \"paper_link\": \"https://arxiv.org/abs/2504.19314\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-15T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/browsecomp.json",
    "content": "{\n  \"benchmark_id\": \"browsecomp\",\n  \"name\": \"BrowseComp\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"search\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers.\",\n  \"paper_link\": \"https://arxiv.org/abs/2504.12516\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/c-eval.json",
    "content": "{\n  \"benchmark_id\": \"c-eval\",\n  \"name\": \"C-Eval\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"zh\",\n  \"description\": \"C-Eval is a comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. It comprises 13,948 multiple-choice questions across 52 diverse disciplines spanning humanities, science, and engineering, with four difficulty levels: middle school, high school, college, and professional. The benchmark includes C-Eval Hard, a subset of very challenging subjects requiring advanced reasoning abilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.08322\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.917478+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.917478+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/cbnsl.json",
    "content": "{\n  \"benchmark_id\": \"cbnsl\",\n  \"name\": \"CBNSL\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Curriculum Learning of Bayesian Network Structures (CBNSL) benchmark for evaluating algorithms that learn Bayesian network structures from data using curriculum learning techniques. The benchmark uses networks from the bnlearn repository and evaluates structure learning performance using BDeu scoring metrics.\",\n  \"paper_link\": \"http://proceedings.mlr.press/v45/Zhao15a.pdf\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.590999+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.590999+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/cc-ocr.json",
    "content": "{\n  \"benchmark_id\": \"cc-ocr\",\n  \"name\": \"CC-OCR\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"text-to-image\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive OCR benchmark for evaluating Large Multimodal Models (LMMs) in literacy. Comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. Contains 39 subsets with 7,058 fully annotated images, 41% sourced from real applications. Tests capabilities including text grounding, multi-orientation text recognition, and detecting hallucination/repetition across diverse visual challenges.\",\n  \"paper_link\": \"https://arxiv.org/abs/2412.02210\",\n  \"implementation_link\": \"https://github.com/AlibabaResearch/AdvancedLiterateMachinery\",\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.652986+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.652986+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/cfeval.json",
    "content": "{\n  \"benchmark_id\": \"cfeval\",\n  \"name\": \"CFEval\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 10000.0,\n  \"language\": \"en\",\n  \"description\": \"CFEval benchmark for evaluating code generation and problem-solving capabilities\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-15T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/charadessta.json",
    "content": "{\n  \"benchmark_id\": \"charadessta\",\n  \"name\": \"CharadesSTA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"video\", \"language\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Charades-STA is a benchmark dataset for temporal activity localization via language queries, extending the Charades dataset with sentence temporal annotations. It contains 12,408 training and 3,720 testing segment-sentence pairs from videos with natural language descriptions and precise temporal boundaries for localizing activities based on language queries.\",\n  \"paper_link\": \"https://arxiv.org/abs/1705.02101\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.760027+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.760027+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/chartqa.json",
    "content": "{\n  \"benchmark_id\": \"chartqa\",\n  \"name\": \"ChartQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"vision\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"ChartQA is a large-scale benchmark comprising 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to evaluate models' abilities in visual and logical reasoning over charts.\",\n  \"paper_link\": \"https://arxiv.org/abs/2203.10244\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.783541+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.783541+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/charxiv-d.json",
    "content": "{\n  \"benchmark_id\": \"charxiv-d\",\n  \"name\": \"CharXiv-D\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"vision\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"CharXiv-D is the descriptive questions subset of the CharXiv benchmark, designed to assess multimodal large language models' ability to extract basic information from scientific charts. It contains descriptive questions covering information extraction, enumeration, pattern recognition, and counting across 2,323 diverse charts from arXiv papers, all curated and verified by human experts.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.18521\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.325204+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.325204+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/charxiv-r.json",
    "content": "{\n  \"benchmark_id\": \"charxiv-r\",\n  \"name\": \"CharXiv-R\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"vision\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"CharXiv-R is the reasoning component of the CharXiv benchmark, focusing on complex reasoning questions that require synthesizing information across visual chart elements. It evaluates multimodal large language models on their ability to understand and reason about scientific charts from arXiv papers through various reasoning tasks.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.18521\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.191553+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.191553+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/chexpert-cxr.json",
    "content": "{\n  \"benchmark_id\": \"chexpert-cxr\",\n  \"name\": \"CheXpert CXR\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"healthcare\", \"vision\"],\n  \"modality\": \"image\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"CheXpert is a large dataset of 224,316 chest radiographs from 65,240 patients for automated chest X-ray interpretation. The dataset includes uncertainty labels for 14 medical observations extracted from radiology reports. It serves as a benchmark for developing and evaluating automated chest radiograph interpretation models.\",\n  \"paper_link\": \"https://arxiv.org/abs/1901.07031\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.021015+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.021015+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/cluewsc.json",
    "content": "{\n  \"benchmark_id\": \"cluewsc\",\n  \"name\": \"CLUEWSC\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"CLUEWSC2020 is the Chinese version of the Winograd Schema Challenge, part of the CLUE benchmark. It focuses on pronoun disambiguation and coreference resolution, requiring models to determine which noun a pronoun refers to in a sentence. The dataset contains 1,244 training samples and 304 development samples extracted from contemporary Chinese literature.\",\n  \"paper_link\": \"https://arxiv.org/abs/2004.05986\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.233189+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.233189+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/cmmlu.json",
    "content": "{\n  \"benchmark_id\": \"cmmlu\",\n  \"name\": \"CMMLU\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"CMMLU (Chinese Massive Multitask Language Understanding) is a comprehensive Chinese benchmark that evaluates the knowledge and reasoning capabilities of large language models across 67 different subject topics. The benchmark covers natural sciences, social sciences, engineering, and humanities with multiple-choice questions ranging from basic to advanced professional levels.\",\n  \"paper_link\": \"https://arxiv.org/abs/2306.09212\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.941108+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.941108+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/cnmo-2024.json",
    "content": "{\n  \"benchmark_id\": \"cnmo-2024\",\n  \"name\": \"CNMO 2024\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"China Mathematical Olympiad 2024 - A challenging mathematics competition.\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/codeforces.json",
    "content": "{\n  \"benchmark_id\": \"codeforces\",\n  \"name\": \"CodeForces\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 3000.0,\n  \"language\": \"en\",\n  \"description\": \"A competitive programming benchmark using problems from the CodeForces platform. The benchmark evaluates code generation capabilities of LLMs on algorithmic problems with difficulty ratings ranging from 800 to 2400. Problems cover diverse algorithmic categories including dynamic programming, graph algorithms, data structures, and mathematical problems with standardized evaluation through direct platform submission.\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.01257\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.624663+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.624663+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/codegolf-v2.2.json",
    "content": "{\n  \"benchmark_id\": \"codegolf-v2.2\",\n  \"name\": \"Codegolf v2.2\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Codegolf v2.2 benchmark\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.778275+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.778275+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/collie.json",
    "content": "{\n  \"benchmark_id\": \"collie\",\n  \"name\": \"COLLIE\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\", \"writing\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"COLLIE is a grammar-based framework for systematic construction of constrained text generation tasks. It allows specification of rich, compositional constraints across diverse generation levels and modeling challenges including language understanding, logical reasoning, and semantic planning. The COLLIE-v1 dataset contains 2,080 instances across 13 constraint structures.\",\n  \"paper_link\": \"https://arxiv.org/abs/2307.08689\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.250323+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.250323+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/common-voice-15.json",
    "content": "{\n  \"benchmark_id\": \"common-voice-15\",\n  \"name\": \"Common Voice 15\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"audio\", \"speech-to-text\", \"language\"],\n  \"modality\": \"audio\",\n  \"multilingual\": true,\n  \"max_score\": 100.0,\n  \"language\": \"en\",\n  \"description\": \"Common Voice is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Version 15.0 contains 28,750 recorded hours across 114 languages, consisting of crowdsourced voice recordings with corresponding transcriptions.\",\n  \"paper_link\": \"https://arxiv.org/abs/1912.06670\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.830793+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.830793+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/commonsenseqa.json",
    "content": "{\n  \"benchmark_id\": \"commonsenseqa\",\n  \"name\": \"CommonSenseQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"CommonSenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict correct answers. It contains 12,102 questions with one correct answer and four distractors, designed to test semantic reasoning and conceptual relationships. Questions are created based on ConceptNet concepts and require prior world knowledge for accurate reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/1811.00937\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.129679+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.129679+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/complexfuncbench.json",
    "content": "{\n  \"benchmark_id\": \"complexfuncbench\",\n  \"name\": \"ComplexFuncBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"ComplexFuncBench is a benchmark designed to evaluate large language models' capabilities in handling complex function calling scenarios. It encompasses multi-step and constrained function calling tasks that require long-parameter filling, parameter value reasoning, and managing contexts up to 128k tokens. The benchmark includes 1,000 samples across five real-world scenarios.\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.10132\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.336577+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.336577+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/covost2-en-zh.json",
    "content": "{\n  \"benchmark_id\": \"covost2-en-zh\",\n  \"name\": \"CoVoST2 en-zh\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"audio\", \"speech-to-text\", \"language\"],\n  \"modality\": \"audio\",\n  \"multilingual\": true,\n  \"max_score\": 100.0,\n  \"language\": \"en\",\n  \"description\": \"CoVoST 2 English-to-Chinese subset is part of the large-scale multilingual speech translation corpus derived from Common Voice. This subset focuses specifically on English to Chinese speech translation tasks within the broader CoVoST 2 dataset.\",\n  \"paper_link\": \"https://arxiv.org/abs/2007.10310\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.825578+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.825578+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/covost2.json",
    "content": "{\n  \"benchmark_id\": \"covost2\",\n  \"name\": \"CoVoST2\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"audio\", \"speech-to-text\", \"language\"],\n  \"modality\": \"audio\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"CoVoST 2 is a large-scale multilingual speech translation corpus derived from Common Voice, covering translations from 21 languages into English and from English into 15 languages. The dataset contains 2,880 hours of speech with 78K speakers for speech translation research.\",\n  \"paper_link\": \"https://arxiv.org/abs/2007.10310\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.958237+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.958237+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/crag.json",
    "content": "{\n  \"benchmark_id\": \"crag\",\n  \"name\": \"CRAG\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"search\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"CRAG (Comprehensive RAG Benchmark) is a factual question answering benchmark consisting of 4,409 question-answer pairs across 5 domains (finance, sports, music, movie, open domain) and 8 question categories. The benchmark includes mock APIs to simulate web and Knowledge Graph search, designed to represent the diverse and dynamic nature of real-world QA tasks with temporal dynamism ranging from years to seconds. It evaluates retrieval-augmented generation systems for trustworthy question answering.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.04744\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.741280+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.741280+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/creative-writing-v3.json",
    "content": "{\n  \"benchmark_id\": \"creative-writing-v3\",\n  \"name\": \"Creative Writing v3\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"creativity\", \"writing\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"EQ-Bench Creative Writing v3 is an LLM-judged creative writing benchmark that evaluates models across 32 writing prompts with 3 iterations per prompt. Uses a hybrid scoring system combining rubric assessment and Elo ratings through pairwise comparisons. Challenges models in areas like humor, romance, spatial awareness, and unique perspectives to assess emotional intelligence and creative writing capabilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2312.06281\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-08-03T22:06:11.157942+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:11.157942+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/crperelation.json",
    "content": "{\n  \"benchmark_id\": \"crperelation\",\n  \"name\": \"CRPErelation\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"healthcare\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Clinical reasoning problems evaluation benchmark for assessing diagnostic reasoning and medical knowledge application capabilities.\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.834739+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.834739+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/crux-o.json",
    "content": "{\n  \"benchmark_id\": \"crux-o\",\n  \"name\": \"CRUX-O\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 100.0,\n  \"language\": \"en\",\n  \"description\": \"CRUXEval-O (output prediction) is part of the CRUXEval benchmark consisting of 800 Python functions (3-13 lines) designed to evaluate AI models' capabilities in code reasoning, understanding, and execution. The benchmark tests models' ability to predict correct function outputs given function code and inputs, focusing on short problems that a good human programmer should be able to solve in a minute.\",\n  \"paper_link\": \"https://arxiv.org/abs/2401.03065\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.635245+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.635245+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/cruxeval-input-cot.json",
    "content": "{\n  \"benchmark_id\": \"cruxeval-input-cot\",\n  \"name\": \"CRUXEval-Input-CoT\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"CRUXEval input prediction task with Chain of Thought (CoT) prompting. Part of the CRUXEval benchmark for code reasoning, understanding, and execution evaluation. Given a Python function and its expected output, the task is to predict the appropriate input using chain-of-thought reasoning. Consists of 800 Python functions (3-13 lines) designed to evaluate code comprehension and reasoning capabilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2401.03065\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.551746+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.551746+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/cruxeval-o.json",
    "content": "{\n  \"benchmark_id\": \"cruxeval-o\",\n  \"name\": \"CruxEval-O\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"CruxEval-O is the output prediction task of the CRUXEval benchmark, designed to evaluate code reasoning, understanding, and execution capabilities. It consists of 800 Python functions (3-13 lines) where models must predict the output given a function and input. The benchmark tests fundamental code execution reasoning abilities and goes beyond simple code generation to assess deeper understanding of program behavior.\",\n  \"paper_link\": \"https://arxiv.org/abs/2401.03065\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.146592+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.146592+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/cruxeval-output-cot.json",
    "content": "{\n  \"benchmark_id\": \"cruxeval-output-cot\",\n  \"name\": \"CRUXEval-Output-CoT\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"CRUXEval-O (output prediction) with Chain-of-Thought prompting. Part of the CRUXEval benchmark consisting of 800 Python functions (3-13 lines) designed to evaluate code reasoning, understanding, and execution capabilities. The output prediction task requires models to predict the output of a given Python function with specific inputs, evaluated using chain-of-thought reasoning methodology.\",\n  \"paper_link\": \"https://arxiv.org/abs/2401.03065\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.555432+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.555432+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/csimpleqa.json",
    "content": "{\n  \"benchmark_id\": \"csimpleqa\",\n  \"name\": \"CSimpleQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Chinese SimpleQA is the first comprehensive Chinese benchmark to evaluate the factuality ability of language models to answer short questions. It contains 3,000 high-quality questions spanning 6 major topics with 99 diverse subtopics, designed to assess Chinese factual knowledge across humanities, science, engineering, culture, and society.\",\n  \"paper_link\": \"https://arxiv.org/abs/2411.07140\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.931358+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.931358+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/cybersecurity-ctfs.json",
    "content": "{\n  \"benchmark_id\": \"cybersecurity-ctfs\",\n  \"name\": \"Cybersecurity CTFs\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"safety\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Cybersecurity Capture the Flag (CTF) benchmark for evaluating LLMs in offensive security challenges. Contains diverse cybersecurity tasks including cryptography, web exploitation, binary analysis, and forensics to assess AI capabilities in cybersecurity problem-solving.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.05590\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.387055+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.387055+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/dermmcqa.json",
    "content": "{\n  \"benchmark_id\": \"dermmcqa\",\n  \"name\": \"DermMCQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"healthcare\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Dermatology multiple choice question assessment benchmark for evaluating medical knowledge and diagnostic reasoning in dermatological conditions and treatments.\",\n  \"paper_link\": \"https://arxiv.org/abs/2309.06961\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.024498+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.024498+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/docvqa.json",
    "content": "{\n  \"benchmark_id\": \"docvqa\",\n  \"name\": \"DocVQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A dataset for Visual Question Answering on document images containing 50,000 questions defined on 12,000+ document images. The benchmark tests AI's ability to understand document structure and content, requiring models to comprehend document layout and perform information retrieval to answer questions about document images.\",\n  \"paper_link\": \"https://arxiv.org/abs/2007.00398\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.825214+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.825214+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/docvqatest.json",
    "content": "{\n  \"benchmark_id\": \"docvqatest\",\n  \"name\": \"DocVQAtest\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"DocVQA is a Visual Question Answering benchmark on document images containing 50,000 questions defined on 12,000+ document images. The benchmark focuses on understanding document structure and content to answer questions about various document types including letters, memos, notes, and reports from the UCSF Industry Documents Library.\",\n  \"paper_link\": \"https://arxiv.org/abs/2007.00398\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.579372+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.579372+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/drop.json",
    "content": "{\n  \"benchmark_id\": \"drop\",\n  \"name\": \"DROP\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"math\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark requiring discrete reasoning over paragraph content. It contains crowdsourced, adversarially-created questions that require resolving references and performing discrete operations like addition, counting, or sorting, demanding comprehensive paragraph understanding beyond paraphrase-and-entity-typing shortcuts.\",\n  \"paper_link\": \"https://arxiv.org/abs/1903.00161\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.981569+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.981569+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/ds-arena-code.json",
    "content": "{\n  \"benchmark_id\": \"ds-arena-code\",\n  \"name\": \"DS-Arena-Code\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Data Science Arena Code benchmark for evaluating LLMs on realistic data science code generation tasks. Tests capabilities in complex data processing, analysis, and programming across popular Python libraries used in data science workflows.\",\n  \"paper_link\": \"https://arxiv.org/abs/2505.15621\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.057744+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.057744+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/ds-fim-eval.json",
    "content": "{\n  \"benchmark_id\": \"ds-fim-eval\",\n  \"name\": \"DS-FIM-Eval\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"DeepSeek's internal Fill-in-the-Middle evaluation dataset for measuring code completion performance improvements in data science contexts\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.11931\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.053854+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.053854+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/eclektic.json",
    "content": "{\n  \"benchmark_id\": \"eclektic\",\n  \"name\": \"ECLeKTic\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A multilingual closed-book question answering dataset that evaluates cross-lingual knowledge transfer in large language models across 12 languages, using knowledge-seeking questions based on Wikipedia articles that exist only in one language\",\n  \"paper_link\": \"https://arxiv.org/abs/2502.21228\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.561292+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.561292+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/egoschema.json",
    "content": "{\n  \"benchmark_id\": \"egoschema\",\n  \"name\": \"EgoSchema\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"reasoning\", \"long_context\"],\n  \"modality\": \"video\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A diagnostic benchmark for very long-form video language understanding consisting of over 5000 human curated multiple choice questions based on 3-minute video clips from Ego4D, covering a broad range of natural human activities and behaviors\",\n  \"paper_link\": \"https://arxiv.org/abs/2308.09126\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.915240+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.915240+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/erqa.json",
    "content": "{\n  \"benchmark_id\": \"erqa\",\n  \"name\": \"ERQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"reasoning\", \"spatial_reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Embodied Reasoning Question Answering benchmark consisting of 400 multiple-choice visual questions across spatial reasoning, trajectory reasoning, action reasoning, state estimation, and multi-view reasoning for evaluating AI capabilities in physical world interactions\",\n  \"paper_link\": \"https://arxiv.org/abs/2503.20020\",\n  \"implementation_link\": \"https://github.com/embodiedreasoning/ERQA\",\n  \"verified\": false,\n  \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/evalplus.json",
    "content": "{\n  \"benchmark_id\": \"evalplus\",\n  \"name\": \"EvalPlus\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 100.0,\n  \"language\": \"en\",\n  \"description\": \"A rigorous code synthesis evaluation framework that augments existing datasets with extensive test cases generated by LLM and mutation-based strategies to better assess functional correctness of generated code, including HumanEval+ with 80x more test cases\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.01210\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.793176+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.793176+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/facts-grounding.json",
    "content": "{\n  \"benchmark_id\": \"facts-grounding\",\n  \"name\": \"FACTS Grounding\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A benchmark evaluating language models' ability to generate factually accurate and well-grounded responses based on long-form input context, comprising 1,719 examples with documents up to 32k tokens requiring detailed responses that are fully grounded in provided documents\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.03200\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.260285+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.260285+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/factscore.json",
    "content": "{\n  \"benchmark_id\": \"factscore\",\n  \"name\": \"FActScore\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A fine-grained atomic evaluation metric for factual precision in long-form text generation that breaks generated text into atomic facts and computes the percentage supported by reliable knowledge sources, with automated assessment using retrieval and language models\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.14251\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/finqa.json",
    "content": "{\n  \"benchmark_id\": \"finqa\",\n  \"name\": \"FinQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"finance\", \"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A large-scale dataset for numerical reasoning over financial data with question-answering pairs written by financial experts, featuring complex numerical reasoning and understanding of heterogeneous representations with annotated gold reasoning programs for full explainability\",\n  \"paper_link\": \"https://arxiv.org/abs/2109.00122\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.734486+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.734486+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/flenqa.json",
    "content": "{\n  \"benchmark_id\": \"flenqa\",\n  \"name\": \"FlenQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"long_context\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Flexible Length Question Answering dataset for evaluating the impact of input length on reasoning performance of language models, featuring True/False questions embedded in contexts of varying lengths (250-3000 tokens) across three reasoning tasks: Monotone Relations, People In Rooms, and simplified Ruletaker\",\n  \"paper_link\": \"https://arxiv.org/abs/2402.14848\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.277205+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.277205+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/fleurs.json",
    "content": "{\n  \"benchmark_id\": \"fleurs\",\n  \"name\": \"FLEURS\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"speech-to-text\"],\n  \"modality\": \"audio\",\n  \"multilingual\": true,\n  \"max_score\": 100.0,\n  \"language\": \"en\",\n  \"description\": \"Few-shot Learning Evaluation of Universal Representations of Speech - a parallel speech dataset in 102 languages built on FLoRes-101 with approximately 12 hours of speech supervision per language for tasks including ASR, speech language identification, translation and retrieval\",\n  \"paper_link\": \"https://arxiv.org/abs/2205.12446\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.943695+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.943695+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/frames.json",
    "content": "{\n  \"benchmark_id\": \"frames\",\n  \"name\": \"FRAMES\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"search\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Factuality, Retrieval, And reasoning MEasurement Set - a unified evaluation dataset of 824 challenging multi-hop questions for testing retrieval-augmented generation systems across factuality, retrieval accuracy, and reasoning capabilities, requiring integration of 2-15 Wikipedia articles per question\",\n  \"paper_link\": \"https://arxiv.org/abs/2409.12941\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.954436+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.954436+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/french-mmlu.json",
    "content": "{\n  \"benchmark_id\": \"french-mmlu\",\n  \"name\": \"French MMLU\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"language\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"French version of MMLU-Pro, a multilingual benchmark for evaluating language models' cross-lingual reasoning capabilities across 14 diverse domains including mathematics, physics, chemistry, law, engineering, psychology, and health.\",\n  \"paper_link\": \"https://arxiv.org/abs/2503.10497\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.134340+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.134340+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/frontiermath.json",
    "content": "{\n  \"benchmark_id\": \"frontiermath\",\n  \"name\": \"FrontierMath\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians, covering most major branches of modern mathematics from number theory and real analysis to algebraic geometry and category theory.\",\n  \"paper_link\": \"https://arxiv.org/abs/2411.04872\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.179213+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.179213+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/functionalmath.json",
    "content": "{\n  \"benchmark_id\": \"functionalmath\",\n  \"name\": \"FunctionalMATH\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A functional variant of the MATH benchmark that tests language models' ability to generalize reasoning patterns across different problem instances, revealing the reasoning gap between static and functional performance.\",\n  \"paper_link\": \"https://arxiv.org/abs/2402.19450\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.987516+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.987516+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/giantsteps-tempo.json",
    "content": "{\n  \"benchmark_id\": \"giantsteps-tempo\",\n  \"name\": \"GiantSteps Tempo\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"audio\"],\n  \"modality\": \"audio\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A dataset for tempo estimation in electronic dance music containing 664 2-minute audio previews from Beatport, annotated from user corrections for evaluating automatic tempo estimation algorithms.\",\n  \"paper_link\": \"https://archives.ismir.net/ismir2015/paper/000246.pdf\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.838584+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.838584+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/global-mmlu-lite.json",
    "content": "{\n  \"benchmark_id\": \"global-mmlu-lite\",\n  \"name\": \"Global-MMLU-Lite\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"language\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A lightweight version of Global MMLU benchmark that evaluates language models across multiple languages while addressing cultural and linguistic biases in multilingual evaluation.\",\n  \"paper_link\": \"https://arxiv.org/abs/2412.03304\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.534515+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.534515+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/global-mmlu.json",
    "content": "{\n  \"benchmark_id\": \"global-mmlu\",\n  \"name\": \"Global-MMLU\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"language\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive multilingual benchmark covering 42 languages that addresses cultural and linguistic biases in evaluation, with improved translation quality and culturally sensitive question subsets.\",\n  \"paper_link\": \"https://arxiv.org/abs/2412.03304\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.747524+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.747524+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/gorilla-benchmark-api-bench.json",
    "content": "{\n  \"benchmark_id\": \"gorilla-benchmark-api-bench\",\n  \"name\": \"Gorilla Benchmark API Bench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"APIBench, a comprehensive dataset of over 11,000 instruction-API pairs from HuggingFace, TorchHub, and TensorHub APIs for evaluating language models' ability to generate accurate API calls.\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.15334\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.383584+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.383584+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/govreport.json",
    "content": "{\n  \"benchmark_id\": \"govreport\",\n  \"name\": \"GovReport\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"summarization\", \"long_context\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A long document summarization dataset consisting of reports from government research agencies including Congressional Research Service and U.S. Government Accountability Office, with significantly longer documents and summaries than other datasets.\",\n  \"paper_link\": \"https://arxiv.org/abs/2104.02112\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.218809+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.218809+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/gpqa-biology.json",
    "content": "{\n  \"benchmark_id\": \"gpqa-biology\",\n  \"name\": \"GPQA Biology\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Biology subset of GPQA, containing challenging multiple-choice questions written by domain experts in biology. These Google-proof questions require graduate-level knowledge and reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.12022\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.391187+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.391187+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/gpqa-chemistry.json",
    "content": "{\n  \"benchmark_id\": \"gpqa-chemistry\",\n  \"name\": \"GPQA Chemistry\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"chemistry\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Chemistry subset of GPQA, containing challenging multiple-choice questions written by domain experts in chemistry. These Google-proof questions require graduate-level knowledge and reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.12022\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.395806+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.395806+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/gpqa-physics.json",
    "content": "{\n  \"benchmark_id\": \"gpqa-physics\",\n  \"name\": \"GPQA Physics\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"physics\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Physics subset of GPQA, containing challenging multiple-choice questions written by domain experts in physics. These Google-proof questions require graduate-level knowledge and reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.12022\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.400663+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.400663+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/gpqa.json",
    "content": "{\n  \"benchmark_id\": \"gpqa\",\n  \"name\": \"GPQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, with PhD experts reaching 65% accuracy.\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.12022\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.588605+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.588605+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/graphwalks-bfs-%3C128k.json",
    "content": "{\n  \"benchmark_id\": \"graphwalks-bfs-<128k\",\n  \"name\": \"Graphwalks BFS <128k\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"spatial_reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A graph reasoning benchmark that evaluates language models' ability to perform breadth-first search (BFS) operations on graphs with context length under 128k tokens, returning nodes reachable at specified depths.\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.287324+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.287324+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/graphwalks-bfs-%3E128k.json",
    "content": "{\n  \"benchmark_id\": \"graphwalks-bfs->128k\",\n  \"name\": \"Graphwalks BFS >128k\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"spatial_reasoning\", \"long_context\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A graph reasoning benchmark that evaluates language models' ability to perform breadth-first search (BFS) operations on graphs with context length over 128k tokens, testing long-context reasoning capabilities.\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.295876+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.295876+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/graphwalks-parents-%3C128k.json",
    "content": "{\n  \"benchmark_id\": \"graphwalks-parents-<128k\",\n  \"name\": \"Graphwalks parents <128k\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"spatial_reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A graph reasoning benchmark that evaluates language models' ability to find parent nodes in graphs with context length under 128k tokens, requiring understanding of graph structure and edge relationships.\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.303643+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.303643+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/graphwalks-parents-%3E128k.json",
    "content": "{\n  \"benchmark_id\": \"graphwalks-parents->128k\",\n  \"name\": \"Graphwalks parents >128k\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"spatial_reasoning\", \"long_context\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A graph reasoning benchmark that evaluates language models' ability to find parent nodes in graphs with context length over 128k tokens, testing long-context reasoning and graph structure understanding.\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.316836+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.316836+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/groundui-1k.json",
    "content": "{\n  \"benchmark_id\": \"groundui-1k\",\n  \"name\": \"GroundUI-1K\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"vision\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A subset of GroundUI-18K for UI grounding evaluation, where models must predict action coordinates on screenshots based on single-step instructions across web, desktop, and mobile platforms.\",\n  \"paper_link\": \"https://arxiv.org/abs/2403.17918\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.758595+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.758595+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/gsm-8k-(cot).json",
    "content": "{\n  \"benchmark_id\": \"gsm-8k-(cot)\",\n  \"name\": \"GSM-8K (CoT)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Grade School Math 8K with Chain-of-Thought prompting, featuring 8.5K high-quality linguistically diverse grade school math word problems requiring multi-step reasoning and elementary arithmetic operations.\",\n  \"paper_link\": \"https://arxiv.org/abs/2110.14168\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.360381+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.360381+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/gsm8k-chat.json",
    "content": "{\n  \"benchmark_id\": \"gsm8k-chat\",\n  \"name\": \"GSM8K Chat\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Grade School Math 8K adapted for chat format evaluation, featuring 8.5K high-quality linguistically diverse grade school math word problems requiring multi-step reasoning and elementary arithmetic operations.\",\n  \"paper_link\": \"https://arxiv.org/abs/2110.14168\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.101578+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.101578+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/gsm8k.json",
    "content": "{\n  \"benchmark_id\": \"gsm8k\",\n  \"name\": \"GSM8k\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Grade School Math 8K, a dataset of 8.5K high-quality linguistically diverse grade school math word problems requiring multi-step reasoning and elementary arithmetic operations.\",\n  \"paper_link\": \"https://arxiv.org/abs/2110.14168\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.397385+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.397385+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/hallusion-bench.json",
    "content": "{\n  \"benchmark_id\": \"hallusion-bench\",\n  \"name\": \"Hallusion Bench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive benchmark designed to evaluate image-context reasoning in large visual-language models (LVLMs) by challenging models with 346 images and 1,129 carefully crafted questions to assess language hallucination and visual illusion\",\n  \"paper_link\": \"https://arxiv.org/abs/2310.14566\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.689507+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.689507+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/healthbench-hard.json",
    "content": "{\n  \"benchmark_id\": \"healthbench-hard\",\n  \"name\": \"HealthBench Hard\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"healthcare\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A challenging variation of HealthBench that evaluates large language models' performance and safety in healthcare through 5,000 multi-turn conversations with particularly rigorous evaluation criteria validated by 262 physicians from 60 countries\",\n  \"paper_link\": \"https://arxiv.org/abs/2505.08775\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-08-05T19:56:13.424873+00:00\",\n  \"updated_at\": \"2025-08-05T19:56:13.424873+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/healthbench.json",
    "content": "{\n  \"benchmark_id\": \"healthbench\",\n  \"name\": \"HealthBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"healthcare\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"An open-source benchmark for measuring performance and safety of large language models in healthcare, consisting of 5,000 multi-turn conversations evaluated by 262 physicians using 48,562 unique rubric criteria across health contexts and behavioral dimensions\",\n  \"paper_link\": \"https://arxiv.org/abs/2505.08775\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-08-05T19:56:13.424873+00:00\",\n  \"updated_at\": \"2025-08-05T19:56:13.424873+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/hellaswag.json",
    "content": "{\n  \"benchmark_id\": \"hellaswag\",\n  \"name\": \"HellaSwag\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A challenging commonsense natural language inference dataset that uses Adversarial Filtering to create questions trivial for humans (>95% accuracy) but difficult for state-of-the-art models, requiring completion of sentence endings based on physical situations and everyday activities\",\n  \"paper_link\": \"https://arxiv.org/abs/1905.07830\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.145630+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.145630+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/hiddenmath.json",
    "content": "{\n  \"benchmark_id\": \"hiddenmath\",\n  \"name\": \"HiddenMath\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Google DeepMind's internal mathematical reasoning benchmark that introduces novel problems not encountered during model training to evaluate true mathematical reasoning capabilities rather than memorization\",\n  \"paper_link\": \"https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.424873+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.424873+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/hle.json",
    "content": "{\n  \"benchmark_id\": \"hle\",\n  \"name\": \"HLE\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"math\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, humanities, and natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.14249\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/hmmt-2025.json",
    "content": "{\n  \"benchmark_id\": \"hmmt-2025\",\n  \"name\": \"HMMT 2025\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Harvard-MIT Mathematics Tournament 2025 - A prestigious student-organized mathematics competition for high school students featuring two tournaments (November 2025 at MIT and February 2026 at Harvard) with individual tests, team rounds, and guts rounds\",\n  \"paper_link\": \"http://web.mit.edu/HMMT/www/\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/hmmt25.json",
    "content": "{\n  \"benchmark_id\": \"hmmt25\",\n  \"name\": \"HMMT25\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Harvard-MIT Mathematics Tournament 2025 - A prestigious student-organized mathematics competition for high school students featuring two tournaments (November 2025 at MIT and February 2026 at Harvard) with individual tests, team rounds, and guts rounds\",\n  \"paper_link\": \"http://web.mit.edu/HMMT/www/\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.061281+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.061281+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/humaneval+.json",
    "content": "{\n  \"benchmark_id\": \"humaneval+\",\n  \"name\": \"HumanEval+\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Enhanced version of HumanEval that extends the original test cases by 80x using EvalPlus framework for rigorous evaluation of LLM-synthesized code functional correctness, detecting previously undetected wrong code\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.01210\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.062352+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.062352+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/humaneval-average.json",
    "content": "{\n  \"benchmark_id\": \"humaneval-average\",\n  \"name\": \"HumanEval-Average\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A variant of the HumanEval benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics\",\n  \"paper_link\": \"https://arxiv.org/abs/2107.03374\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.171175+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.171175+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/humaneval-er.json",
    "content": "{\n  \"benchmark_id\": \"humaneval-er\",\n  \"name\": \"HumanEval-ER\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A variant of the HumanEval benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics\",\n  \"paper_link\": \"https://arxiv.org/abs/2107.03374\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.704744+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.704744+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/humaneval-mul.json",
    "content": "{\n  \"benchmark_id\": \"humaneval-mul\",\n  \"name\": \"HumanEval-Mul\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A multilingual variant of the HumanEval benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics\",\n  \"paper_link\": \"https://arxiv.org/abs/2107.03374\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.032472+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.032472+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/humaneval-plus.json",
    "content": "{\n  \"benchmark_id\": \"humaneval-plus\",\n  \"name\": \"HumanEval Plus\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Enhanced version of HumanEval that extends the original test cases by 80x using EvalPlus framework for rigorous evaluation of LLM-synthesized code functional correctness, detecting previously undetected wrong code\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.01210\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-08-03T22:06:10.921756+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:10.921756+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/humaneval.json",
    "content": "{\n  \"benchmark_id\": \"humaneval\",\n  \"name\": \"HumanEval\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics\",\n  \"paper_link\": \"https://arxiv.org/abs/2107.03374\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.595263+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.595263+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/humanevalfim-average.json",
    "content": "{\n  \"benchmark_id\": \"humanevalfim-average\",\n  \"name\": \"HumanEvalFIM-Average\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Average evaluation of HumanEval Fill-in-the-Middle benchmark variants (single-line, multi-line, random-span) for assessing code infilling capabilities of language models\",\n  \"paper_link\": \"https://arxiv.org/abs/2207.14255\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.160562+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.160562+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/humanity's-last-exam.json",
    "content": "{\n  \"benchmark_id\": \"humanity's-last-exam\",\n  \"name\": \"Humanity's Last Exam\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A multi-modal benchmark at the frontier of human knowledge with 2,500 questions across dozens of subjects including mathematics, humanities, and natural sciences, created by nearly 1000 subject expert contributors from over 500 institutions\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.14249\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.507693+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.507693+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/if.json",
    "content": "{\n  \"benchmark_id\": \"if\",\n  \"name\": \"IF\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Instruction-Following Evaluation (IFEval) benchmark for large language models, focusing on verifiable instructions with 25 types of instructions and around 500 prompts containing one or more verifiable constraints\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.07911\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-08-03T22:06:11.089394+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:11.089394+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/ifeval.json",
    "content": "{\n  \"benchmark_id\": \"ifeval\",\n  \"name\": \"IFEval\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Instruction-Following Evaluation (IFEval) benchmark for large language models, focusing on verifiable instructions with 25 types of instructions and around 500 prompts containing one or more verifiable constraints\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.07911\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.241350+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.241350+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/include.json",
    "content": "{\n  \"benchmark_id\": \"include\",\n  \"name\": \"Include\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Include benchmark - specific documentation not found in official sources\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.724387+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.724387+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/infinitebench-en.mc.json",
    "content": "{\n  \"benchmark_id\": \"infinitebench-en.mc\",\n  \"name\": \"InfiniteBench/En.MC\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"InfiniteBench English Multiple Choice variant - first LLM benchmark featuring average data length surpassing 100K tokens for evaluating long-context capabilities with 12 tasks spanning diverse domains\",\n  \"paper_link\": \"https://arxiv.org/abs/2402.13718\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.461508+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.461508+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/infinitebench-en.qa.json",
    "content": "{\n  \"benchmark_id\": \"infinitebench-en.qa\",\n  \"name\": \"InfiniteBench/En.QA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"InfiniteBench English Question Answering variant - first LLM benchmark featuring average data length surpassing 100K tokens for evaluating long-context capabilities with 12 tasks spanning diverse domains\",\n  \"paper_link\": \"https://arxiv.org/abs/2402.13718\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.457927+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.457927+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/infographicsqa.json",
    "content": "{\n  \"benchmark_id\": \"infographicsqa\",\n  \"name\": \"InfographicsQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"InfographicVQA dataset with 5,485 infographic images and over 30,000 questions requiring joint reasoning over document layout, textual content, graphical elements, and data visualizations with elementary reasoning and arithmetic skills\",\n  \"paper_link\": \"https://arxiv.org/abs/2104.12756\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.417669+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.417669+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/infovqa.json",
    "content": "{\n  \"benchmark_id\": \"infovqa\",\n  \"name\": \"InfoVQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"InfoVQA dataset with 30,000 questions and 5,000 infographic images requiring joint reasoning over document layout, textual content, graphical elements, and data visualizations with elementary reasoning and arithmetic skills\",\n  \"paper_link\": \"https://arxiv.org/abs/2104.12756\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.601294+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.601294+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/infovqatest.json",
    "content": "{\n  \"benchmark_id\": \"infovqatest\",\n  \"name\": \"InfoVQAtest\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"InfoVQA test set with infographic images requiring joint reasoning over document layout, textual content, graphical elements, and data visualizations with elementary reasoning and arithmetic skills\",\n  \"paper_link\": \"https://arxiv.org/abs/2104.12756\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.583939+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.583939+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/instruct-humaneval.json",
    "content": "{\n  \"benchmark_id\": \"instruct-humaneval\",\n  \"name\": \"Instruct HumanEval\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Instruction-based variant of HumanEval benchmark for evaluating large language models' code generation capabilities with functional correctness using pass@k metric on programming problems\",\n  \"paper_link\": \"https://arxiv.org/abs/2107.03374\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.105488+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.105488+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/intergps.json",
    "content": "{\n  \"benchmark_id\": \"intergps\",\n  \"name\": \"InterGPS\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"spatial_reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Interpretable Geometry Problem Solver (Inter-GPS) with Geometry3K dataset of 3,002 geometry problems with dense annotation in formal language using theorem knowledge and symbolic reasoning\",\n  \"paper_link\": \"https://arxiv.org/abs/2105.04165\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.259321+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.259321+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/internal-api-instruction-following-(hard).json",
    "content": "{\n  \"benchmark_id\": \"internal-api-instruction-following-(hard)\",\n  \"name\": \"Internal API instruction following (hard)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Internal API instruction following (hard) benchmark - specific documentation not found in official sources\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.222560+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.222560+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/lbpp-(v2).json",
    "content": "{\n  \"benchmark_id\": \"lbpp-(v2)\",\n  \"name\": \"LBPP (v2)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LBPP (v2) benchmark - specific documentation not found in official sources, possibly related to language-based planning problems\",\n  \"paper_link\": \"https://arxiv.org/abs/2206.10498\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.053535+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.053535+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/livebench-20241125.json",
    "content": "{\n  \"benchmark_id\": \"livebench-20241125\",\n  \"name\": \"LiveBench 20241125\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LiveBench is a challenging, contamination-limited LLM benchmark that addresses test set contamination by releasing new questions monthly based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. It comprises tasks across math, coding, reasoning, language, instruction following, and data analysis with verifiable, objective ground-truth answers.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.19314\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-08-03T22:06:11.046321+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:11.046321+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/livebench.json",
    "content": "{\n  \"benchmark_id\": \"livebench\",\n  \"name\": \"LiveBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LiveBench is a challenging, contamination-limited LLM benchmark that addresses test set contamination by releasing new questions monthly based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. It comprises tasks across math, coding, reasoning, language, instruction following, and data analysis with verifiable, objective ground-truth answers.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.19314\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/livecodebench(01-09).json",
    "content": "{\n  \"benchmark_id\": \"livecodebench(01-09)\",\n  \"name\": \"LiveCodeBench(01-09)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.\",\n  \"paper_link\": \"https://arxiv.org/abs/2403.07974\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.049594+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.049594+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/livecodebench-v5-24.12-25.2.json",
    "content": "{\n  \"benchmark_id\": \"livecodebench-v5-24.12-25.2\",\n  \"name\": \"LiveCodeBench v5 24.12-25.2\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.\",\n  \"paper_link\": \"https://arxiv.org/abs/2403.07974\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.066180+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.066180+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/livecodebench-v5.json",
    "content": "{\n  \"benchmark_id\": \"livecodebench-v5\",\n  \"name\": \"LiveCodeBench v5\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.\",\n  \"paper_link\": \"https://arxiv.org/abs/2403.07974\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.759330+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.759330+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/livecodebench-v6.json",
    "content": "{\n  \"benchmark_id\": \"livecodebench-v6\",\n  \"name\": \"LiveCodeBench v6\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.\",\n  \"paper_link\": \"https://arxiv.org/abs/2403.07974\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.785682+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.785682+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/livecodebench.json",
    "content": "{\n  \"benchmark_id\": \"livecodebench\",\n  \"name\": \"LiveCodeBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.\",\n  \"paper_link\": \"https://arxiv.org/abs/2403.07974\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.292229+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.292229+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/longbench-v2.json",
    "content": "{\n  \"benchmark_id\": \"longbench-v2\",\n  \"name\": \"LongBench v2\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\", \"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.\",\n  \"paper_link\": \"https://arxiv.org/abs/2412.15204\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.029281+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.029281+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/longfact-concepts.json",
    "content": "{\n  \"benchmark_id\": \"longfact-concepts\",\n  \"name\": \"LongFact Concepts\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LongFact is a benchmark for evaluating long-form factuality in large language models. It comprises 2,280 fact-seeking prompts spanning 38 topics, designed to test a model's ability to generate accurate, long-form responses. The benchmark uses SAFE (Search-Augmented Factuality Evaluator) to evaluate factual accuracy.\",\n  \"paper_link\": \"https://arxiv.org/abs/2403.18802\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/longfact-objects.json",
    "content": "{\n  \"benchmark_id\": \"longfact-objects\",\n  \"name\": \"LongFact Objects\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LongFact is a benchmark for evaluating long-form factuality in large language models. It comprises 2,280 fact-seeking prompts spanning 38 topics, designed to test a model's ability to generate accurate, long-form responses. The benchmark uses SAFE (Search-Augmented Factuality Evaluator) to evaluate factual accuracy.\",\n  \"paper_link\": \"https://arxiv.org/abs/2403.18802\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/longvideobench.json",
    "content": "{\n  \"benchmark_id\": \"longvideobench\",\n  \"name\": \"LongVideoBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"long_context\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LongVideoBench is a question-answering benchmark featuring video-language interleaved inputs up to an hour long. It includes 3,763 varying-length web-collected videos with subtitles across diverse themes and 6,678 human-annotated multiple-choice questions in 17 fine-grained categories for comprehensive evaluation of long-term multimodal understanding.\",\n  \"paper_link\": \"https://arxiv.org/abs/2407.15754\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.730349+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.730349+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/lsat.json",
    "content": "{\n  \"benchmark_id\": \"lsat\",\n  \"name\": \"LSAT\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"legal\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LSAT (Law School Admission Test) benchmark evaluating complex reasoning capabilities across three challenging tasks: analytical reasoning, logical reasoning, and reading comprehension. The LSAT measures skills considered essential for success in law school including critical thinking, reading comprehension of complex texts, and analysis of arguments.\",\n  \"paper_link\": \"https://arxiv.org/abs/2108.00648\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.409871+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.409871+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/lvbench.json",
    "content": "{\n  \"benchmark_id\": \"lvbench\",\n  \"name\": \"LVBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"long_context\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"LVBench is an extreme long video understanding benchmark designed to evaluate multimodal models on videos up to two hours in duration. It contains 6 major categories and 21 subcategories, with videos averaging five times longer than existing datasets. The benchmark addresses applications requiring comprehension of extremely long videos.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.08035\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.724041+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.724041+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/math-(cot).json",
    "content": "{\n  \"benchmark_id\": \"math-(cot)\",\n  \"name\": \"MATH (CoT)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MATH dataset contains 12,500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes full step-by-step solutions and spans multiple difficulty levels (1-5) across seven mathematical subjects. This variant uses Chain-of-Thought prompting to encourage step-by-step reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/2103.03874\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.366159+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.366159+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/math-500.json",
    "content": "{\n  \"benchmark_id\": \"math-500\",\n  \"name\": \"MATH-500\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MATH-500 is a subset of the MATH dataset containing 500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes full step-by-step solutions and spans multiple difficulty levels across seven mathematical subjects including Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.\",\n  \"paper_link\": \"https://arxiv.org/abs/2103.03874\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.027850+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.027850+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/math.json",
    "content": "{\n  \"benchmark_id\": \"math\",\n  \"name\": \"MATH\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MATH dataset contains 12,500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes full step-by-step solutions and spans multiple difficulty levels (1-5) across seven mathematical subjects including Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.\",\n  \"paper_link\": \"https://arxiv.org/abs/2103.03874\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.804258+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.804258+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mathvision.json",
    "content": "{\n  \"benchmark_id\": \"mathvision\",\n  \"name\": \"MathVision\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"vision\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MATH-Vision is a dataset designed to measure multimodal mathematical reasoning capabilities. It focuses on evaluating how well models can solve mathematical problems that require both visual understanding and mathematical reasoning, bridging the gap between visual and mathematical domains.\",\n  \"paper_link\": \"https://arxiv.org/abs/2402.14804\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.695583+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.695583+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mathvista-mini.json",
    "content": "{\n  \"benchmark_id\": \"mathvista-mini\",\n  \"name\": \"MathVista-Mini\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"vision\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MathVista-Mini is a smaller version of the MathVista benchmark that evaluates mathematical reasoning in visual contexts. It consists of examples derived from multimodal datasets involving mathematics, combining challenges from diverse mathematical and visual tasks to assess foundation models' ability to solve problems requiring both visual understanding and mathematical reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/2310.02255\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.654470+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.654470+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mathvista.json",
    "content": "{\n  \"benchmark_id\": \"mathvista\",\n  \"name\": \"MathVista\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"vision\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MathVista evaluates mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples derived from 28 existing multimodal datasets and 3 newly created datasets (IQTest, FunctionQA, and PaperQA), combining challenges from diverse mathematical and visual tasks to assess models' ability to understand complex figures and perform rigorous reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/2310.02255\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.069611+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.069611+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mbpp+.json",
    "content": "{\n  \"benchmark_id\": \"mbpp+\",\n  \"name\": \"MBPP+\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MBPP+ is an enhanced version of MBPP (Mostly Basic Python Problems) with significantly more test cases (35x) for more rigorous evaluation. MBPP is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers, covering programming fundamentals and standard library functionality.\",\n  \"paper_link\": \"https://arxiv.org/abs/2108.07732\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.501855+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.501855+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mbpp-++-base-version.json",
    "content": "{\n  \"benchmark_id\": \"mbpp-++-base-version\",\n  \"name\": \"MBPP ++ base version\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem consists of a task description, code solution, and 3 automated test cases covering programming fundamentals and standard library functionality. This is an enhanced version with additional test cases.\",\n  \"paper_link\": \"https://arxiv.org/abs/2108.07732\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.341560+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.341560+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mbpp-evalplus-(base).json",
    "content": "{\n  \"benchmark_id\": \"mbpp-evalplus-(base)\",\n  \"name\": \"MBPP EvalPlus (base)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. EvalPlus extends MBPP with significantly more test cases (35x) for more rigorous evaluation of LLM-synthesized code, providing high-quality and precise evaluation.\",\n  \"paper_link\": \"https://arxiv.org/abs/2108.07732\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.421722+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.421722+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mbpp-evalplus.json",
    "content": "{\n  \"benchmark_id\": \"mbpp-evalplus\",\n  \"name\": \"MBPP EvalPlus\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. EvalPlus extends MBPP with significantly more test cases (35x) for more rigorous evaluation of LLM-synthesized code, providing high-quality and precise evaluation.\",\n  \"paper_link\": \"https://arxiv.org/abs/2108.07732\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.425667+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.425667+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mbpp-pass@1.json",
    "content": "{\n  \"benchmark_id\": \"mbpp-pass@1\",\n  \"name\": \"MBPP pass@1\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem consists of a task description, code solution, and 3 automated test cases. This variant uses pass@1 evaluation metric measuring the percentage of problems solved correctly on the first attempt.\",\n  \"paper_link\": \"https://arxiv.org/abs/2108.07732\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.138778+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.138778+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mbpp-plus.json",
    "content": "{\n  \"benchmark_id\": \"mbpp-plus\",\n  \"name\": \"MBPP Plus\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem consists of a task description, code solution, and 3 automated test cases covering programming fundamentals and standard library functionality. This is an enhanced version with additional test cases for more rigorous evaluation.\",\n  \"paper_link\": \"https://arxiv.org/abs/2108.07732\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-08-03T22:06:11.143382+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:11.143382+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/mbpp.json",
    "content": "{\n  \"benchmark_id\": \"mbpp\",\n  \"name\": \"MBPP\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 100.0,\n  \"language\": \"en\",\n  \"description\": \"MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem consists of a task description, code solution, and 3 automated test cases covering programming fundamentals and standard library functionality.\",\n  \"paper_link\": \"https://arxiv.org/abs/2108.07732\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.453370+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.453370+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/medxpertqa.json",
    "content": "{\n  \"benchmark_id\": \"medxpertqa\",\n  \"name\": \"MedXpertQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"healthcare\", \"reasoning\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning, featuring 4,460 questions spanning 17 specialties and 11 body systems. Includes both text-only and multimodal subsets with expert-level exam questions incorporating diverse medical images and rich clinical information.\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.18362\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.040381+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.040381+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mega-mlqa.json",
    "content": "{\n  \"benchmark_id\": \"mega-mlqa\",\n  \"name\": \"MEGA MLQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MLQA as part of the MEGA (Multilingual Evaluation of Generative AI) benchmark suite. A multi-way aligned extractive QA evaluation benchmark for cross-lingual question answering across 7 languages (English, Arabic, German, Spanish, Hindi, Vietnamese, and Simplified Chinese) with over 12K QA instances in English and 5K in each other language.\",\n  \"paper_link\": \"https://arxiv.org/abs/2303.12528\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.187404+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.187404+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mega-tydi-qa.json",
    "content": "{\n  \"benchmark_id\": \"mega-tydi-qa\",\n  \"name\": \"MEGA TyDi QA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"TyDi QA as part of the MEGA benchmark suite. A question answering dataset covering 11 typologically diverse languages (Arabic, Bengali, English, Finnish, Indonesian, Japanese, Korean, Russian, Swahili, Telugu, and Thai) with 204K question-answer pairs. Features realistic information-seeking questions written by people who want to know the answer but don't know it yet.\",\n  \"paper_link\": \"https://arxiv.org/abs/2003.05002\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.192871+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.192871+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mega-udpos.json",
    "content": "{\n  \"benchmark_id\": \"mega-udpos\",\n  \"name\": \"MEGA UDPOS\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Universal Dependencies POS tagging as part of the MEGA benchmark suite. A multilingual part-of-speech tagging dataset based on Universal Dependencies treebanks, utilizing the universal POS tag set of 17 tags across 38 diverse languages from different language families. Used for evaluating multilingual POS tagging systems.\",\n  \"paper_link\": \"https://arxiv.org/abs/2004.10643\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.198318+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.198318+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mega-xcopa.json",
    "content": "{\n  \"benchmark_id\": \"mega-xcopa\",\n  \"name\": \"MEGA XCOPA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"XCOPA (Cross-lingual Choice of Plausible Alternatives) as part of the MEGA benchmark suite. A typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages, including resource-poor languages like Eastern Apurímac Quechua and Haitian Creole. Requires models to select which choice is the effect or cause of a given premise.\",\n  \"paper_link\": \"https://arxiv.org/abs/2005.00333\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.205296+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.205296+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mega-xstorycloze.json",
    "content": "{\n  \"benchmark_id\": \"mega-xstorycloze\",\n  \"name\": \"MEGA XStoryCloze\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"XStoryCloze as part of the MEGA benchmark suite. A cross-lingual story completion task that consists of professionally translated versions of the English StoryCloze dataset to 10 non-English languages. Requires models to predict the correct ending for a given four-sentence story, evaluating commonsense reasoning and narrative understanding.\",\n  \"paper_link\": \"https://arxiv.org/abs/2303.12528\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.212479+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.212479+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/meld.json",
    "content": "{\n  \"benchmark_id\": \"meld\",\n  \"name\": \"Meld\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"psychology\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MELD (Multimodal EmotionLines Dataset) is a multimodal multi-party dataset for emotion recognition in conversations. Contains approximately 13,000 utterances from 1,433 dialogues extracted from the TV series Friends. Each utterance is annotated with emotion (Anger, Disgust, Sadness, Joy, Neutral, Surprise, Fear) and sentiment labels across audio, visual, and textual modalities.\",\n  \"paper_link\": \"https://arxiv.org/abs/1810.02508\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.842977+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.842977+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mgsm.json",
    "content": "{\n  \"benchmark_id\": \"mgsm\",\n  \"name\": \"MGSM\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MGSM (Multilingual Grade School Math) is a benchmark of grade-school math problems. Contains 250 grade-school math problems manually translated from the GSM8K dataset into ten typologically diverse languages: Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, and Telugu. Evaluates multilingual mathematical reasoning capabilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2210.03057\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.669061+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.669061+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mimic-cxr.json",
    "content": "{\n  \"benchmark_id\": \"mimic-cxr\",\n  \"name\": \"MIMIC CXR\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"healthcare\", \"vision\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MIMIC-CXR is a large publicly available dataset of chest radiographs with free-text radiology reports. Contains 377,110 images corresponding to 227,835 radiographic studies from 65,379 patients at Beth Israel Deaconess Medical Center. The dataset is de-identified and widely used for medical imaging research, automated report generation, and medical AI development.\",\n  \"paper_link\": \"https://arxiv.org/abs/1901.07042\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.017221+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.017221+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mlvu-m.json",
    "content": "{\n  \"benchmark_id\": \"mlvu-m\",\n  \"name\": \"MLVU-M\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MLVU-M benchmark\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.931298+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.931298+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mlvu.json",
    "content": "{\n  \"benchmark_id\": \"mlvu\",\n  \"name\": \"MLVU\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"video\", \"multimodal\", \"long_context\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive benchmark for multi-task long video understanding that evaluates multimodal large language models on videos ranging from 3 minutes to 2 hours across 9 distinct tasks including reasoning, captioning, recognition, and summarization.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.04264\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.755571+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.755571+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mm-if-eval.json",
    "content": "{\n  \"benchmark_id\": \"mm-if-eval\",\n  \"name\": \"MM IF-Eval\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A challenging multimodal instruction-following benchmark that includes both compose-level constraints for output responses and perception-level constraints tied to input images, with comprehensive evaluation pipeline.\",\n  \"paper_link\": \"https://arxiv.org/abs/2504.07957\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.142939+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.142939+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mm-mind2web.json",
    "content": "{\n  \"benchmark_id\": \"mm-mind2web\",\n  \"name\": \"MM-Mind2Web\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"frontend_development\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A multimodal web navigation benchmark comprising 2,000 open-ended tasks spanning 137 websites across 31 domains. Each task includes HTML documents paired with webpage screenshots, action sequences, and complex web interactions.\",\n  \"paper_link\": \"https://arxiv.org/abs/2306.06070\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.753488+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.753488+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mm-mt-bench.json",
    "content": "{\n  \"benchmark_id\": \"mm-mt-bench\",\n  \"name\": \"MM-MT-Bench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"communication\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 100.0,\n  \"language\": \"en\",\n  \"description\": \"A multi-turn LLM-as-a-judge evaluation benchmark for testing multimodal instruction-tuned models' ability to follow user instructions in multi-turn dialogues and answer open-ended questions in a zero-shot manner.\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.880812+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.880812+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmau-music.json",
    "content": "{\n  \"benchmark_id\": \"mmau-music\",\n  \"name\": \"MMAU Music\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"audio\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A subset of the MMAU benchmark focused specifically on music understanding and reasoning tasks. Part of a comprehensive multimodal audio understanding benchmark that evaluates models on expert-level knowledge and complex reasoning across music audio clips.\",\n  \"paper_link\": \"https://arxiv.org/abs/2410.19168\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.851711+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.851711+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmau-sound.json",
    "content": "{\n  \"benchmark_id\": \"mmau-sound\",\n  \"name\": \"MMAU Sound\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"audio\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A subset of the MMAU benchmark focused specifically on environmental sound understanding and reasoning tasks. Part of a comprehensive multimodal audio understanding benchmark that evaluates models on expert-level knowledge and complex reasoning across environmental sound clips.\",\n  \"paper_link\": \"https://arxiv.org/abs/2410.19168\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.859503+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.859503+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmau-speech.json",
    "content": "{\n  \"benchmark_id\": \"mmau-speech\",\n  \"name\": \"MMAU Speech\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"audio\", \"multimodal\", \"reasoning\", \"speech-to-text\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A subset of the MMAU benchmark focused specifically on speech understanding and reasoning tasks. Part of a comprehensive multimodal audio understanding benchmark that evaluates models on expert-level knowledge and complex reasoning across speech audio clips.\",\n  \"paper_link\": \"https://arxiv.org/abs/2410.19168\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.863540+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.863540+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmau.json",
    "content": "{\n  \"benchmark_id\": \"mmau\",\n  \"name\": \"MMAU\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"audio\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A massive multi-task audio understanding and reasoning benchmark comprising 10,000 carefully curated audio clips paired with human-annotated natural language questions spanning speech, environmental sounds, and music. Requires expert-level knowledge and complex reasoning across 27 distinct skills.\",\n  \"paper_link\": \"https://arxiv.org/abs/2410.19168\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.846435+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.846435+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmbench-test.json",
    "content": "{\n  \"benchmark_id\": \"mmbench-test\",\n  \"name\": \"MMBench_test\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Test set of MMBench, a bilingual benchmark for assessing multi-modal capabilities of vision-language models through multiple-choice questions in both English and Chinese, providing systematic evaluation across diverse vision-language tasks.\",\n  \"paper_link\": \"https://arxiv.org/abs/2307.06281\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.607904+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.607904+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmbench-v1.1.json",
    "content": "{\n  \"benchmark_id\": \"mmbench-v1.1\",\n  \"name\": \"MMBench-V1.1\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Version 1.1 of MMBench, an improved bilingual benchmark for assessing multi-modal capabilities of vision-language models through multiple-choice questions in both English and Chinese, providing systematic evaluation across diverse vision-language tasks.\",\n  \"paper_link\": \"https://arxiv.org/abs/2307.06281\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.868950+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.868950+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmbench-video.json",
    "content": "{\n  \"benchmark_id\": \"mmbench-video\",\n  \"name\": \"MMBench-Video\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"video\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A long-form multi-shot benchmark for holistic video understanding that incorporates approximately 600 web videos from YouTube spanning 16 major categories, with each video ranging from 30 seconds to 6 minutes. Includes roughly 2,000 original question-answer pairs covering 26 fine-grained capabilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.14515\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.738914+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.738914+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmbench.json",
    "content": "{\n  \"benchmark_id\": \"mmbench\",\n  \"name\": \"MMBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A bilingual benchmark for assessing multi-modal capabilities of vision-language models through multiple-choice questions in both English and Chinese, providing systematic evaluation across diverse vision-language tasks with robust metrics.\",\n  \"paper_link\": \"https://arxiv.org/abs/2307.06281\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.235585+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.235585+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mme-realworld.json",
    "content": "{\n  \"benchmark_id\": \"mme-realworld\",\n  \"name\": \"MME-RealWorld\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"general\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive evaluation benchmark for Multimodal Large Language Models featuring over 13,366 high-resolution images and 29,429 question-answer pairs across 43 subtasks and 5 real-world scenarios. The largest manually annotated multimodal benchmark to date, designed to test MLLMs on challenging high-resolution real-world scenarios.\",\n  \"paper_link\": \"https://arxiv.org/abs/2408.13257\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.877676+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.877676+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mme.json",
    "content": "{\n  \"benchmark_id\": \"mme\",\n  \"name\": \"MME\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive evaluation benchmark for Multimodal Large Language Models measuring both perception and cognition abilities across 14 subtasks. Features manually designed instruction-answer pairs to avoid data leakage and provides systematic quantitative assessment of MLLM capabilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2306.13394\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.022505+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.022505+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmlu-(cot).json",
    "content": "{\n  \"benchmark_id\": \"mmlu-(cot)\",\n  \"name\": \"MMLU (CoT)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\", \"math\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Chain-of-Thought variant of the Massive Multitask Language Understanding benchmark, evaluating language models across 57 tasks including elementary mathematics, US history, computer science, law, and other professional and academic subjects. This version uses chain-of-thought prompting to elicit step-by-step reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/2009.03300\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.330830+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.330830+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmlu-base.json",
    "content": "{\n  \"benchmark_id\": \"mmlu-base\",\n  \"name\": \"MMLU-Base\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\", \"math\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Base version of the Massive Multitask Language Understanding benchmark, evaluating language models across 57 tasks including elementary mathematics, US history, computer science, law, and other professional and academic subjects. Designed to comprehensively measure the breadth and depth of a model's academic and professional understanding.\",\n  \"paper_link\": \"https://arxiv.org/abs/2009.03300\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.562710+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.562710+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmlu-chat.json",
    "content": "{\n  \"benchmark_id\": \"mmlu-chat\",\n  \"name\": \"MMLU Chat\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\", \"math\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Chat-format variant of the Massive Multitask Language Understanding benchmark, evaluating language models across 57 tasks including elementary mathematics, US history, computer science, law, and other professional and academic subjects. This version uses conversational prompting format for model evaluation.\",\n  \"paper_link\": \"https://arxiv.org/abs/2009.03300\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.095600+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.095600+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmlu-french.json",
    "content": "{\n  \"benchmark_id\": \"mmlu-french\",\n  \"name\": \"MMLU French\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\", \"math\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"fr\",\n  \"description\": \"French language variant of the Massive Multitask Language Understanding benchmark, evaluating language models across 57 tasks including elementary mathematics, US history, computer science, law, and other professional and academic subjects. This multilingual version tests model performance in French.\",\n  \"paper_link\": \"https://arxiv.org/abs/2009.03300\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.175211+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.175211+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmlu-pro.json",
    "content": "{\n  \"benchmark_id\": \"mmlu-pro\",\n  \"name\": \"MMLU-Pro\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\", \"math\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to original MMLU.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.01574\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.408351+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.408351+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmlu-prox.json",
    "content": "{\n  \"benchmark_id\": \"mmlu-prox\",\n  \"name\": \"MMLU-ProX\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\", \"math\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Extended version of MMLU-Pro providing additional challenging multiple-choice questions for evaluating language models across diverse academic and professional domains. Built on the foundation of the Massive Multitask Language Understanding benchmark framework.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.01574\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.738623+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.738623+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmlu-redux-2.0.json",
    "content": "{\n  \"benchmark_id\": \"mmlu-redux-2.0\",\n  \"name\": \"MMLU-redux-2.0\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\", \"math\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A curated version of the MMLU benchmark featuring manually re-annotated 5,700 questions across 57 subjects to identify and correct errors in the original dataset. Addresses the 6.49% error rate found in MMLU and provides more reliable evaluation metrics for language models.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.04127\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.518552+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.518552+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmlu-redux.json",
    "content": "{\n  \"benchmark_id\": \"mmlu-redux\",\n  \"name\": \"MMLU-Redux\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\", \"math\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"An improved version of the MMLU benchmark featuring manually re-annotated questions to identify and correct errors in the original dataset. Provides more reliable evaluation metrics for language models by addressing dataset quality issues found in the original MMLU.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.04127\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/mmlu-stem.json",
    "content": "{\n  \"benchmark_id\": \"mmlu-stem\",\n  \"name\": \"MMLU-STEM\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\", \"physics\", \"chemistry\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"STEM-focused subset of the Massive Multitask Language Understanding benchmark, evaluating language models on science, technology, engineering, and mathematics topics including physics, chemistry, mathematics, and other technical subjects.\",\n  \"paper_link\": \"https://arxiv.org/abs/2009.03300\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.495405+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.495405+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmlu.json",
    "content": "{\n  \"benchmark_id\": \"mmlu\",\n  \"name\": \"MMLU\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\", \"language\", \"math\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains\",\n  \"paper_link\": \"https://arxiv.org/abs/2009.03300\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.200416+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.200416+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmmlu.json",
    "content": "{\n  \"benchmark_id\": \"mmmlu\",\n  \"name\": \"MMMLU\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\", \"math\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Multilingual Massive Multitask Language Understanding dataset released by OpenAI, featuring professionally translated MMLU test questions across 14 languages including Arabic, Bengali, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Swahili, Yoruba, and Chinese. Contains approximately 15,908 multiple-choice questions per language covering 57 subjects.\",\n  \"paper_link\": \"https://arxiv.org/abs/2009.03300\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.144789+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.144789+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmmu-(val).json",
    "content": "{\n  \"benchmark_id\": \"mmmu-(val)\",\n  \"name\": \"MMMU (val)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\", \"general\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Validation set of the Massive Multi-discipline Multimodal Understanding and Reasoning benchmark. Features college-level multimodal questions across 6 core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering) spanning 30 subjects and 183 subfields with diverse image types including charts, diagrams, maps, and tables.\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.16502\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.593262+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.593262+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmmu-(validation).json",
    "content": "{\n  \"benchmark_id\": \"mmmu-(validation)\",\n  \"name\": \"MMMU (validation)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\", \"general\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Validation set of the Massive Multi-discipline Multimodal Understanding and Reasoning benchmark. Features college-level multimodal questions across 6 core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering) spanning 30 subjects and 183 subfields with diverse image types including charts, diagrams, maps, and tables.\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.16502\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.118197+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.118197+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmmu-pro.json",
    "content": "{\n  \"benchmark_id\": \"mmmu-pro\",\n  \"name\": \"MMMU-Pro\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\", \"general\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A more robust multi-discipline multimodal understanding benchmark that enhances MMMU through a three-step process: filtering text-only answerable questions, augmenting candidate options, and introducing vision-only input settings. Achieves significantly lower model performance (16.8-26.9%) compared to original MMMU, providing more rigorous evaluation that closely mimics real-world scenarios.\",\n  \"paper_link\": \"https://arxiv.org/abs/2409.02813\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.282252+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.282252+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmmu.json",
    "content": "{\n  \"benchmark_id\": \"mmmu\",\n  \"name\": \"MMMU\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"reasoning\", \"general\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to evaluate multimodal models on college-level subject knowledge and deliberate reasoning. Contains 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering across 30 subjects and 183 subfields.\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.16502\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.130105+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.130105+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmmuval.json",
    "content": "{\n  \"benchmark_id\": \"mmmuval\",\n  \"name\": \"MMMUval\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"general\", \"reasoning\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Validation set for MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning) benchmark, designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning across Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.16502\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.575948+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.575948+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmstar.json",
    "content": "{\n  \"benchmark_id\": \"mmstar\",\n  \"name\": \"MMStar\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\", \"general\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MMStar is an elite vision-indispensable multimodal benchmark comprising 1,500 challenge samples meticulously selected by humans to evaluate 6 core capabilities and 18 detailed axes. The benchmark addresses issues of visual content unnecessity and unintentional data leakage in existing multimodal evaluations.\",\n  \"paper_link\": \"https://arxiv.org/abs/2403.20330\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.660584+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.660584+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmt-bench.json",
    "content": "{\n  \"benchmark_id\": \"mmt-bench\",\n  \"name\": \"MMT-Bench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\", \"general\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MMT-Bench is a comprehensive multimodal benchmark for evaluating Large Vision-Language Models towards multitask AGI. It comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding.\",\n  \"paper_link\": \"https://arxiv.org/abs/2404.16006\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.674184+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.674184+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmvet.json",
    "content": "{\n  \"benchmark_id\": \"mmvet\",\n  \"name\": \"MMVet\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\", \"general\", \"spatial_reasoning\", \"math\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MM-Vet is an evaluation benchmark that examines large multimodal models on complicated multimodal tasks requiring integrated capabilities. It assesses six core vision-language capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math through questions that require one or more of these capabilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2308.02490\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.684742+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.684742+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mmvetgpt4turbo.json",
    "content": "{\n  \"benchmark_id\": \"mmvetgpt4turbo\",\n  \"name\": \"MMVetGPT4Turbo\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\", \"general\", \"spatial_reasoning\", \"math\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MM-Vet evaluation using GPT-4 Turbo for scoring. This variant of MM-Vet examines large multimodal models on complicated multimodal tasks requiring integrated capabilities across six core vision-language abilities: recognition, knowledge, spatial awareness, language generation, OCR, and math.\",\n  \"paper_link\": \"https://arxiv.org/abs/2308.02490\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.611567+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.611567+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mobileminiwob++-sr.json",
    "content": "{\n  \"benchmark_id\": \"mobileminiwob++-sr\",\n  \"name\": \"MobileMiniWob++_SR\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"frontend_development\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MobileMiniWob++ SR (Success Rate) is an adaptation of the MiniWob++ web interaction benchmark for mobile Android environments within AndroidWorld. It comprises 92 web interaction tasks adapted for touch-based mobile interfaces, evaluating agents' ability to navigate and interact with web applications on mobile devices.\",\n  \"paper_link\": \"https://arxiv.org/abs/2405.14573\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.816755+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.816755+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/mrcr-1m-(pointwise).json",
    "content": "{\n  \"benchmark_id\": \"mrcr-1m-(pointwise)\",\n  \"name\": \"MRCR 1M (pointwise)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\", \"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MRCR 1M (pointwise) is a variant of the Multi-Round Coreference Resolution benchmark that uses pointwise evaluation for ultra-long contexts (~1M tokens). This version evaluates each response independently rather than comparatively, testing models' absolute performance on long-context reasoning tasks.\",\n  \"paper_link\": \"https://arxiv.org/abs/2409.12640\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.912789+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.912789+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mrcr-1m.json",
    "content": "{\n  \"benchmark_id\": \"mrcr-1m\",\n  \"name\": \"MRCR 1M\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\", \"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MRCR 1M is a variant of the Multi-Round Coreference Resolution benchmark designed for testing extremely long context capabilities with approximately 1 million tokens. It evaluates models' ability to maintain reasoning and attention across ultra-long conversations.\",\n  \"paper_link\": \"https://arxiv.org/abs/2409.12640\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.954336+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.954336+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mrcr-v2-(8-needle).json",
    "content": "{\n  \"benchmark_id\": \"mrcr-v2-(8-needle)\",\n  \"name\": \"MRCR v2 (8-needle)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\", \"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MRCR v2 (8-needle) is a variant of the Multi-Round Coreference Resolution benchmark that includes 8 needle items to retrieve from long contexts. This tests models' ability to simultaneously track and reason about multiple pieces of information across extended conversations.\",\n  \"paper_link\": \"https://arxiv.org/abs/2409.12640\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.010914+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.010914+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mrcr-v2.json",
    "content": "{\n  \"benchmark_id\": \"mrcr-v2\",\n  \"name\": \"MRCR v2\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\", \"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MRCR v2 (Multi-Round Coreference Resolution version 2) is an enhanced version of the synthetic long-context reasoning task. It extends the original MRCR framework with improved evaluation criteria and additional complexity for testing models' ability to maintain attention and reasoning across extended contexts.\",\n  \"paper_link\": \"https://arxiv.org/abs/2409.12640\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.963241+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.963241+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mrcr.json",
    "content": "{\n  \"benchmark_id\": \"mrcr\",\n  \"name\": \"MRCR\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\", \"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MRCR (Multi-Round Coreference Resolution) is a synthetic long-context reasoning task where models must navigate long conversations to reproduce specific model outputs. It tests the ability to distinguish between similar requests and reason about ordering while maintaining attention across extended contexts.\",\n  \"paper_link\": \"https://arxiv.org/abs/2409.12640\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.887445+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.887445+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mt-bench.json",
    "content": "{\n  \"benchmark_id\": \"mt-bench\",\n  \"name\": \"MT-Bench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"communication\", \"reasoning\", \"general\", \"roleplay\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 100.0,\n  \"language\": \"en\",\n  \"description\": \"MT-Bench is a challenging multi-turn benchmark that measures the ability of large language models to engage in coherent, informative, and engaging conversations. It uses strong LLMs as judges for scalable and explainable evaluation of multi-turn dialogue capabilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2306.05685\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.516415+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.516415+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mtvqa.json",
    "content": "{\n  \"benchmark_id\": \"mtvqa\",\n  \"name\": \"MTVQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"text-to-image\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MTVQA (Multilingual Text-Centric Visual Question Answering) is the first benchmark featuring high-quality human expert annotations across 9 diverse languages, consisting of 6,778 question-answer pairs across 2,116 images. It addresses visual-textual misalignment problems in multilingual text-centric VQA.\",\n  \"paper_link\": \"https://arxiv.org/abs/2405.11985\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.587333+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.587333+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/muirbench.json",
    "content": "{\n  \"benchmark_id\": \"muirbench\",\n  \"name\": \"MuirBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive benchmark for robust multi-image understanding capabilities of multimodal LLMs. Consists of 12 diverse multi-image tasks involving 10 categories of multi-image relations (e.g., multiview, temporal relations, narrative, complementary). Comprises 11,264 images and 2,600 multiple-choice questions created in a pairwise manner, where each standard instance is paired with an unanswerable variant for reliable assessment.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.09411\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.888428+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.888428+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/multi-if.json",
    "content": "{\n  \"benchmark_id\": \"multi-if\",\n  \"name\": \"Multi-IF\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"communication\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Multi-IF benchmarks LLMs on multi-turn and multilingual instruction following. It expands upon IFEval by incorporating multi-turn sequences and translating English prompts into 7 other languages, resulting in 4,501 multilingual conversations with three turns each. The benchmark reveals that current leading LLMs struggle with maintaining accuracy in multi-turn instructions and shows higher error rates for non-Latin script languages.\",\n  \"paper_link\": \"https://arxiv.org/abs/2410.15553\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.638787+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.638787+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/multi-swe-bench.json",
    "content": "{\n  \"benchmark_id\": \"multi-swe-bench\",\n  \"name\": \"Multi-SWE-Bench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A multilingual benchmark for issue resolving that evaluates Large Language Models' ability to resolve software issues across diverse programming ecosystems. Covers 7 programming languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances carefully annotated by 68 expert annotators. Addresses limitations of existing benchmarks that focus almost exclusively on Python.\",\n  \"paper_link\": \"https://arxiv.org/abs/2504.02605\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-15T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/multichallenge-(o3-mini-grader).json",
    "content": "{\n  \"benchmark_id\": \"multichallenge-(o3-mini-grader)\",\n  \"name\": \"MultiChallenge (o3-mini grader)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A realistic multi-turn conversation evaluation benchmark that challenges frontier LLMs across four key areas: instruction retention, inference memory, reliable versioned editing, and self-coherence. Despite near-perfect scores on existing benchmarks, frontier models achieve less than 50% accuracy on MultiChallenge.\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.17399\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.235758+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.235758+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/multichallenge.json",
    "content": "{\n  \"benchmark_id\": \"multichallenge\",\n  \"name\": \"Multi-Challenge\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"communication\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MultiChallenge is a realistic multi-turn conversation evaluation benchmark that challenges frontier LLMs across four key categories: instruction retention (maintaining instructions throughout conversations), inference memory (recalling and connecting details from previous turns), reliable versioned editing (adapting to evolving instructions during collaborative editing), and self-coherence (avoiding contradictions in responses). The benchmark evaluates models on sustained, contextually complex dialogues across diverse topics including travel planning, technical documentation, and professional communication.\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.17399\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/multilf.json",
    "content": "{\n  \"benchmark_id\": \"multilf\",\n  \"name\": \"MultiLF\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MultiLF benchmark\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.628191+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.628191+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/multilingual-mgsm-(cot).json",
    "content": "{\n  \"benchmark_id\": \"multilingual-mgsm-(cot)\",\n  \"name\": \"Multilingual MGSM (CoT)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Multilingual Grade School Math (MGSM) benchmark evaluates language models' chain-of-thought reasoning abilities across ten typologically diverse languages. Contains 250 grade-school math problems manually translated from GSM8K dataset into languages including Bengali and Swahili.\",\n  \"paper_link\": \"https://arxiv.org/abs/2210.03057\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.402248+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.402248+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/multilingual-mmlu.json",
    "content": "{\n  \"benchmark_id\": \"multilingual-mmlu\",\n  \"name\": \"Multilingual MMLU\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MMLU-ProX is a comprehensive multilingual benchmark covering 29 typologically diverse languages, building upon MMLU-Pro. Each language version consists of 11,829 identical questions enabling direct cross-linguistic comparisons. The benchmark evaluates large language models' reasoning capabilities across linguistic and cultural boundaries through challenging, reasoning-focused questions with 10 answer choices.\",\n  \"paper_link\": \"https://arxiv.org/abs/2503.10497\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.139086+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.139086+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/multipl-e-humaneval.json",
    "content": "{\n  \"benchmark_id\": \"multipl-e-humaneval\",\n  \"name\": \"Multipl-E HumanEval\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MultiPL-E is a scalable and extensible approach to benchmarking neural code generation that translates unit test-driven code generation benchmarks across multiple programming languages. It extends the HumanEval benchmark to 18 additional programming languages, enabling evaluation of code generation models across diverse programming paradigms and providing insights into how models generalize programming knowledge across language boundaries.\",\n  \"paper_link\": \"https://arxiv.org/abs/2208.08227\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.345081+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.345081+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/multipl-e-mbpp.json",
    "content": "{\n  \"benchmark_id\": \"multipl-e-mbpp\",\n  \"name\": \"Multipl-E MBPP\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MultiPL-E extends the Mostly Basic Python Problems (MBPP) benchmark to 18+ programming languages for evaluating multilingual code generation capabilities. MBPP contains 974 crowd-sourced programming problems designed to be solvable by entry-level programmers, covering programming fundamentals and standard library functionality. Each problem includes a task description, code solution, and automated test cases.\",\n  \"paper_link\": \"https://arxiv.org/abs/2208.08227\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.353635+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.353635+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/multipl-e.json",
    "content": "{\n  \"benchmark_id\": \"multipl-e\",\n  \"name\": \"MultiPL-E\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MultiPL-E is a scalable and extensible system for translating unit test-driven code generation benchmarks to multiple programming languages. It extends HumanEval and MBPP Python benchmarks to 18 additional programming languages, enabling evaluation of neural code generation models across diverse programming paradigms and language features.\",\n  \"paper_link\": \"https://arxiv.org/abs/2208.08227\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.311919+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.311919+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/musiccaps.json",
    "content": "{\n  \"benchmark_id\": \"musiccaps\",\n  \"name\": \"MusicCaps\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"audio\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MusicCaps is a dataset composed of 5,521 music examples, each labeled with an English aspect list and a free text caption written by musicians. The dataset contains 10-second music clips from AudioSet paired with rich textual descriptions that capture sonic qualities and musical elements like genre, mood, tempo, instrumentation, and rhythm. Created to support research in music-text understanding and generation tasks.\",\n  \"paper_link\": \"https://arxiv.org/abs/2301.11325\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.892085+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.892085+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/musr.json",
    "content": "{\n  \"benchmark_id\": \"musr\",\n  \"name\": \"MuSR\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MuSR (Multistep Soft Reasoning) is a benchmark for evaluating language models on multistep soft reasoning tasks specified in natural language narratives. Created through a neurosymbolic synthetic-to-natural generation algorithm, it generates complex reasoning scenarios like murder mysteries roughly 1000 words in length that challenge current LLMs including GPT-4. The benchmark tests chain-of-thought reasoning capabilities across domains involving commonsense reasoning about physical and social situations.\",\n  \"paper_link\": \"https://arxiv.org/abs/2310.16049\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.708705+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.708705+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/mvbench.json",
    "content": "{\n  \"benchmark_id\": \"mvbench\",\n  \"name\": \"MVBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"video\", \"multimodal\", \"spatial_reasoning\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive multi-modal video understanding benchmark covering 20 challenging video tasks that require temporal understanding beyond single-frame analysis. Tasks span from perception to cognition, including action recognition, temporal reasoning, spatial reasoning, object interaction, scene transition, and counterfactual inference. Uses a novel static-to-dynamic method to systematically generate video tasks from existing annotations.\",\n  \"paper_link\": \"https://arxiv.org/abs/2311.17005\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.615534+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.615534+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/natural-questions.json",
    "content": "{\n  \"benchmark_id\": \"natural-questions\",\n  \"name\": \"Natural Questions\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\", \"search\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Natural Questions is a question answering dataset featuring real anonymized queries issued to Google search engine. It contains 307,373 training examples where annotators provide long answers (passages) and short answers (entities) from Wikipedia pages, or mark them as unanswerable.\",\n  \"paper_link\": \"https://arxiv.org/abs/1901.08634\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.178778+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.178778+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/natural2code.json",
    "content": "{\n  \"benchmark_id\": \"natural2code\",\n  \"name\": \"Natural2Code\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"NaturalCodeBench (NCB) is a challenging code benchmark designed to mirror the complexity and variety of real-world coding tasks. It comprises 402 high-quality problems in Python and Java, selected from natural user queries from online coding services, covering 6 different domains.\",\n  \"paper_link\": \"https://arxiv.org/abs/2405.04520\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.518784+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.518784+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/nexus.json",
    "content": "{\n  \"benchmark_id\": \"nexus\",\n  \"name\": \"Nexus\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"NexusRaven benchmark for evaluating function calling capabilities of large language models in zero-shot scenarios across cybersecurity tools and API interactions\",\n  \"paper_link\": \"https://openreview.net/pdf?id=5lcPe6DqfI\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.391550+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.391550+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/nih-multi-needle.json",
    "content": "{\n  \"benchmark_id\": \"nih-multi-needle\",\n  \"name\": \"NIH/Multi-needle\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Multi-needle in a haystack benchmark for evaluating long-context comprehension capabilities of language models by testing retrieval of multiple target pieces of information from extended documents\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.11230\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.465778+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.465778+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/nmos.json",
    "content": "{\n  \"benchmark_id\": \"nmos\",\n  \"name\": \"NMOS\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 100.0,\n  \"language\": \"en\",\n  \"description\": \"NMOS evaluation benchmark for assessing model performance on specialized tasks\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.895373+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.895373+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/nq.json",
    "content": "{\n  \"benchmark_id\": \"nq\",\n  \"name\": \"NQ\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Natural Questions (NQ) benchmark containing real user questions issued to Google search with answers found from Wikipedia, designed for training and evaluation of automatic question answering systems\",\n  \"paper_link\": \"https://aclanthology.org/Q19-1026/\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.088246+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.088246+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/ocrbench-v2-(en).json",
    "content": "{\n  \"benchmark_id\": \"ocrbench-v2-(en)\",\n  \"name\": \"OCRBench-V2 (en)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"image-to-text\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"OCRBench v2 English subset: Enhanced benchmark for evaluating Large Multimodal Models on visual text localization and reasoning with English text content\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.00321\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.926330+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.926330+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/ocrbench-v2-(zh).json",
    "content": "{\n  \"benchmark_id\": \"ocrbench-v2-(zh)\",\n  \"name\": \"OCRBench-V2 (zh)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"image-to-text\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"zh\",\n  \"description\": \"OCRBench v2 Chinese subset: Enhanced benchmark for evaluating Large Multimodal Models on visual text localization and reasoning with Chinese text content\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.00321\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.944963+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.944963+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/ocrbench-v2.json",
    "content": "{\n  \"benchmark_id\": \"ocrbench-v2\",\n  \"name\": \"OCRBench_V2\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"image-to-text\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"OCRBench v2: Enhanced large-scale bilingual benchmark for evaluating Large Multimodal Models on visual text localization and reasoning with 10,000 human-verified question-answering pairs across 8 core OCR capabilities\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.00321\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.898625+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.898625+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/ocrbench.json",
    "content": "{\n  \"benchmark_id\": \"ocrbench\",\n  \"name\": \"OCRBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"image-to-text\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"OCRBench: Comprehensive evaluation benchmark for assessing Optical Character Recognition (OCR) capabilities in Large Multimodal Models across text recognition, scene text VQA, and document understanding tasks\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.07895\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.304601+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.304601+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/odinw.json",
    "content": "{\n  \"benchmark_id\": \"odinw\",\n  \"name\": \"ODinW\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\"],\n  \"modality\": \"image\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Object Detection in the Wild (ODinW) benchmark for evaluating object detection models' task-level transfer ability across diverse real-world datasets in terms of prediction accuracy and adaptation efficiency\",\n  \"paper_link\": \"https://arxiv.org/abs/2112.03857\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.902703+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.902703+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/ojbench.json",
    "content": "{\n  \"benchmark_id\": \"ojbench\",\n  \"name\": \"OJBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"OJBench is a competition-level code benchmark designed to assess the competitive-level code reasoning abilities of large language models. It comprises 232 programming competition problems from NOI and ICPC, categorized into Easy, Medium, and Hard difficulty levels. The benchmark evaluates models' ability to solve complex competitive programming challenges using Python and C++.\",\n  \"paper_link\": \"https://arxiv.org/abs/2506.16395\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/olympiadbench.json",
    "content": "{\n  \"benchmark_id\": \"olympiadbench\",\n  \"name\": \"OlympiadBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\", \"physics\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems. Comprises 8,476 math and physics problems from international and Chinese Olympiads and the Chinese college entrance exam, featuring expert-level annotations for step-by-step reasoning. Includes both text-only and multimodal problems in English and Chinese.\",\n  \"paper_link\": \"https://arxiv.org/abs/2402.14008\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.821916+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.821916+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/omnibench-music.json",
    "content": "{\n  \"benchmark_id\": \"omnibench-music\",\n  \"name\": \"OmniBench Music\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"audio\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Music component of OmniBench, a comprehensive benchmark for evaluating omni-language models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. The music category includes various compositions and performances that require integrated understanding across text, image, and audio modalities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2409.15272\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.911093+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.911093+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/omnibench.json",
    "content": "{\n  \"benchmark_id\": \"omnibench\",\n  \"name\": \"OmniBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A novel multimodal benchmark designed to evaluate large language models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Comprises 1,142 question-answer pairs covering 8 task categories from basic perception to complex inference, with a unique constraint that accurate responses require integrated understanding of all three modalities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2409.15272\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.906402+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.906402+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/omnimath.json",
    "content": "{\n  \"benchmark_id\": \"omnimath\",\n  \"name\": \"OmniMath\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A Universal Olympiad Level Mathematic Benchmark for Large Language Models containing 4,428 competition-level problems with rigorous human annotation, categorized into over 33 sub-domains and spanning more than 10 distinct difficulty levels\",\n  \"paper_link\": \"https://arxiv.org/abs/2410.07985\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.271468+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.271468+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/open-rewrite.json",
    "content": "{\n  \"benchmark_id\": \"open-rewrite\",\n  \"name\": \"Open-rewrite\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"writing\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"OpenRewriteEval is a benchmark for evaluating open-ended rewriting of long-form texts, covering a wide variety of rewriting types expressed through natural language instructions including formality, expansion, conciseness, paraphrasing, and tone and style transfer.\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.15685\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.435616+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.435616+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/openai-mmlu.json",
    "content": "{\n  \"benchmark_id\": \"openai-mmlu\",\n  \"name\": \"OpenAI MMLU\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\", \"math\", \"legal\", \"healthcare\", \"finance\", \"physics\", \"chemistry\", \"economics\", \"psychology\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark that measures a text model's multitask accuracy across 57 diverse academic and professional subjects. The test covers elementary mathematics, US history, computer science, law, morality, business ethics, clinical knowledge, and many other domains spanning STEM, humanities, social sciences, and professional fields. To attain high accuracy, models must possess extensive world knowledge and problem-solving ability.\",\n  \"paper_link\": \"https://arxiv.org/abs/2009.03300\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.043675+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.043675+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/openai-mrcr%3A-2-needle-128k.json",
    "content": "{\n  \"benchmark_id\": \"openai-mrcr:-2-needle-128k\",\n  \"name\": \"OpenAI-MRCR: 2 needle 128k\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Multi-round Co-reference Resolution (MRCR) benchmark for evaluating an LLM's ability to distinguish between multiple needles hidden in long context. Models are given a long, multi-turn synthetic conversation and must retrieve a specific instance of a repeated request, requiring reasoning and disambiguation skills beyond simple retrieval.\",\n  \"paper_link\": \"https://arxiv.org/abs/2403.05530\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.266878+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.266878+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/openai-mrcr%3A-2-needle-1m.json",
    "content": "{\n  \"benchmark_id\": \"openai-mrcr:-2-needle-1m\",\n  \"name\": \"OpenAI-MRCR: 2 needle 1M\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Multi-Round Co-reference Resolution benchmark that tests an LLM's ability to distinguish between multiple similar needles hidden in long conversations. Models must reproduce specific instances of content (e.g., 'Return the 2nd poem about tapirs') from multi-turn synthetic conversations, requiring reasoning about context, ordering, and subtle differences between similar outputs.\",\n  \"paper_link\": \"https://arxiv.org/abs/2409.12640\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.280285+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.280285+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/openai-mrcr%3A-2-needle-256k.json",
    "content": "{\n  \"benchmark_id\": \"openai-mrcr:-2-needle-256k\",\n  \"name\": \"OpenAI-MRCR: 2 needle 256k\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Multi-Round Co-reference Resolution (MRCR) benchmark that tests long-context reasoning by evaluating a model's ability to distinguish between similar outputs, reason about ordering, and reproduce specific content from multi-turn conversations containing multiple writing requests on overlapping topics at 256k tokens.\",\n  \"paper_link\": \"https://arxiv.org/abs/2409.12640\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/openbookqa.json",
    "content": "{\n  \"benchmark_id\": \"openbookqa\",\n  \"name\": \"OpenBookQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"OpenBookQA is a question-answering dataset modeled after open book exams for assessing human understanding. It contains 5,957 multiple-choice elementary-level science questions that probe understanding of 1,326 core science facts and their application to novel situations, requiring combination of open book facts with broad common knowledge through multi-hop reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/1809.02789\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.129348+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.129348+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/osworld-extended.json",
    "content": "{\n  \"benchmark_id\": \"osworld-extended\",\n  \"name\": \"OSWorld Extended\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"OSWorld is a scalable, real computer environment benchmark for evaluating multimodal agents on open-ended tasks across Ubuntu, Windows, and macOS. It comprises 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows. The benchmark evaluates agents' ability to interact with computer interfaces using screenshots and actions in realistic computing environments.\",\n  \"paper_link\": \"https://arxiv.org/abs/2404.07972\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.113488+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.113488+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/osworld-screenshot-only.json",
    "content": "{\n  \"benchmark_id\": \"osworld-screenshot-only\",\n  \"name\": \"OSWorld Screenshot-only\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"vision\", \"general\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"OSWorld Screenshot-only: A variant of the OSWorld benchmark that evaluates multimodal AI agents using only screenshot observations to complete open-ended computer tasks across real operating systems (Ubuntu, Windows, macOS). Tests agents' ability to perform complex workflows involving web apps, desktop applications, file I/O, and multi-application tasks through visual interface understanding and GUI grounding.\",\n  \"paper_link\": \"https://arxiv.org/abs/2404.07972\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.109647+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.109647+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/osworld.json",
    "content": "{\n  \"benchmark_id\": \"osworld\",\n  \"name\": \"OSWorld\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"general\", \"vision\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"OSWorld: The first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS with 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows\",\n  \"paper_link\": \"https://arxiv.org/abs/2404.07972\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.935426+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.935426+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/pathmcqa.json",
    "content": "{\n  \"benchmark_id\": \"pathmcqa\",\n  \"name\": \"PathMCQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"healthcare\", \"vision\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"PathMMU is a massive multimodal expert-level benchmark for understanding and reasoning in pathology, containing 33,428 multimodal multi-choice questions and 24,067 images validated by seven pathologists. It evaluates Large Multimodal Models (LMMs) performance on pathology tasks, with the top-performing model GPT-4V achieving only 49.8% zero-shot performance compared to 71.8% for human pathologists.\",\n  \"paper_link\": \"https://arxiv.org/abs/2401.16355\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.036453+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.036453+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/perceptiontest.json",
    "content": "{\n  \"benchmark_id\": \"perceptiontest\",\n  \"name\": \"PerceptionTest\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"video\", \"multimodal\", \"reasoning\", \"physics\", \"spatial_reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A novel multimodal video benchmark designed to evaluate perception and reasoning skills of pre-trained models across video, audio, and text modalities. Contains 11.6k real-world videos (average 23 seconds) filmed by participants worldwide, densely annotated with six types of labels. Focuses on skills (Memory, Abstraction, Physics, Semantics) and reasoning types (descriptive, explanatory, predictive, counterfactual). Shows significant performance gap between human baseline (91.4%) and state-of-the-art video QA models (46.2%).\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.13786\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.708910+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.708910+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/phibench.json",
    "content": "{\n  \"benchmark_id\": \"phibench\",\n  \"name\": \"PhiBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"math\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"PhiBench is an internal benchmark designed to evaluate diverse skills and reasoning abilities of language models, covering a wide range of tasks including coding (debugging, extending incomplete code, explaining code snippets) and mathematics (identifying proof errors, generating related problems). Created by Microsoft's research team to address limitations of standard academic benchmarks and guide the development of the Phi-4 model.\",\n  \"paper_link\": \"https://arxiv.org/abs/2412.08905\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.121593+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.121593+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/physicsfinals.json",
    "content": "{\n  \"benchmark_id\": \"physicsfinals\",\n  \"name\": \"PhysicsFinals\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"physics\", \"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"PHYSICS is a comprehensive benchmark for university-level physics problem solving, containing 1,297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. Each problem requires advanced physics knowledge and mathematical reasoning. Even advanced models like o3-mini achieve only 59.9% accuracy.\",\n  \"paper_link\": \"https://arxiv.org/abs/2503.21821\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.981919+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.981919+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/piqa.json",
    "content": "{\n  \"benchmark_id\": \"piqa\",\n  \"name\": \"PIQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"physics\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"PIQA (Physical Interaction: Question Answering) is a benchmark dataset for physical commonsense reasoning in natural language. It tests AI systems' ability to answer questions requiring physical world knowledge through multiple choice questions with everyday situations, focusing on atypical solutions inspired by instructables.com. The dataset contains 21,000 multiple choice questions where models must choose the most appropriate solution for physical interactions.\",\n  \"paper_link\": \"https://arxiv.org/abs/1911.11641\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.133817+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.133817+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/pointgrounding.json",
    "content": "{\n  \"benchmark_id\": \"pointgrounding\",\n  \"name\": \"PointGrounding\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"spatial_reasoning\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"PointArena is a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. It includes Point-Bench, a curated dataset of ~1,000 pointing tasks across five categories: Spatial (positional references), Affordance (functional part identification), Counting (attribute-based grouping), Steerable (relative pointing), and Reasoning (open-ended visual inference). The benchmark evaluates language-guided pointing capabilities in vision-language models.\",\n  \"paper_link\": \"https://arxiv.org/abs/2505.09990\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.914897+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.914897+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/polymath-en.json",
    "content": "{\n  \"benchmark_id\": \"polymath-en\",\n  \"name\": \"PolyMath-en\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 difficulty levels from easy to hard, ensuring difficulty comprehensiveness, language diversity, and high-quality translation. The benchmark evaluates mathematical reasoning capabilities of large language models across diverse linguistic contexts, making it a highly discriminative multilingual mathematical benchmark.\",\n  \"paper_link\": \"https://arxiv.org/abs/2504.18428\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/polymath.json",
    "content": "{\n  \"benchmark_id\": \"polymath\",\n  \"name\": \"PolyMATH\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\", \"spatial_reasoning\", \"multimodal\", \"vision\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Polymath is a challenging multi-modal mathematical reasoning benchmark designed to evaluate the general cognitive reasoning abilities of Multi-modal Large Language Models (MLLMs). The benchmark comprises 5,000 manually collected high-quality images of cognitive textual and visual challenges across 10 distinct categories, including pattern recognition, spatial reasoning, and relative reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/2410.14702\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-08-03T22:06:11.108063+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:11.108063+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/pope.json",
    "content": "{\n  \"benchmark_id\": \"pope\",\n  \"name\": \"POPE\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"safety\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Polling-based Object Probing Evaluation (POPE) is a benchmark for evaluating object hallucination in Large Vision-Language Models (LVLMs). POPE addresses the problem where LVLMs generate objects inconsistent with target images by using a polling-based query method that asks yes/no questions about object presence in images, providing more stable and flexible evaluation of object hallucination.\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.10355\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.264312+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.264312+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/popqa.json",
    "content": "{\n  \"benchmark_id\": \"popqa\",\n  \"name\": \"PopQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"PopQA is an entity-centric open-domain question-answering dataset consisting of 14,000 QA pairs designed to evaluate language models' ability to memorize and recall factual knowledge across entities with varying popularity levels. The dataset probes both parametric memory (stored in model parameters) and non-parametric memory effectiveness, with questions covering 16 diverse relationship types from Wikidata converted to natural language using templates. Created by sampling knowledge triples from Wikidata and converting them to natural language questions, focusing on long-tail entities to understand LMs' strengths and limitations in memorizing factual knowledge.\",\n  \"paper_link\": \"https://arxiv.org/abs/2212.10511\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.072897+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.072897+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/qasper.json",
    "content": "{\n  \"benchmark_id\": \"qasper\",\n  \"name\": \"Qasper\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"long_context\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"QASPER is a dataset of 5,049 information-seeking questions and answers anchored in 1,585 NLP research papers. Questions are written by NLP practitioners who read only titles and abstracts, while answers require understanding the full paper text and provide supporting evidence. The dataset challenges models with complex reasoning across document sections for academic document question answering. Each question seeks information present in the full text and is answered by a separate set of NLP practitioners who also provide supporting evidence to answers.\",\n  \"paper_link\": \"https://arxiv.org/abs/2105.03011\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.166932+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.166932+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/qmsum.json",
    "content": "{\n  \"benchmark_id\": \"qmsum\",\n  \"name\": \"QMSum\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"summarization\", \"long_context\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"QMSum is a benchmark for query-based multi-domain meeting summarization consisting of 1,808 query-summary pairs over 232 meetings across academic, product, and committee domains. The dataset enables models to select and summarize relevant spans of meetings in response to specific queries. Published at NAACL 2021, QMSum presents significant challenges in long meeting summarization where models must identify and summarize relevant content based on user queries.\",\n  \"paper_link\": \"https://arxiv.org/abs/2104.05938\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.223595+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.223595+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/realworldqa.json",
    "content": "{\n  \"benchmark_id\": \"realworldqa\",\n  \"name\": \"RealWorldQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"spatial_reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"RealWorldQA is a benchmark designed to evaluate basic real-world spatial understanding capabilities of multimodal models. The initial release consists of over 700 anonymized images taken from vehicles and other real-world scenarios, each accompanied by a question and easily verifiable answer. Released by xAI as part of their Grok-1.5 Vision preview to test models' ability to understand natural scenes and spatial relationships in everyday visual contexts.\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.595271+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.595271+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/repobench.json",
    "content": "{\n  \"benchmark_id\": \"repobench\",\n  \"name\": \"RepoBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"RepoBench is a benchmark for evaluating repository-level code auto-completion systems through three interconnected tasks: RepoBench-R (retrieval of relevant code snippets across files), RepoBench-C (code completion with cross-file and in-file context), and RepoBench-P (pipeline combining retrieval and prediction). RepoBench supports Python and Java and addresses the gap in evaluating real-world, multi-file programming scenarios, enabling a more complete comparison of auto-completion systems.\",\n  \"paper_link\": \"https://arxiv.org/abs/2306.03091\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.152588+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.152588+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/repoqa.json",
    "content": "{\n  \"benchmark_id\": \"repoqa\",\n  \"name\": \"RepoQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\", \"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"RepoQA is a benchmark for evaluating long-context code understanding capabilities of Large Language Models through the Searching Needle Function (SNF) task, where LLMs must locate specific functions in code repositories using natural language descriptions. The benchmark contains 500 code search tasks spanning 50 repositories across 5 modern programming languages (Python, Java, TypeScript, C++, and Rust), tested on 26 general and code-specific LLMs to assess their ability to comprehend and navigate code repositories.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.06025\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.180278+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.180278+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/ruler.json",
    "content": "{\n  \"benchmark_id\": \"ruler\",\n  \"name\": \"RULER\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"long_context\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"RULER (What's the Real Context Size of Your Long-Context Language Models?) is a synthetic benchmark designed to comprehensively evaluate the long-context capabilities of language models. It expands on needle-in-a-haystack (NIAH) testing by introducing new task categories including multi-hop tracing and aggregation tasks. The benchmark provides flexible configurations for customized sequence length and task complexity, evaluating 17 long-context language models across 13 representative tasks to reveal that despite models claiming 32K+ token context sizes, only half maintain satisfactory performance at 32K length.\",\n  \"paper_link\": \"https://arxiv.org/abs/2404.06654\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.175181+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.175181+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/sat-math.json",
    "content": "{\n  \"benchmark_id\": \"sat-math\",\n  \"name\": \"SAT Math\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"SAT Math benchmark from AGIEval containing standardized mathematics questions from the College Board SAT examination, designed to evaluate mathematical reasoning capabilities of foundation models using human-centric assessment methods.\",\n  \"paper_link\": \"https://arxiv.org/abs/2304.06364\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.414463+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.414463+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/scale-multichallenge.json",
    "content": "{\n  \"benchmark_id\": \"scale-multichallenge\",\n  \"name\": \"Scale MultiChallenge\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"communication\", \"general\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"MultiChallenge is a realistic multi-turn conversation evaluation benchmark developed by Scale AI that evaluates large language models on four challenging conversation categories: instruction retention, inference memory of user information, reliable versioned editing, and self-coherence. Each challenge requires accurate instruction-following, context allocation, and in-context reasoning. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge.\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.17399\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.205789+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.205789+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/scicode.json",
    "content": "{\n  \"benchmark_id\": \"scicode\",\n  \"name\": \"SciCode\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"math\", \"physics\", \"chemistry\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"SciCode is a research coding benchmark curated by scientists that challenges language models to code solutions for scientific problems. It contains 338 subproblems decomposed from 80 challenging main problems across 16 natural science sub-fields including mathematics, physics, chemistry, biology, and materials science. Problems require knowledge recall, reasoning, and code synthesis skills.\",\n  \"paper_link\": \"https://arxiv.org/abs/2407.13168\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/scienceqa-visual.json",
    "content": "{\n  \"benchmark_id\": \"scienceqa-visual\",\n  \"name\": \"ScienceQA Visual\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"reasoning\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"ScienceQA Visual is a multimodal science question answering benchmark consisting of 21,208 multiple-choice questions from elementary and high school science curricula. The dataset covers 3 subjects (natural science, language science, social science), 26 topics, 127 categories, and 379 skills. 48.7% of questions include image context requiring multimodal reasoning. Questions are annotated with lectures (83.9%) and explanations (90.5%) to support chain-of-thought reasoning for science question answering.\",\n  \"paper_link\": \"https://arxiv.org/abs/2209.09513\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.300722+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.300722+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/scienceqa.json",
    "content": "{\n  \"benchmark_id\": \"scienceqa\",\n  \"name\": \"ScienceQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"math\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"ScienceQA is the first large-scale multimodal science question answering benchmark with 21,208 multiple-choice questions covering 3 subjects (natural science, language science, social science), 26 topics, 127 categories, and 379 skills. The benchmark includes both text and image modalities, featuring detailed explanations and Chain-of-Thought reasoning to diagnose multi-hop reasoning ability.\",\n  \"paper_link\": \"https://arxiv.org/abs/2209.09513\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.255251+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.255251+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/screenspot-pro.json",
    "content": "{\n  \"benchmark_id\": \"screenspot-pro\",\n  \"name\": \"ScreenSpot Pro\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"spatial_reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"ScreenSpot-Pro is a novel GUI grounding benchmark designed to rigorously evaluate the grounding capabilities of multimodal large language models (MLLMs) in professional high-resolution computing environments. The benchmark comprises 1,581 instructions across 23 applications spanning 5 industries and 3 operating systems, featuring authentic high-resolution images from professional domains with expert annotations. Unlike previous benchmarks that focus on cropped screenshots in consumer applications, ScreenSpot-Pro addresses the complexity and diversity of real-world professional software scenarios, revealing significant performance gaps in current MLLM GUI perception capabilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2504.07981\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.776671+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.776671+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/screenspot.json",
    "content": "{\n  \"benchmark_id\": \"screenspot\",\n  \"name\": \"ScreenSpot\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"spatial_reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"ScreenSpot is the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments. The dataset comprises over 1,200 instructions from iOS, Android, macOS, Windows and Web environments, along with annotated element types (text and icon/widget), designed to evaluate visual GUI agents' ability to accurately locate screen elements based on natural language instructions.\",\n  \"paper_link\": \"https://arxiv.org/abs/2401.10935\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.766976+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.766976+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/simpleqa.json",
    "content": "{\n  \"benchmark_id\": \"simpleqa\",\n  \"name\": \"SimpleQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models. The benchmark contains 4,326 short, fact-seeking questions that are adversarially collected and designed to have single, indisputable answers. Questions cover diverse topics from science and technology to entertainment, and the benchmark also measures model calibration by evaluating whether models know what they know.\",\n  \"paper_link\": \"https://arxiv.org/abs/2411.04368\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/slakevqa.json",
    "content": "{\n  \"benchmark_id\": \"slakevqa\",\n  \"name\": \"SlakeVQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"healthcare\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A semantically-labeled knowledge-enhanced dataset for medical visual question answering. Contains 642 radiology images (CT scans, MRI scans, X-rays) covering five body parts and 14,028 bilingual English-Chinese question-answer pairs annotated by experienced physicians. Features comprehensive semantic labels and a structural medical knowledge base with both vision-only and knowledge-based questions requiring external medical knowledge reasoning.\",\n  \"paper_link\": \"https://arxiv.org/abs/2102.09542\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.027646+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.027646+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/social-iqa.json",
    "content": "{\n  \"benchmark_id\": \"social-iqa\",\n  \"name\": \"Social IQa\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"psychology\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"The first large-scale benchmark for commonsense reasoning about social situations. Contains 38,000 multiple choice questions probing emotional and social intelligence in everyday situations, testing commonsense understanding of social interactions and theory of mind reasoning about the implied emotions and behavior of others.\",\n  \"paper_link\": \"https://arxiv.org/abs/1904.09728\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.155825+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.155825+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/spider.json",
    "content": "{\n  \"benchmark_id\": \"spider\",\n  \"name\": \"Spider\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. Contains 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. Requires models to generalize to both new SQL queries and new database schemas, making it distinct from previous semantic parsing tasks that use single databases.\",\n  \"paper_link\": \"https://arxiv.org/abs/1809.08887\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.156791+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.156791+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/squality.json",
    "content": "{\n  \"benchmark_id\": \"squality\",\n  \"name\": \"SQuALITY\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"summarization\", \"long_context\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"SQuALITY (Summarization-format QUestion Answering with Long Input Texts, Yes!) is a long-document summarization dataset built by hiring highly-qualified contractors to read public-domain short stories (3000-6000 words) and write original summaries from scratch. Each document has five summaries: one overview and four question-focused summaries. Designed to address limitations in existing summarization datasets by providing high-quality, faithful summaries.\",\n  \"paper_link\": \"https://arxiv.org/abs/2205.11465\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.712415+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.712415+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/stem.json",
    "content": "{\n  \"benchmark_id\": \"stem\",\n  \"name\": \"STEM\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive multimodal benchmark dataset with 448 skills and 1,073,146 questions spanning all STEM subjects (Science, Technology, Engineering, Mathematics), designed to test neural models' vision-language STEM skills based on K-12 curriculum. Unlike existing datasets that focus on expert-level ability, this dataset includes fundamental skills designed around educational standards.\",\n  \"paper_link\": \"https://arxiv.org/abs/2402.17205\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.559354+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.559354+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/summscreenfd.json",
    "content": "{\n  \"benchmark_id\": \"summscreenfd\",\n  \"name\": \"SummScreenFD\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"summarization\", \"long_context\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"SummScreenFD is the ForeverDreaming subset of the SummScreen dataset for abstractive screenplay summarization, comprising pairs of TV series transcripts and human-written recaps from 88 different shows. The dataset provides a challenging testbed for abstractive summarization where plot details are often expressed indirectly in character dialogues and scattered across the entirety of the transcript, requiring models to find and integrate these details to form succinct plot descriptions.\",\n  \"paper_link\": \"https://arxiv.org/abs/2104.07091\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.229354+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.229354+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/superglue.json",
    "content": "{\n  \"benchmark_id\": \"superglue\",\n  \"name\": \"SuperGLUE\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"language\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"SuperGLUE is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. It includes 8 primary tasks: BoolQ (Boolean Questions), CB (CommitmentBank), COPA (Choice of Plausible Alternatives), MultiRC (Multi-Sentence Reading Comprehension), ReCoRD (Reading Comprehension with Commonsense Reasoning), RTE (Recognizing Textual Entailment), WiC (Word-in-Context), and WSC (Winograd Schema Challenge). The benchmark evaluates diverse language understanding capabilities including reading comprehension, commonsense reasoning, causal reasoning, coreference resolution, textual entailment, and word sense disambiguation across multiple domains.\",\n  \"paper_link\": \"https://arxiv.org/abs/1905.00537\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.382590+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.382590+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/supergpqa.json",
    "content": "{\n  \"benchmark_id\": \"supergpqa\",\n  \"name\": \"SuperGPQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"general\", \"math\", \"legal\", \"healthcare\", \"finance\", \"chemistry\", \"economics\", \"physics\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"SuperGPQA is a comprehensive benchmark that evaluates large language models across 285 graduate-level academic disciplines. The benchmark contains 25,957 questions covering 13 broad disciplinary areas including Engineering, Medicine, Science, and Law, with specialized fields in light industry, agriculture, and service-oriented domains. It employs a Human-LLM collaborative filtering mechanism with over 80 expert annotators to create challenging questions that assess graduate-level knowledge and reasoning capabilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2502.14739\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/swe-bench-multilingual.json",
    "content": "{\n  \"benchmark_id\": \"swe-bench-multilingual\",\n  \"name\": \"SWE-bench Multilingual\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A multilingual benchmark for issue resolving in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. Contains 1,632 high-quality instances carefully annotated from 2,456 candidates by 68 expert annotators, designed to evaluate Large Language Models across diverse software ecosystems beyond Python.\",\n  \"paper_link\": \"https://arxiv.org/abs/2504.02605\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.340903+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.340903+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/swe-bench-verified-(agentic-coding).json",
    "content": "{\n  \"benchmark_id\": \"swe-bench-verified-(agentic-coding)\",\n  \"name\": \"SWE-bench Verified (Agentic Coding)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"SWE-bench Verified is a human-filtered subset of 500 software engineering problems drawn from real GitHub issues across 12 popular Python repositories. Given a codebase and an issue description, language models are tasked with generating patches that resolve the described problems. This benchmark evaluates AI's real-world agentic coding skills by requiring models to navigate complex codebases, understand software engineering problems, and coordinate changes across multiple functions, classes, and files to fix well-defined issues with clear descriptions.\",\n  \"paper_link\": \"https://arxiv.org/abs/2310.06770\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.331440+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.331440+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/swe-bench-verified-(agentless).json",
    "content": "{\n  \"benchmark_id\": \"swe-bench-verified-(agentless)\",\n  \"name\": \"SWE-bench Verified (Agentless)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A human-validated subset of SWE-bench that evaluates language models' ability to resolve real-world GitHub issues using an agentless approach. The benchmark tests models on software engineering problems requiring understanding and coordinating changes across multiple functions, classes, and files simultaneously.\",\n  \"paper_link\": \"https://arxiv.org/abs/2407.01489\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.328122+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.328122+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/swe-bench-verified-(multiple-attempts).json",
    "content": "{\n  \"benchmark_id\": \"swe-bench-verified-(multiple-attempts)\",\n  \"name\": \"SWE-bench Verified (Multiple Attempts)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"SWE-bench Verified is a human-validated subset of 500 test samples from the original SWE-bench dataset that evaluates AI systems' ability to automatically resolve real GitHub issues in Python repositories. Given a codebase and issue description, models must edit the code to successfully resolve the problem, requiring understanding and coordination of changes across multiple functions, classes, and files. The Verified version provides more reliable evaluation through manual validation of test samples.\",\n  \"paper_link\": \"https://arxiv.org/abs/2310.06770\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.336780+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.336780+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/swe-bench-verified.json",
    "content": "{\n  \"benchmark_id\": \"swe-bench-verified\",\n  \"name\": \"SWE-Bench Verified\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"frontend_development\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.\",\n  \"paper_link\": \"https://arxiv.org/abs/2310.06770\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.812805+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.812805+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/swe-dev.json",
    "content": "{\n  \"benchmark_id\": \"swe-dev\",\n  \"name\": \"SWE-Dev\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"frontend_development\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"SWE-bench development split consisting of 225 software engineering problems drawn from real GitHub issues across 12 popular Python repositories. Language models are given a codebase along with a description of an issue to be resolved and must edit the codebase to address the issue, often requiring understanding and coordinating changes across multiple functions, classes, and files.\",\n  \"paper_link\": \"https://arxiv.org/abs/2310.06770\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-15T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/swe-lancer-(ic-diamond-subset).json",
    "content": "{\n  \"benchmark_id\": \"swe-lancer-(ic-diamond-subset)\",\n  \"name\": \"SWE-Lancer (IC-Diamond subset)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"SWE-Lancer (IC-Diamond subset) is a benchmark of real-world freelance software engineering tasks from Upwork, ranging from $50 bug fixes to $32,000 feature implementations. It evaluates AI models on independent engineering tasks using end-to-end tests triple-verified by experienced software engineers, and includes managerial tasks where models choose between technical implementation proposals.\",\n  \"paper_link\": \"https://arxiv.org/abs/2502.12115\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.359574+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.359574+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/swe-lancer.json",
    "content": "{\n  \"benchmark_id\": \"swe-lancer\",\n  \"name\": \"SWE-Lancer\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A benchmark for evaluating large language models on real-world freelance software engineering tasks from Upwork. Contains over 1,400 tasks valued at $1 million USD total, ranging from $50 bug fixes to $32,000 feature implementations. Includes both independent engineering tasks graded via end-to-end tests and managerial tasks assessed against original engineering managers' choices.\",\n  \"paper_link\": \"https://arxiv.org/abs/2502.12115\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.352660+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.352660+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/tau-bench-airline.json",
    "content": "{\n  \"benchmark_id\": \"tau-bench-airline\",\n  \"name\": \"TAU-bench Airline\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"communication\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Part of τ-bench (TAU-bench), a benchmark for Tool-Agent-User interaction in real-world domains. The airline domain evaluates language agents' ability to interact with users through dynamic conversations while following domain-specific rules and using API tools. Agents must handle airline-related tasks and policies reliably.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.12045\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.993213+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.993213+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/tau-bench-retail.json",
    "content": "{\n  \"benchmark_id\": \"tau-bench-retail\",\n  \"name\": \"TAU-bench Retail\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"communication\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A benchmark for evaluating tool-agent-user interaction in retail environments. Tests language agents' ability to handle dynamic conversations with users while using domain-specific API tools and following policy guidelines. Evaluates agents on tasks like order cancellations, address changes, and order status checks through multi-turn conversations.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.12045\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.965635+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.965635+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/tau-bench.json",
    "content": "{\n  \"benchmark_id\": \"tau-bench\",\n  \"name\": \"Tau-bench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"τ-bench: A benchmark for tool-agent-user interaction in real-world domains. Tests language agents' ability to interact with users and follow domain-specific rules through dynamic conversations using API tools and policy guidelines across retail and airline domains. Evaluates consistency and reliability of agent behavior over multiple trials.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.12045\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.219001+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.219001+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/tau2-airline.json",
    "content": "{\n  \"benchmark_id\": \"tau2-airline\",\n  \"name\": \"Tau2 Airline\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"communication\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"TAU2 airline domain benchmark for evaluating conversational agents in dual-control environments where both AI agents and users interact with tools in airline customer service scenarios. Tests agent coordination, communication, and ability to guide user actions in tasks like flight booking, modifications, cancellations, and refunds.\",\n  \"paper_link\": \"https://arxiv.org/abs/2506.07982\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/tau2-retail.json",
    "content": "{\n  \"benchmark_id\": \"tau2-retail\",\n  \"name\": \"Tau2 Retail\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"communication\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"τ²-bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment where both agent and user can interact with tools. Tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.\",\n  \"paper_link\": \"https://arxiv.org/abs/2506.07982\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/tau2-telecom.json",
    "content": "{\n  \"benchmark_id\": \"tau2-telecom\",\n  \"name\": \"Tau2 Telecom\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"communication\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP, where both agent and user use tools in shared telecommunications troubleshooting scenarios that test coordination and communication capabilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2506.07982\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/tempcompass.json",
    "content": "{\n  \"benchmark_id\": \"tempcompass\",\n  \"name\": \"TempCompass\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"TempCompass is a comprehensive benchmark for evaluating temporal perception capabilities of Video Large Language Models (Video LLMs). It constructs conflicting videos that share identical static content but differ in specific temporal aspects to prevent models from exploiting single-frame bias. The benchmark evaluates multiple temporal aspects including action, motion, speed, temporal order, and attribute changes across diverse task formats including multi-choice QA, yes/no QA, caption matching, and caption generation.\",\n  \"paper_link\": \"https://arxiv.org/abs/2403.00476\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.748364+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.748364+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/terminal-bench.json",
    "content": "{\n  \"benchmark_id\": \"terminal-bench\",\n  \"name\": \"Terminal-Bench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.\",\n  \"paper_link\": null,\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/terminus.json",
    "content": "{\n  \"benchmark_id\": \"terminus\",\n  \"name\": \"Terminus\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"code\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Terminal-Bench is a benchmark for testing AI agents in real terminal environments, evaluating how well agents can handle real-world, end-to-end tasks autonomously. The benchmark includes tasks spanning coding, system administration, security, data science, model training, file operations, version control, and web development. Terminus is the neutral test-bed agent designed to work with Terminal-Bench, operating purely through tmux sessions without dedicated tools.\",\n  \"paper_link\": \"https://github.com/laude-institute/terminal-bench\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.355994+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.355994+00:00\"\n}\n"
  },
  {
    "path": "data/benchmarks/textvqa.json",
    "content": "{\n  \"benchmark_id\": \"textvqa\",\n  \"name\": \"TextVQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"image-to-text\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Introduced to benchmark VQA models' ability to read and reason about text within images, particularly for assistive technologies for visually impaired users. The dataset addresses the gap where existing VQA datasets had few text-based questions or were too small.\",\n  \"paper_link\": \"https://arxiv.org/abs/1904.08920\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.875287+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.875287+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/theoremqa.json",
    "content": "{\n  \"benchmark_id\": \"theoremqa\",\n  \"name\": \"TheoremQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\", \"physics\", \"finance\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A theorem-driven question answering dataset containing 800 high-quality questions covering 350+ theorems from Math, Physics, EE&CS, and Finance. Designed to evaluate AI models' capabilities to apply theorems to solve challenging university-level science problems.\",\n  \"paper_link\": \"https://arxiv.org/abs/2305.12524\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.479157+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.479157+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/tldr9+-(test).json",
    "content": "{\n  \"benchmark_id\": \"tldr9+-(test)\",\n  \"name\": \"TLDR9+ (test)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"summarization\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A large-scale summarization dataset containing over 9 million training instances extracted from Reddit, designed for extreme summarization (generating one-sentence summaries with high compression and abstraction). More than twice larger than previously proposed datasets.\",\n  \"paper_link\": \"https://arxiv.org/abs/2110.01159\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.439927+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.439927+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/translation-en-to-set1-comet22.json",
    "content": "{\n  \"benchmark_id\": \"translation-en\\u2192set1-comet22\",\n  \"name\": \"Translation en\\u2192Set1 COMET22\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"COMET-22 is an ensemble machine translation evaluation metric combining a COMET estimator model trained with Direct Assessments and a multitask model that predicts sentence-level scores and word-level OK/BAD tags. It demonstrates improved correlations compared to state-of-the-art metrics and increased robustness to critical errors.\",\n  \"paper_link\": \"https://aclanthology.org/2022.wmt-1.52/\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.959436+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.959436+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/translation-en-to-set1-spbleu.json",
    "content": "{\n  \"benchmark_id\": \"translation-en\\u2192set1-spbleu\",\n  \"name\": \"Translation en\\u2192Set1 spBleu\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Translation evaluation using spBLEU (SentencePiece BLEU), a BLEU metric computed over text tokenized with a language-agnostic SentencePiece subword model. Introduced in the FLORES-101 evaluation benchmark for low-resource and multilingual machine translation.\",\n  \"paper_link\": \"https://arxiv.org/abs/2106.03193\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.936891+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.936891+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/translation-set1-to-en-comet22.json",
    "content": "{\n  \"benchmark_id\": \"translation-set1\\u2192en-comet22\",\n  \"name\": \"Translation Set1\\u2192en COMET22\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"COMET-22 is a neural machine translation evaluation metric that uses an ensemble of two models: a COMET estimator trained with Direct Assessments and a multitask model that predicts sentence-level scores and word-level OK/BAD tags. It provides improved correlations with human judgments and increased robustness to critical errors compared to previous metrics.\",\n  \"paper_link\": \"https://aclanthology.org/2022.wmt-1.52/\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.974744+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.974744+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/translation-set1-to-en-spbleu.json",
    "content": "{\n  \"benchmark_id\": \"translation-set1\\u2192en-spbleu\",\n  \"name\": \"Translation Set1\\u2192en spBleu\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"spBLEU (SentencePiece BLEU) evaluation metric for machine translation quality assessment, using language-agnostic SentencePiece tokenization with BLEU scoring. Part of the FLORES-101 evaluation benchmark for low-resource and multilingual machine translation.\",\n  \"paper_link\": \"https://arxiv.org/abs/2106.03193\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.967240+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.967240+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/triviaqa.json",
    "content": "{\n  \"benchmark_id\": \"triviaqa\",\n  \"name\": \"TriviaQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A large-scale reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents (six per question on average) that provide high quality distant supervision for answering the questions. The dataset features relatively complex, compositional questions with considerable syntactic and lexical variability, requiring cross-sentence reasoning to find answers.\",\n  \"paper_link\": \"https://arxiv.org/abs/1705.03551\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.563587+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.563587+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/truthfulqa.json",
    "content": "{\n  \"benchmark_id\": \"truthfulqa\",\n  \"name\": \"TruthfulQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\", \"legal\", \"healthcare\", \"finance\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"TruthfulQA is a benchmark to measure whether language models are truthful in generating answers to questions. It comprises 817 questions that span 38 categories, including health, law, finance and politics. The questions are crafted such that some humans would answer falsely due to a false belief or misconception, testing models' ability to avoid generating false answers learned from human texts.\",\n  \"paper_link\": \"https://arxiv.org/abs/2109.07958\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.339268+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.339268+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/tydiqa.json",
    "content": "{\n  \"benchmark_id\": \"tydiqa\",\n  \"name\": \"TydiQA\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A multilingual question answering benchmark covering 11 typologically diverse languages with 204K question-answer pairs. Questions are written by people seeking genuine information and data is collected directly in each language without translation to test model generalization across diverse linguistic structures.\",\n  \"paper_link\": \"https://arxiv.org/abs/2003.05002\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.470500+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.470500+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/uniform-bar-exam.json",
    "content": "{\n  \"benchmark_id\": \"uniform-bar-exam\",\n  \"name\": \"Uniform Bar Exam\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"legal\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"The Uniform Bar Examination (UBE) benchmark evaluates language models on the complete bar exam including multiple-choice Multistate Bar Examination (MBE), open-ended Multistate Essay Exam (MEE), and Multistate Performance Test (MPT) components. Used to assess legal reasoning capabilities across seven subject areas including Evidence, Torts, Constitutional Law, Contracts, Criminal Law and Procedure, Real Property, and Civil Procedure.\",\n  \"paper_link\": \"https://royalsocietypublishing.org/doi/10.1098/rsta.2023.0254\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.404860+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.404860+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/usamo25.json",
    "content": "{\n  \"benchmark_id\": \"usamo25\",\n  \"name\": \"USAMO25\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"math\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"The 2025 United States of America Mathematical Olympiad (USAMO) benchmark consists of six challenging mathematical problems requiring rigorous proof-based reasoning. USAMO is the most prestigious high school mathematics competition in the United States, serving as the final round of the American Mathematics Competitions series. This benchmark evaluates models on mathematical problem-solving capabilities beyond simple numerical computation, focusing on formal mathematical reasoning and proof generation.\",\n  \"paper_link\": \"https://arxiv.org/abs/2503.21934\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.067604+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.067604+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/vatex.json",
    "content": "{\n  \"benchmark_id\": \"vatex\",\n  \"name\": \"VATEX\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"video\", \"language\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. Contains over 41,250 videos and 825,000 captions in both English and Chinese, with over 206,000 English-Chinese parallel translation pairs. Supports multilingual video captioning and video-guided machine translation tasks.\",\n  \"paper_link\": \"https://arxiv.org/abs/1904.03493\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.909879+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.909879+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/vcr-en-easy.json",
    "content": "{\n  \"benchmark_id\": \"vcr-en-easy\",\n  \"name\": \"VCR_en_easy\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Visual Commonsense Reasoning (VCR) benchmark that tests higher-order cognition and commonsense reasoning beyond simple object recognition. Models must answer challenging questions about images and provide rationales justifying their answers. The benchmark measures the ability to infer people's actions, goals, and mental states from visual context.\",\n  \"paper_link\": \"https://arxiv.org/abs/1811.10830\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.592175+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.592175+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/vibe-eval.json",
    "content": "{\n  \"benchmark_id\": \"vibe-eval\",\n  \"name\": \"Vibe-Eval\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"vision\", \"general\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"VIBE-Eval is a hard evaluation suite for measuring progress of multimodal language models, consisting of 269 visual understanding prompts with gold-standard responses authored by experts. The benchmark has dual objectives: vibe checking multimodal chat models for day-to-day tasks and rigorously testing frontier models, with the hard set containing >50% questions that all frontier models answer incorrectly.\",\n  \"paper_link\": \"https://arxiv.org/abs/2405.02287\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.871369+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.871369+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/video-mme-(long,-no-subtitles).json",
    "content": "{\n  \"benchmark_id\": \"video-mme-(long,-no-subtitles)\",\n  \"name\": \"Video-MME (long, no subtitles)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"video\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Video-MME is the first-ever comprehensive evaluation benchmark for Multi-modal Large Language Models (MLLMs) in video analysis. This variant focuses on long-term videos (30min-60min) without subtitle inputs, testing robust contextual dynamics across 6 primary visual domains with 30 subfields including knowledge, film & television, sports competition, life record, and multilingual content.\",\n  \"paper_link\": \"https://arxiv.org/abs/2405.21075\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.374053+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.374053+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/video-mme.json",
    "content": "{\n  \"benchmark_id\": \"video-mme\",\n  \"name\": \"Video-MME\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"vision\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Video-MME is the first-ever comprehensive evaluation benchmark of Multi-modal Large Language Models (MLLMs) in video analysis. It features 900 videos totaling 254 hours with 2,700 human-annotated question-answer pairs across 6 primary visual domains (Knowledge, Film & Television, Sports Competition, Life Record, Multilingual, and others) and 30 subfields. The benchmark evaluates models across diverse temporal dimensions (11 seconds to 1 hour), integrates multi-modal inputs including video frames, subtitles, and audio, and uses rigorous manual labeling by expert annotators for precise assessment.\",\n  \"paper_link\": \"https://arxiv.org/abs/2405.21075\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.901883+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.901883+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/video-mmew-sub.json",
    "content": "{\n  \"benchmark_id\": \"video-mmew-sub\",\n  \"name\": \"Video-MMEw sub\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"reasoning\", \"vision\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Video-MME is the first comprehensive evaluation benchmark for multi-modal large language models in video analysis. It consists of 900 videos (254 hours total) across 6 domains and 30 sub-categories, with 2,700 high-quality multiple-choice questions. The benchmark evaluates MLLMs on diverse video types of varying durations (11 seconds to 1 hour) with multi-modal inputs including video frames, subtitles, and audio to assess perception, reasoning, and temporal understanding capabilities.\",\n  \"paper_link\": \"https://arxiv.org/abs/2405.21075\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-08-03T22:06:11.276310+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:11.276310+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/videomme-w-o-sub..json",
    "content": "{\n  \"benchmark_id\": \"videomme-w-o-sub.\",\n  \"name\": \"VideoMME w/o sub.\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"video\", \"vision\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Video-MME is a comprehensive evaluation benchmark for multi-modal large language models in video analysis. It features 900 videos across 6 primary visual domains with 30 subfields, ranging from 11 seconds to 1 hour in duration, with 2,700 question-answer pairs. The benchmark evaluates MLLMs' capabilities in processing sequential visual data and multi-modal content including video frames, subtitles, and audio.\",\n  \"paper_link\": \"https://arxiv.org/abs/2405.21075\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.715184+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.715184+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/videomme-w-sub..json",
    "content": "{\n  \"benchmark_id\": \"videomme-w-sub.\",\n  \"name\": \"VideoMME w sub.\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"video\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"The first-ever comprehensive evaluation benchmark of Multi-modal LLMs in Video analysis. Features 900 videos (254 hours) with 2,700 question-answer pairs covering 6 primary visual domains and 30 subfields. Evaluates temporal understanding across short (11 seconds) to long (1 hour) videos with multi-modal inputs including video frames, subtitles, and audio.\",\n  \"paper_link\": \"https://arxiv.org/abs/2405.21075\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.723259+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.723259+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/videommmu.json",
    "content": "{\n  \"benchmark_id\": \"videommmu\",\n  \"name\": \"VideoMMMU\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"multimodal\", \"vision\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Video-MMMU evaluates Large Multimodal Models' ability to acquire knowledge from expert-level professional videos across six disciplines through three cognitive stages: perception, comprehension, and adaptation. Contains 300 videos and 900 human-annotated questions spanning Art, Business, Science, Medicine, Humanities, and Engineering.\",\n  \"paper_link\": \"https://arxiv.org/abs/2501.13826\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.007381+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.007381+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/visualwebbench.json",
    "content": "{\n  \"benchmark_id\": \"visualwebbench\",\n  \"name\": \"VisualWebBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"frontend_development\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A multimodal benchmark designed to assess the capabilities of multimodal large language models (MLLMs) across web page understanding and grounding tasks. Comprises 7 tasks (captioning, webpage QA, heading OCR, element OCR, element grounding, action prediction, and action grounding) with 1.5K human-curated instances from 139 real websites across 87 sub-domains.\",\n  \"paper_link\": \"https://arxiv.org/abs/2404.05955\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:12.747583+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:12.747583+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/vocalsound.json",
    "content": "{\n  \"benchmark_id\": \"vocalsound\",\n  \"name\": \"VocalSound\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"audio\"],\n  \"modality\": \"audio\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A dataset for improving human vocal sounds recognition, containing over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. Used for audio event classification and recognition of human non-speech vocalizations.\",\n  \"paper_link\": \"https://arxiv.org/abs/2205.03433\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.919198+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.919198+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/voicebench-avg.json",
    "content": "{\n  \"benchmark_id\": \"voicebench-avg\",\n  \"name\": \"VoiceBench Avg\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\", \"safety\", \"communication\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"VoiceBench is the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants, evaluating capabilities including general knowledge, instruction-following, reasoning, and safety using both synthetic and real spoken instruction data with diverse speaker characteristics and environmental conditions.\",\n  \"paper_link\": \"https://arxiv.org/abs/2410.17196\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.922519+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.922519+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/vqa-rad.json",
    "content": "{\n  \"benchmark_id\": \"vqa-rad\",\n  \"name\": \"VQA-Rad\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"healthcare\", \"multimodal\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"VQA-RAD (Visual Question Answering in Radiology) is the first manually constructed dataset of medical visual question answering containing 3,515 clinically generated visual questions and answers about radiology images. The dataset includes questions created by clinical trainees on 315 radiology images from MedPix covering head, chest, and abdominal scans, designed to support AI development for medical image analysis and improve patient care.\",\n  \"paper_link\": \"https://doi.org/10.1038/sdata.2018.251\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.031802+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.031802+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/vqav2-(test).json",
    "content": "{\n  \"benchmark_id\": \"vqav2-(test)\",\n  \"name\": \"VQAv2 (test)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"VQA v2.0 (Visual Question Answering v2.0) is a balanced dataset designed to counter language priors in visual question answering. It consists of complementary image pairs where the same question yields different answers, forcing models to rely on visual understanding rather than language bias. The dataset contains 1,105,904 questions across 204,721 COCO images, requiring understanding of vision, language, and commonsense knowledge.\",\n  \"paper_link\": \"https://arxiv.org/abs/1612.00837\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.430940+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.430940+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/vqav2-(val).json",
    "content": "{\n  \"benchmark_id\": \"vqav2-(val)\",\n  \"name\": \"VQAv2 (val)\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"language\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"VQAv2 is a balanced Visual Question Answering dataset containing open-ended questions about images that require understanding of vision, language, and commonsense knowledge to answer. VQAv2 addresses bias issues from the original VQA dataset by collecting complementary images such that every question is associated with similar images that result in different answers, forcing models to actually understand visual content rather than relying on language priors.\",\n  \"paper_link\": \"https://arxiv.org/abs/1612.00837\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.647852+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.647852+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/vqav2.json",
    "content": "{\n  \"benchmark_id\": \"vqav2\",\n  \"name\": \"VQAv2\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"vision\", \"multimodal\", \"reasoning\"],\n  \"modality\": \"multimodal\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"VQAv2 is a balanced Visual Question Answering dataset that addresses language bias by providing complementary images for each question, forcing models to rely on visual understanding rather than language priors. It contains approximately twice the number of image-question pairs compared to the original VQA dataset.\",\n  \"paper_link\": \"https://arxiv.org/abs/1612.00837\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:14.410411+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:14.410411+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/wild-bench.json",
    "content": "{\n  \"benchmark_id\": \"wild-bench\",\n  \"name\": \"Wild Bench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"general\", \"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"WildBench is an automated evaluation framework that benchmarks large language models using 1,024 challenging, real-world tasks selected from over one million human-chatbot conversation logs. It introduces two evaluation metrics (WB-Reward and WB-Score) that achieve high correlation with human preferences and uses task-specific checklists for systematic evaluation.\",\n  \"paper_link\": \"https://arxiv.org/abs/2406.04770\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.122112+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.122112+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/winogrande.json",
    "content": "{\n  \"benchmark_id\": \"winogrande\",\n  \"name\": \"Winogrande\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"WinoGrande: An Adversarial Winograd Schema Challenge at Scale. A large-scale dataset of 44,000 pronoun resolution problems designed to test machine commonsense reasoning. Uses adversarial filtering to reduce spurious biases and provides a more robust evaluation of whether AI systems truly understand commonsense or exploit statistical shortcuts. Current best AI methods achieve 59.4-79.1% accuracy, significantly below human performance of 94.0%.\",\n  \"paper_link\": \"https://arxiv.org/abs/1907.10641\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:11.370408+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:11.370408+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/wmt23.json",
    "content": "{\n  \"benchmark_id\": \"wmt23\",\n  \"name\": \"WMT23\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"The Eighth Conference on Machine Translation (WMT23) benchmark evaluating machine translation systems across 8 language pairs (14 translation directions) including general, biomedical, literary, and low-resource language translation tasks. Features specialized shared tasks for quality estimation, metrics evaluation, sign language translation, and discourse-level literary translation with professional human assessment.\",\n  \"paper_link\": \"https://aclanthology.org/2023.wmt-1.1/\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.934606+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.934606+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/wmt24++.json",
    "content": "{\n  \"benchmark_id\": \"wmt24++\",\n  \"name\": \"WMT24++\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"WMT24++ is a comprehensive multilingual machine translation benchmark that expands the WMT24 dataset to cover 55 languages and dialects. It includes human-written references and post-edits across four domains (literary, news, social, and speech) to evaluate machine translation systems and large language models across diverse linguistic contexts.\",\n  \"paper_link\": \"https://arxiv.org/abs/2502.12404\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.576712+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.576712+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/writingbench.json",
    "content": "{\n  \"benchmark_id\": \"writingbench\",\n  \"name\": \"WritingBench\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"writing\", \"creativity\", \"communication\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"A comprehensive benchmark for evaluating large language models' generative writing capabilities across 6 core writing domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, Advertising & Marketing) and 100 subdomains. Contains 1,239 queries with a query-dependent evaluation framework that dynamically generates 5 instance-specific assessment criteria for each writing task, using a fine-tuned critic model to score responses on style, format, and length dimensions.\",\n  \"paper_link\": \"https://arxiv.org/abs/2503.05244\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-08-03T22:06:11.074130+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:11.074130+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/xlsum-english.json",
    "content": "{\n  \"benchmark_id\": \"xlsum-english\",\n  \"name\": \"XLSum English\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"summarization\", \"language\"],\n  \"modality\": \"text\",\n  \"multilingual\": true,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"Large-scale multilingual abstractive summarization dataset comprising 1 million professionally annotated article-summary pairs from BBC, covering 44 languages. XL-Sum is highly abstractive, concise, and of high quality, designed to encourage research on multilingual abstractive summarization tasks.\",\n  \"paper_link\": \"https://arxiv.org/abs/2106.13822\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:15.092213+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:15.092213+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/xstest.json",
    "content": "{\n  \"benchmark_id\": \"xstest\",\n  \"name\": \"XSTest\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"safety\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"XSTest is a test suite designed to identify exaggerated safety behaviours in large language models. It comprises 450 prompts: 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models should refuse. The benchmark systematically evaluates whether models refuse to respond to clearly safe prompts due to overly cautious safety mechanisms.\",\n  \"paper_link\": \"https://arxiv.org/abs/2308.01263\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-07-19T19:56:13.998594+00:00\",\n  \"updated_at\": \"2025-07-19T19:56:13.998594+00:00\"\n}"
  },
  {
    "path": "data/benchmarks/zebralogic.json",
    "content": "{\n  \"benchmark_id\": \"zebralogic\",\n  \"name\": \"ZebraLogic\",\n  \"parent_benchmark_id\": null,\n  \"categories\": [\"reasoning\"],\n  \"modality\": \"text\",\n  \"multilingual\": false,\n  \"max_score\": 1.0,\n  \"language\": \"en\",\n  \"description\": \"ZebraLogic is an evaluation framework for assessing large language models' logical reasoning capabilities through logic grid puzzles derived from constraint satisfaction problems (CSPs). The benchmark consists of 1,000 programmatically generated puzzles with controllable and quantifiable complexity, revealing a 'curse of complexity' where model accuracy declines significantly as problem complexity grows.\",\n  \"paper_link\": \"https://arxiv.org/abs/2502.01100\",\n  \"implementation_link\": null,\n  \"verified\": false,\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-05T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/licenses/apache_2_0.json",
    "content": "{\n  \"license_id\": \"apache_2_0\",\n  \"name\": \"Apache 2.0\",\n  \"allow_commercial\": true,\n  \"description\": \"Apache License 2.0 - allows commercial use\",\n  \"created_at\": \"2025-07-19T19:49:05.605369+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.605369+00:00\"\n}"
  },
  {
    "path": "data/licenses/cc_by_nc.json",
    "content": "{\n  \"license_id\": \"cc_by_nc\",\n  \"name\": \"CC BY-NC\",\n  \"allow_commercial\": false,\n  \"description\": \"Creative Commons Non-Commercial\",\n  \"created_at\": \"2025-07-19T19:49:05.408956+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.408956+00:00\"\n}"
  },
  {
    "path": "data/licenses/creative_commons_attribution_4_0_license.json",
    "content": "{\n  \"license_id\": \"creative_commons_attribution_4_0_license\",\n  \"name\": \"Creative Commons Attribution 4.0 License\",\n  \"allow_commercial\": false,\n  \"description\": \"Creative Commons Attribution 4.0 License license\",\n  \"created_at\": \"2025-07-19T19:49:05.471773+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.471773+00:00\"\n}"
  },
  {
    "path": "data/licenses/deepseek.json",
    "content": "{\n  \"license_id\": \"deepseek\",\n  \"name\": \"deepseek\",\n  \"allow_commercial\": false,\n  \"description\": \"deepseek license\",\n  \"created_at\": \"2025-07-19T19:49:05.656652+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.656652+00:00\"\n}"
  },
  {
    "path": "data/licenses/gemma.json",
    "content": "{\n  \"license_id\": \"gemma\",\n  \"name\": \"Gemma\",\n  \"allow_commercial\": true,\n  \"description\": \"Google Gemma Terms of Use\",\n  \"created_at\": \"2025-07-19T19:49:05.442645+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.442645+00:00\"\n}"
  },
  {
    "path": "data/licenses/health_ai_developer_foundations_terms_of_use.json",
    "content": "{\n  \"license_id\": \"health_ai_developer_foundations_terms_of_use\",\n  \"name\": \"Health AI Developer Foundations terms of use\",\n  \"allow_commercial\": false,\n  \"description\": \"Health AI Developer Foundations terms of use license\",\n  \"created_at\": \"2025-07-19T19:49:05.510423+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.510423+00:00\"\n}"
  },
  {
    "path": "data/licenses/jamba_open_model_license.json",
    "content": "{\n  \"license_id\": \"jamba_open_model_license\",\n  \"name\": \"Jamba Open Model License\",\n  \"allow_commercial\": false,\n  \"description\": \"Jamba Open Model License license\",\n  \"created_at\": \"2025-07-19T19:49:05.763778+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.763778+00:00\"\n}"
  },
  {
    "path": "data/licenses/llama3_2.json",
    "content": "{\n  \"license_id\": \"llama3_2\",\n  \"name\": \"Llama 3.2\",\n  \"allow_commercial\": true,\n  \"description\": \"Meta Llama 3.2 Community License\",\n  \"created_at\": \"2025-07-19T19:49:05.578287+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.578287+00:00\"\n}"
  },
  {
    "path": "data/licenses/llama_3_1_community_license.json",
    "content": "{\n  \"license_id\": \"llama_3_1_community_license\",\n  \"name\": \"Llama 3.1 Community License\",\n  \"allow_commercial\": false,\n  \"description\": \"Llama 3.1 Community License license\",\n  \"created_at\": \"2025-07-19T19:49:05.574080+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.574080+00:00\"\n}"
  },
  {
    "path": "data/licenses/llama_3_2_community_license.json",
    "content": "{\n  \"license_id\": \"llama_3_2_community_license\",\n  \"name\": \"Llama 3.2 Community License\",\n  \"allow_commercial\": false,\n  \"description\": \"Llama 3.2 Community License license\",\n  \"created_at\": \"2025-07-19T19:49:05.587308+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.587308+00:00\"\n}"
  },
  {
    "path": "data/licenses/llama_3_3_community_license_agreement.json",
    "content": "{\n  \"license_id\": \"llama_3_3_community_license_agreement\",\n  \"name\": \"Llama 3.3 Community License Agreement\",\n  \"allow_commercial\": false,\n  \"description\": \"Llama 3.3 Community License Agreement license\",\n  \"created_at\": \"2025-07-19T19:49:05.602167+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.602167+00:00\"\n}"
  },
  {
    "path": "data/licenses/llama_4_community_license_agreement.json",
    "content": "{\n  \"license_id\": \"llama_4_community_license_agreement\",\n  \"name\": \"Llama 4 Community License Agreement\",\n  \"allow_commercial\": false,\n  \"description\": \"Llama 4 Community License Agreement license\",\n  \"created_at\": \"2025-07-19T19:49:05.593881+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.593881+00:00\"\n}"
  },
  {
    "path": "data/licenses/mistral_research_license.json",
    "content": "{\n  \"license_id\": \"mistral_research_license\",\n  \"name\": \"Mistral Research License\",\n  \"allow_commercial\": false,\n  \"description\": \"Mistral Research License license\",\n  \"created_at\": \"2025-07-19T19:49:05.785093+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.785093+00:00\"\n}"
  },
  {
    "path": "data/licenses/mistral_research_license_(mrl)_for_research;_mistral_commercial_license_for_commercial_use.json",
    "content": "{\n  \"license_id\": \"mistral_research_license_(mrl)_for_research;_mistral_commercial_license_for_commercial_use\",\n  \"name\": \"Mistral Research License (MRL) for research; Mistral Commercial License for commercial use\",\n  \"allow_commercial\": false,\n  \"description\": \"Mistral Research License (MRL) for research; Mistral Commercial License for commercial use license\",\n  \"created_at\": \"2025-07-19T19:49:05.911442+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.911442+00:00\"\n}"
  },
  {
    "path": "data/licenses/mit.json",
    "content": "{\n  \"license_id\": \"mit\",\n  \"name\": \"MIT\",\n  \"allow_commercial\": true,\n  \"description\": \"MIT License - allows commercial use\",\n  \"created_at\": \"2025-07-19T19:49:05.544627+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.544627+00:00\"\n}"
  },
  {
    "path": "data/licenses/mit_+_model_license_(commercial_use_allowed).json",
    "content": "{\n  \"license_id\": \"mit_+_model_license_(commercial_use_allowed)\",\n  \"name\": \"MIT + Model License (Commercial use allowed)\",\n  \"allow_commercial\": false,\n  \"description\": \"MIT + Model License (Commercial use allowed) license\",\n  \"created_at\": \"2025-07-19T19:49:05.676049+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.676049+00:00\"\n}"
  },
  {
    "path": "data/licenses/mit_license.json",
    "content": "{\n  \"license_id\": \"mit_license\",\n  \"name\": \"MIT License\",\n  \"allow_commercial\": false,\n  \"description\": \"MIT License license\",\n  \"created_at\": \"2025-07-19T19:49:05.897679+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.897679+00:00\"\n}"
  },
  {
    "path": "data/licenses/mnpl_0_1.json",
    "content": "{\n  \"license_id\": \"mnpl_0_1\",\n  \"name\": \"MNPL-0.1\",\n  \"allow_commercial\": false,\n  \"description\": \"MNPL-0.1 license\",\n  \"created_at\": \"2025-07-19T19:49:05.804469+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.804469+00:00\"\n}"
  },
  {
    "path": "data/licenses/modified_mit_license.json",
    "content": "{\n  \"license_id\": \"modified_mit_license\",\n  \"name\": \"Modified MIT License\",\n  \"allow_commercial\": false,\n  \"description\": \"Modified MIT License license\",\n  \"created_at\": \"2025-07-19T19:49:05.420757+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.420757+00:00\"\n}"
  },
  {
    "path": "data/licenses/nvidia_open_model_license_agreement.json",
    "content": "{\n  \"license_id\": \"nvidia_open_model_license_agreement\",\n  \"name\": \"NVIDIA Open Model License Agreement \",\n  \"allow_commercial\": true,\n  \"description\": \"NVIDIA Open Model License Agreement \",\n  \"created_at\": \"2025-10-02T21:51:16.835+00:00\",\n  \"updated_at\": \"2025-10-02T21:51:16.835+00:00\"\n}"
  },
  {
    "path": "data/licenses/proprietary.json",
    "content": "{\n  \"license_id\": \"proprietary\",\n  \"name\": \"Proprietary\",\n  \"allow_commercial\": false,\n  \"description\": \"Proprietary license - usage restrictions apply\",\n  \"created_at\": \"2025-07-19T19:49:05.425183+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.425183+00:00\"\n}"
  },
  {
    "path": "data/licenses/qwen.json",
    "content": "{\n  \"license_id\": \"qwen\",\n  \"name\": \"Qwen\",\n  \"allow_commercial\": true,\n  \"description\": \"Alibaba Qwen License\",\n  \"created_at\": \"2025-07-19T19:49:05.626726+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.626726+00:00\"\n}"
  },
  {
    "path": "data/licenses/tongyi_qianwen.json",
    "content": "{\n  \"license_id\": \"tongyi_qianwen\",\n  \"name\": \"tongyi-qianwen\",\n  \"allow_commercial\": false,\n  \"description\": \"tongyi-qianwen license\",\n  \"created_at\": \"2025-07-19T19:49:05.618579+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.618579+00:00\"\n}"
  },
  {
    "path": "data/licenses/unknown.json",
    "content": "{\n  \"license_id\": \"unknown\",\n  \"name\": \"Unknown\",\n  \"allow_commercial\": false,\n  \"description\": \"Unknown license\",\n  \"created_at\": \"2025-08-03T22:06:10.793734+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:10.793734+00:00\"\n}"
  },
  {
    "path": "data/organizations/ai21/models/jamba-1.5-large/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 28,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"jamba-1.5-large\",\n    \"score\": 0.93,\n    \"normalized_score\": 0.93,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.139664+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.139664+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1462,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"jamba-1.5-large\",\n    \"score\": 0.654,\n    \"normalized_score\": 0.654,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.114965+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.114965+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 338,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"jamba-1.5-large\",\n    \"score\": 0.369,\n    \"normalized_score\": 0.369,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.736664+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.736664+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1011,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"jamba-1.5-large\",\n    \"score\": 0.87,\n    \"normalized_score\": 0.87,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.109009+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.109009+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 108,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"jamba-1.5-large\",\n    \"score\": 0.812,\n    \"normalized_score\": 0.812,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Chain-of-Thought accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.302578+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.302578+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 213,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"jamba-1.5-large\",\n    \"score\": 0.535,\n    \"normalized_score\": 0.535,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Chain-of-Thought accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:11.505024+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.505024+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 144,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"jamba-1.5-large\",\n    \"score\": 0.583,\n    \"normalized_score\": 0.583,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.365684+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.365684+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 1816,\n    \"benchmark_id\": \"wild-bench\",\n    \"model_id\": \"jamba-1.5-large\",\n    \"score\": 0.485,\n    \"normalized_score\": 0.485,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.125090+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.125090+00:00\",\n    \"benchmark_name\": \"Wild Bench\"\n  }\n]"
  },
  {
    "path": "data/organizations/ai21/models/jamba-1.5-large/model.json",
    "content": "{\n  \"model_id\": \"jamba-1.5-large\",\n  \"name\": \"Jamba 1.5 Large\",\n  \"organization_id\": \"ai21\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"State-of-the-art hybrid SSM-Transformer instruction following foundation model, offering superior long context handling, speed, and quality.\",\n  \"release_date\": \"2024-08-22\",\n  \"announcement_date\": \"2024-08-22\",\n  \"license_id\": \"jamba_open_model_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2024-03-05\",\n  \"param_count\": 398000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.ai21.com/reference/jamba-15-api-ref\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://www.ai21.com/blog/announcing-jamba-model-family\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large\",\n  \"created_at\": \"2025-07-19T19:49:05.764734+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.764734+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/ai21/models/jamba-1.5-mini/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 29,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"jamba-1.5-mini\",\n    \"score\": 0.857,\n    \"normalized_score\": 0.857,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.141043+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.141043+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1463,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"jamba-1.5-mini\",\n    \"score\": 0.461,\n    \"normalized_score\": 0.461,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.117178+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.117178+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 339,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"jamba-1.5-mini\",\n    \"score\": 0.323,\n    \"normalized_score\": 0.323,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.739037+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.739037+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1012,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"jamba-1.5-mini\",\n    \"score\": 0.758,\n    \"normalized_score\": 0.758,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.110443+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.110443+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 109,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"jamba-1.5-mini\",\n    \"score\": 0.697,\n    \"normalized_score\": 0.697,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Chain-of-Thought accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.304017+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.304017+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 214,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"jamba-1.5-mini\",\n    \"score\": 0.425,\n    \"normalized_score\": 0.425,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Chain-of-Thought accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:11.506893+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.506893+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 145,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"jamba-1.5-mini\",\n    \"score\": 0.541,\n    \"normalized_score\": 0.541,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.367476+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.367476+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 1817,\n    \"benchmark_id\": \"wild-bench\",\n    \"model_id\": \"jamba-1.5-mini\",\n    \"score\": 0.424,\n    \"normalized_score\": 0.424,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.127075+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.127075+00:00\",\n    \"benchmark_name\": \"Wild Bench\"\n  }\n]"
  },
  {
    "path": "data/organizations/ai21/models/jamba-1.5-mini/model.json",
    "content": "{\n  \"model_id\": \"jamba-1.5-mini\",\n  \"name\": \"Jamba 1.5 Mini\",\n  \"organization_id\": \"ai21\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Part of the Jamba 1.5 family, a state-of-the-art hybrid SSM-Transformer instruction following foundation model offering superior long context handling, speed, and quality.\",\n  \"release_date\": \"2024-08-22\",\n  \"announcement_date\": \"2024-08-22\",\n  \"license_id\": \"jamba_open_model_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2024-03-05\",\n  \"param_count\": 52000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.ai21.com/reference/jamba-15-api-ref\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/abs/2408.12570\",\n  \"source_scorecard_blog_link\": \"https://www.ai21.com/blog/announcing-jamba-model-family\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini\",\n  \"created_at\": \"2025-07-19T19:49:05.767535+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.767535+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/ai21/organization.json",
    "content": "{\n  \"organization_id\": \"ai21\",\n  \"name\": \"AI21 Labs\",\n  \"website\": \"https://ai21.com\",\n  \"description\": \"NLP AI company\",\n  \"country\": null,\n  \"created_at\": \"2025-07-19T19:49:05.762555+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.762555+00:00\"\n}"
  },
  {
    "path": "data/organizations/amazon/models/nova-lite/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 2,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.924,\n    \"normalized_score\": 0.924,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.080108+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.080108+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 967,\n    \"benchmark_id\": \"bbh\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.824,\n    \"normalized_score\": 0.824,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.034481+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.034481+00:00\",\n    \"benchmark_name\": \"BBH\"\n  },\n  {\n    \"model_benchmark_id\": 843,\n    \"benchmark_id\": \"bfcl\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.666,\n    \"normalized_score\": 0.666,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.766776+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.766776+00:00\",\n    \"benchmark_name\": \"BFCL\"\n  },\n  {\n    \"model_benchmark_id\": 853,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.868,\n    \"normalized_score\": 0.868,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"relaxed accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.786772+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.786772+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 834,\n    \"benchmark_id\": \"crag\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.438,\n    \"normalized_score\": 0.438,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.743484+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.743484+00:00\",\n    \"benchmark_name\": \"CRAG\"\n  },\n  {\n    \"model_benchmark_id\": 876,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.924,\n    \"normalized_score\": 0.924,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"ANLS\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.827478+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.827478+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 939,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.802,\n    \"normalized_score\": 0.802,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.984716+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.984716+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 918,\n    \"benchmark_id\": \"egoschema\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.714,\n    \"normalized_score\": 0.714,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.918221+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.918221+00:00\",\n    \"benchmark_name\": \"EgoSchema\"\n  },\n  {\n    \"model_benchmark_id\": 831,\n    
\"benchmark_id\": \"finqa\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.736,\n    \"normalized_score\": 0.736,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.736609+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.736609+00:00\",\n    \"benchmark_name\": \"FinQA\"\n  },\n  {\n    \"model_benchmark_id\": 258,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.42,\n    \"normalized_score\": 0.42,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"6-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.594691+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.594691+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 841,\n    \"benchmark_id\": \"groundui-1k\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.802,\n    \"normalized_score\": 0.802,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:12.761300+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.761300+00:00\",\n    \"benchmark_name\": \"GroundUI-1K\"\n  },\n  {\n    \"model_benchmark_id\": 160,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.945,\n    \"normalized_score\": 0.945,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.407299+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.407299+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 759,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.854,\n    \"normalized_score\": 0.854,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.601822+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.601822+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 604,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.897,\n    \"normalized_score\": 0.897,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.248959+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.248959+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 826,\n    \"benchmark_id\": \"lvbench\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.404,\n    \"normalized_score\": 0.404,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.726573+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.726573+00:00\",\n    \"benchmark_name\": \"LVBench\"\n  },\n  {\n    \"model_benchmark_id\": 374,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.733,\n    \"normalized_score\": 0.733,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.810622+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.810622+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 60,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.805,\n    \"normalized_score\": 0.805,\n    
\"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.212315+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.212315+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 839,\n    \"benchmark_id\": \"mm-mind2web\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.607,\n    \"normalized_score\": 0.607,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.755878+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.755878+00:00\",\n    \"benchmark_name\": \"MM-Mind2Web\"\n  },\n  {\n    \"model_benchmark_id\": 550,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.562,\n    \"normalized_score\": 0.562,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"CoT accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.134288+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.134288+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 821,\n    \"benchmark_id\": \"squality\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.192,\n    \"normalized_score\": 0.192,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"rouge-l\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.715662+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.715662+00:00\",\n    \"benchmark_name\": \"SQuALITY\"\n  },\n  {\n    \"model_benchmark_id\": 901,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.802,\n    \"normalized_score\": 0.802,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"weighted accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.878076+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.878076+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  },\n  {\n    \"model_benchmark_id\": 930,\n    \"benchmark_id\": \"translation-en\\u2192set1-comet22\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.888,\n    \"normalized_score\": 0.888,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": 
\"COMET22\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.962491+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.962491+00:00\",\n    \"benchmark_name\": \"Translation en\\u2192Set1 COMET22\"\n  },\n  {\n    \"model_benchmark_id\": 927,\n    \"benchmark_id\": \"translation-en\\u2192set1-spbleu\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.415,\n    \"normalized_score\": 0.415,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"spBleu\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.942744+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.942744+00:00\",\n    \"benchmark_name\": \"Translation en\\u2192Set1 spBleu\"\n  },\n  {\n    \"model_benchmark_id\": 936,\n    \"benchmark_id\": \"translation-set1\\u2192en-comet22\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.888,\n    \"normalized_score\": 0.888,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"COMET22\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.977060+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.977060+00:00\",\n    \"benchmark_name\": \"Translation Set1\\u2192en COMET22\"\n  },\n  {\n    \"model_benchmark_id\": 933,\n    \"benchmark_id\": 
\"translation-set1\\u2192en-spbleu\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.431,\n    \"normalized_score\": 0.431,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"spBleu\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.969524+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.969524+00:00\",\n    \"benchmark_name\": \"Translation Set1\\u2192en spBleu\"\n  },\n  {\n    \"model_benchmark_id\": 916,\n    \"benchmark_id\": \"vatex\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.778,\n    \"normalized_score\": 0.778,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"CIDEr\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.912261+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.912261+00:00\",\n    \"benchmark_name\": \"VATEX\"\n  },\n  {\n    \"model_benchmark_id\": 837,\n    \"benchmark_id\": \"visualwebbench\",\n    \"model_id\": \"nova-lite\",\n    \"score\": 0.777,\n    \"normalized_score\": 0.777,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"composite step accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": 
null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.750738+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.750738+00:00\",\n    \"benchmark_name\": \"VisualWebBench\"\n  }\n]"
  },
  {
    "path": "data/organizations/amazon/models/nova-lite/model.json",
    "content": "{\n  \"model_id\": \"nova-lite\",\n  \"name\": \"Nova Lite\",\n  \"organization_id\": \"amazon\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A low-cost multimodal model that is lightning fast for processing images, video, documents, and text.\",\n  \"release_date\": \"2024-11-20\",\n  \"announcement_date\": \"2024-11-20\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://aws.amazon.com/bedrock/amazon-nova-lite\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.429271+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.429271+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/amazon/models/nova-micro/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 4,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.902,\n    \"normalized_score\": 0.902,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.088301+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.088301+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 969,\n    \"benchmark_id\": \"bbh\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.795,\n    \"normalized_score\": 0.795,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.038288+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.038288+00:00\",\n    \"benchmark_name\": \"BBH\"\n  },\n  {\n    \"model_benchmark_id\": 845,\n    \"benchmark_id\": \"bfcl\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.562,\n    \"normalized_score\": 0.562,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.770319+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.770319+00:00\",\n    \"benchmark_name\": \"BFCL\"\n  },\n  {\n    \"model_benchmark_id\": 836,\n    \"benchmark_id\": \"crag\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.431,\n    \"normalized_score\": 0.431,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.746657+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.746657+00:00\",\n    \"benchmark_name\": \"CRAG\"\n  },\n  {\n    \"model_benchmark_id\": 941,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.793,\n    \"normalized_score\": 0.793,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"6-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.987950+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.987950+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 833,\n    \"benchmark_id\": \"finqa\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.652,\n    \"normalized_score\": 0.652,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.740201+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.740201+00:00\",\n    \"benchmark_name\": \"FinQA\"\n  },\n  {\n    \"model_benchmark_id\": 260,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.4,\n    \"normalized_score\": 0.4,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.598530+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.598530+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 976,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.923,\n    \"normalized_score\": 0.923,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.051041+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.051041+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    
\"model_benchmark_id\": 761,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.811,\n    \"normalized_score\": 0.811,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.605066+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.605066+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 606,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.872,\n    \"normalized_score\": 0.872,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.252589+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.252589+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 376,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.693,\n    \"normalized_score\": 0.693,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": 
null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.814150+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.814150+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 62,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.776,\n    \"normalized_score\": 0.776,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.217284+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.217284+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 823,\n    \"benchmark_id\": \"squality\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.188,\n    \"normalized_score\": 0.188,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"rouge-l\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.719314+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.719314+00:00\",\n    \"benchmark_name\": \"SQuALITY\"\n  },\n  {\n    \"model_benchmark_id\": 932,\n    \"benchmark_id\": \"translation-en\\u2192set1-comet22\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.885,\n    \"normalized_score\": 0.885,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"COMET22\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.966157+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.966157+00:00\",\n    \"benchmark_name\": \"Translation en\\u2192Set1 COMET22\"\n  },\n  {\n    \"model_benchmark_id\": 929,\n    \"benchmark_id\": \"translation-en\\u2192set1-spbleu\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.402,\n    \"normalized_score\": 0.402,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"spBleu\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.958167+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.958167+00:00\",\n    \"benchmark_name\": \"Translation en\\u2192Set1 spBleu\"\n  },\n  {\n    \"model_benchmark_id\": 938,\n    \"benchmark_id\": \"translation-set1\\u2192en-comet22\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.887,\n    \"normalized_score\": 0.887,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"COMET22\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.980365+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.980365+00:00\",\n    \"benchmark_name\": \"Translation Set1\\u2192en COMET22\"\n  },\n  {\n    \"model_benchmark_id\": 935,\n    \"benchmark_id\": \"translation-set1\\u2192en-spbleu\",\n    \"model_id\": \"nova-micro\",\n    \"score\": 0.426,\n    \"normalized_score\": 0.426,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"spBleu\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.973209+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.973209+00:00\",\n    \"benchmark_name\": \"Translation Set1\\u2192en spBleu\"\n  }\n]"
  },
  {
    "path": "data/organizations/amazon/models/nova-micro/model.json",
    "content": "{\n  \"model_id\": \"nova-micro\",\n  \"name\": \"Nova Micro\",\n  \"organization_id\": \"amazon\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A text-only model that delivers lowest-latency responses at very low cost while maintaining strong performance on core language tasks. Optimized for speed and efficiency while preserving high accuracy on key benchmarks.\",\n  \"release_date\": \"2024-11-20\",\n  \"announcement_date\": \"2024-11-20\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-nova.html\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://huggingface.co/amazon-agi\",\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.435386+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.435386+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/amazon/models/nova-pro/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 3,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.948,\n    \"normalized_score\": 0.948,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.085849+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.085849+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 968,\n    \"benchmark_id\": \"bbh\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.869,\n    \"normalized_score\": 0.869,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.036192+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.036192+00:00\",\n    \"benchmark_name\": \"BBH\"\n  },\n  {\n    \"model_benchmark_id\": 844,\n    \"benchmark_id\": \"bfcl\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.684,\n    \"normalized_score\": 0.684,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.768714+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.768714+00:00\",\n    \"benchmark_name\": \"BFCL\"\n  },\n  {\n    \"model_benchmark_id\": 854,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.892,\n    \"normalized_score\": 0.892,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"relaxed accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.788270+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.788270+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 835,\n    \"benchmark_id\": \"crag\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.503,\n    \"normalized_score\": 0.503,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.744994+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.744994+00:00\",\n    \"benchmark_name\": \"CRAG\"\n  },\n  {\n    \"model_benchmark_id\": 877,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.935,\n    \"normalized_score\": 0.935,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"ANLS\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.829064+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.829064+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 940,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.854,\n    \"normalized_score\": 0.854,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.986311+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.986311+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 919,\n    \"benchmark_id\": \"egoschema\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.721,\n    \"normalized_score\": 0.721,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.920400+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.920400+00:00\",\n    \"benchmark_name\": \"EgoSchema\"\n  },\n  {\n    \"model_benchmark_id\": 832,\n    
\"benchmark_id\": \"finqa\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.772,\n    \"normalized_score\": 0.772,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.738456+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.738456+00:00\",\n    \"benchmark_name\": \"FinQA\"\n  },\n  {\n    \"model_benchmark_id\": 259,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.469,\n    \"normalized_score\": 0.469,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"6-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.596541+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.596541+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 842,\n    \"benchmark_id\": \"groundui-1k\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.814,\n    \"normalized_score\": 0.814,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:12.762846+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.762846+00:00\",\n    \"benchmark_name\": \"GroundUI-1K\"\n  },\n  {\n    \"model_benchmark_id\": 975,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.948,\n    \"normalized_score\": 0.948,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.049455+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.049455+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 760,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.89,\n    \"normalized_score\": 0.89,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.603428+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.603428+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 605,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.921,\n    \"normalized_score\": 0.921,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.250818+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.250818+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 827,\n    \"benchmark_id\": \"lvbench\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.416,\n    \"normalized_score\": 0.416,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.728104+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.728104+00:00\",\n    \"benchmark_name\": \"LVBench\"\n  },\n  {\n    \"model_benchmark_id\": 375,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.766,\n    \"normalized_score\": 0.766,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.812663+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.812663+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 61,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.859,\n    
\"normalized_score\": 0.859,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.214544+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.214544+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 840,\n    \"benchmark_id\": \"mm-mind2web\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.637,\n    \"normalized_score\": 0.637,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"step accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.757670+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.757670+00:00\",\n    \"benchmark_name\": \"MM-Mind2Web\"\n  },\n  {\n    \"model_benchmark_id\": 551,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.617,\n    \"normalized_score\": 0.617,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.135953+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:12.135953+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 822,\n    \"benchmark_id\": \"squality\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.198,\n    \"normalized_score\": 0.198,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"ROUGE-L\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.717624+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.717624+00:00\",\n    \"benchmark_name\": \"SQuALITY\"\n  },\n  {\n    \"model_benchmark_id\": 902,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.815,\n    \"normalized_score\": 0.815,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"weighted accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.880228+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.880228+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  },\n  {\n    \"model_benchmark_id\": 931,\n    \"benchmark_id\": \"translation-en\\u2192set1-comet22\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.891,\n    \"normalized_score\": 0.891,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": 
\"COMET22\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.964047+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.964047+00:00\",\n    \"benchmark_name\": \"Translation en\\u2192Set1 COMET22\"\n  },\n  {\n    \"model_benchmark_id\": 928,\n    \"benchmark_id\": \"translation-en\\u2192set1-spbleu\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.434,\n    \"normalized_score\": 0.434,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"spBleu\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.950458+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.950458+00:00\",\n    \"benchmark_name\": \"Translation en\\u2192Set1 spBleu\"\n  },\n  {\n    \"model_benchmark_id\": 937,\n    \"benchmark_id\": \"translation-set1\\u2192en-comet22\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.89,\n    \"normalized_score\": 0.89,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"COMET22\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.978787+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.978787+00:00\",\n    \"benchmark_name\": \"Translation Set1\\u2192en COMET22\"\n  },\n  {\n    \"model_benchmark_id\": 934,\n    \"benchmark_id\": 
\"translation-set1\\u2192en-spbleu\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.444,\n    \"normalized_score\": 0.444,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"spBleu\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.971295+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.971295+00:00\",\n    \"benchmark_name\": \"Translation Set1\\u2192en spBleu\"\n  },\n  {\n    \"model_benchmark_id\": 917,\n    \"benchmark_id\": \"vatex\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.778,\n    \"normalized_score\": 0.778,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"CIDEr\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.913837+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.913837+00:00\",\n    \"benchmark_name\": \"VATEX\"\n  },\n  {\n    \"model_benchmark_id\": 838,\n    \"benchmark_id\": \"visualwebbench\",\n    \"model_id\": \"nova-pro\",\n    \"score\": 0.797,\n    \"normalized_score\": 0.797,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"composite\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.752533+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.752533+00:00\",\n    \"benchmark_name\": \"VisualWebBench\"\n  }\n]"
  },
  {
    "path": "data/organizations/amazon/models/nova-pro/model.json",
    "content": "{\n  \"model_id\": \"nova-pro\",\n  \"name\": \"Nova Pro\",\n  \"organization_id\": \"amazon\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Amazon Nova Pro is a highly-capable multimodal model with state-of-the-art performance across text, image, and video understanding. It excels at core capabilities like language understanding, mathematical reasoning, and multimodal tasks while offering industry-leading speed and cost efficiency.\",\n  \"release_date\": \"2024-11-20\",\n  \"announcement_date\": \"2024-11-20\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-nova.html\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://huggingface.co/amazon-agi\",\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.431675+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.431675+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/amazon/organization.json",
    "content": "{\n  \"organization_id\": \"amazon\",\n  \"name\": \"Amazon\",\n  \"website\": \"https://aws.amazon.com\",\n  \"description\": \"Cloud and AI services\",\n  \"country\": null,\n  \"created_at\": \"2025-07-19T19:49:05.427427+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.427427+00:00\"\n}"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-5-haiku-20241022/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 958,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"claude-3-5-haiku-20241022\",\n    \"score\": 0.831,\n    \"normalized_score\": 0.831,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot F1 Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.017079+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.017079+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 331,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"claude-3-5-haiku-20241022\",\n    \"score\": 0.416,\n    \"normalized_score\": 0.416,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.725835+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.725835+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 801,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"claude-3-5-haiku-20241022\",\n    \"score\": 0.881,\n    \"normalized_score\": 0.881,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.671817+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:12.671817+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 417,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"claude-3-5-haiku-20241022\",\n    \"score\": 0.694,\n    \"normalized_score\": 0.694,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.885732+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.885732+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1292,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"claude-3-5-haiku-20241022\",\n    \"score\": 0.856,\n    \"normalized_score\": 0.856,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.705114+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.705114+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 210,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"claude-3-5-haiku-20241022\",\n    \"score\": 0.65,\n    \"normalized_score\": 0.65,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:11.499754+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.499754+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1347,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"claude-3-5-haiku-20241022\",\n    \"score\": 0.406,\n    \"normalized_score\": 0.406,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.836974+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.836974+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1771,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"claude-3-5-haiku-20241022\",\n    \"score\": 0.228,\n    \"normalized_score\": 0.228,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.997081+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.997081+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1757,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"claude-3-5-haiku-20241022\",\n    \"score\": 0.51,\n    \"normalized_score\": 0.51,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.970473+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.970473+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  }\n]"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-5-haiku-20241022/model.json",
    "content": "{\n  \"model_id\": \"claude-3-5-haiku-20241022\",\n  \"name\": \"Claude 3.5 Haiku\",\n  \"organization_id\": \"anthropic\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Claude 3.5 Haiku is Anthropic's fastest model, delivering advanced coding, tool use, and reasoning capabilities at an accessible price. It excels at user-facing products, specialized sub-agent tasks, and generating personalized experiences from large data volumes. The model is particularly well-suited for code completions, interactive chatbots, data extraction, and real-time content moderation.\",\n  \"release_date\": \"2024-10-22\",\n  \"announcement_date\": \"2024-10-22\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-5-family\",\n  \"source_playground\": \"https://claude.ai\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://www.anthropic.com/news/claude-3-5-haiku\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.744002+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.744002+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-5-sonnet-20240620/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1086,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"claude-3-5-sonnet-20240620\",\n    \"score\": 0.931,\n    \"normalized_score\": 0.931,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.259482+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.259482+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 961,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"claude-3-5-sonnet-20240620\",\n    \"score\": 0.871,\n    \"normalized_score\": 0.871,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot F1 Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.021997+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.021997+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 336,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"claude-3-5-sonnet-20240620\",\n    \"score\": 0.594,\n    \"normalized_score\": 0.594,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:11.733246+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.733246+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1010,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"claude-3-5-sonnet-20240620\",\n    \"score\": 0.964,\n    \"normalized_score\": 0.964,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.107479+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.107479+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 804,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"claude-3-5-sonnet-20240620\",\n    \"score\": 0.92,\n    \"normalized_score\": 0.92,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.676235+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.676235+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 420,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"claude-3-5-sonnet-20240620\",\n    \"score\": 0.711,\n    \"normalized_score\": 0.711,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.891344+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.891344+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1295,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"claude-3-5-sonnet-20240620\",\n    \"score\": 0.916,\n    \"normalized_score\": 0.916,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.710814+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.710814+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 107,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"claude-3-5-sonnet-20240620\",\n    \"score\": 0.904,\n    \"normalized_score\": 0.904,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-5-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.300996+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.300996+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 212,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"claude-3-5-sonnet-20240620\",\n    \"score\": 0.761,\n    \"normalized_score\": 0.761,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.503274+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.503274+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  }\n]"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-5-sonnet-20240620/model.json",
    "content": "{\n  \"model_id\": \"claude-3-5-sonnet-20240620\",\n  \"name\": \"Claude 3.5 Sonnet\",\n  \"organization_id\": \"anthropic\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Claude 3.5 Sonnet is a powerful AI model. It excels in graduate-level reasoning, undergraduate-level knowledge, and coding proficiency, with improved understanding of nuance, humor, and complex instructions.\",\n  \"release_date\": \"2024-06-21\",\n  \"announcement_date\": \"2024-06-21\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-5-family\",\n  \"source_playground\": \"https://claude.ai\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://www.anthropic.com/news/claude-3-5-sonnet\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.757926+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.757926+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-5-sonnet-20241022/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1260,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.947,\n    \"normalized_score\": 0.947,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.643744+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.643744+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 1084,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.931,\n    \"normalized_score\": 0.931,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/3-5-models-and-computer-use\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.256021+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.256021+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 872,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.908,\n    \"normalized_score\": 0.908,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test, relaxed accuracy\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.819413+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.819413+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 897,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.952,\n    \"normalized_score\": 0.952,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test, ANLS score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.867423+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.867423+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 959,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.871,\n    \"normalized_score\": 0.871,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/3-5-models-and-computer-use\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot F1 Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.018623+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.018623+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 334,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.672,\n    \"normalized_score\": 0.672,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Maj@32 5-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.730271+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.730271+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1008,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.964,\n    \"normalized_score\": 0.964,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/3-5-models-and-computer-use\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.104248+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.104248+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 802,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.937,\n    \"normalized_score\": 0.937,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/3-5-models-and-computer-use\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.673295+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.673295+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 418,\n    \"benchmark_id\": \"math\",\n    \"model_id\": 
\"claude-3-5-sonnet-20241022\",\n    \"score\": 0.783,\n    \"normalized_score\": 0.783,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/3-5-models-and-computer-use\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.887521+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.887521+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 535,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.677,\n    \"normalized_score\": 0.677,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"testmini\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.108158+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.108158+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1293,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.916,\n    \"normalized_score\": 0.916,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/3-5-models-and-computer-use\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.707042+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:13.707042+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 105,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.904,\n    \"normalized_score\": 0.904,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/3-5-models-and-computer-use\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.298011+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.298011+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 211,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.776,\n    \"normalized_score\": 0.776,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.501331+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.501331+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 584,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.683,\n    \"normalized_score\": 0.683,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"validation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.201491+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.201491+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1814,\n    \"benchmark_id\": \"osworld-extended\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.22,\n    \"normalized_score\": 0.22,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/3-5-models-and-computer-use\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.117020+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.117020+00:00\",\n    \"benchmark_name\": \"OSWorld Extended\"\n  },\n  {\n    \"model_benchmark_id\": 1813,\n    \"benchmark_id\": \"osworld-screenshot-only\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.149,\n    \"normalized_score\": 0.149,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/3-5-models-and-computer-use\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.112291+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.112291+00:00\",\n    \"benchmark_name\": \"OSWorld Screenshot-only\"\n  },\n  {\n    \"model_benchmark_id\": 1350,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.49,\n    \"normalized_score\": 0.49,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://www.anthropic.com/news/3-5-models-and-computer-use\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.842061+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.842061+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1774,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.46,\n    \"normalized_score\": 0.46,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/3-5-models-and-computer-use\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.003886+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.003886+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1760,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"score\": 0.692,\n    \"normalized_score\": 0.692,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/3-5-models-and-computer-use\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.975456+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.975456+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  }\n]"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-5-sonnet-20241022/model.json",
    "content": "{\n  \"model_id\": \"claude-3-5-sonnet-20241022\",\n  \"name\": \"Claude 3.5 Sonnet\",\n  \"organization_id\": \"anthropic\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Claude 3.5 Sonnet is a powerful AI model with industry-leading software engineering skills. It excels in coding, planning, and problem-solving, with significant improvements in agentic coding and tool use tasks. The model includes computer use capabilities in public beta, allowing it to interact with computer interfaces like a human user.\",\n  \"release_date\": \"2024-10-22\",\n  \"announcement_date\": \"2024-10-22\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.anthropic.com/en/docs/intro-to-claude#claude-3-5-family\",\n  \"source_playground\": \"https://claude.ai\",\n  \"source_paper\": \"https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf\",\n  \"source_scorecard_blog_link\": \"https://www.anthropic.com/news/claude-3-5-sonnet\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.752534+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.752534+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-7-sonnet-20250219/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 478,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"score\": 0.8,\n    \"normalized_score\": 0.8,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-7-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.007831+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.007831+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 700,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"score\": 0.548,\n    \"normalized_score\": 0.548,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Parallel test-time compute (footnotes 4, 5)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.464908+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.464908+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 332,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"score\": 0.848,\n    \"normalized_score\": 0.848,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-7-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:11.727330+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.727330+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 629,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"score\": 0.932,\n    \"normalized_score\": 0.932,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-7-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.294010+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.294010+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 512,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"score\": 0.962,\n    \"normalized_score\": 0.962,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-7-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.063685+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.063685+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    \"model_benchmark_id\": 1478,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"score\": 0.861,\n    \"normalized_score\": 0.861,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-7-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Average over 14 non-English languages (footnote 3)\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.152773+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.152773+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 582,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"score\": 0.75,\n    \"normalized_score\": 0.75,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-7-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"validation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.197283+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.197283+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1348,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"score\": 0.703,\n    \"normalized_score\": 0.703,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-7-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"With multiple parallel attempts and advanced scaffolding\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.838599+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.838599+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1772,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"score\": 0.584,\n    \"normalized_score\": 0.584,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://www.anthropic.com/news/claude-3-7-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"With prompt addendum to better utilize planning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.999875+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.999875+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1758,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"score\": 0.812,\n    \"normalized_score\": 0.812,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-7-sonnet\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"With prompt addendum to better utilize planning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.971988+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.971988+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  },\n  {\n    \"model_benchmark_id\": 653,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"score\": 0.352,\n    \"normalized_score\": 0.352,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Parallel test-time compute, Claude Code agent framework (footnotes 2, 5)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.350298+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.350298+00:00\",\n    \"benchmark_name\": \"Terminal-bench\"\n  
}\n]"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-7-sonnet-20250219/model.json",
    "content": "{\n  \"model_id\": \"claude-3-7-sonnet-20250219\",\n  \"name\": \"Claude 3.7 Sonnet\",\n  \"organization_id\": \"anthropic\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"The most intelligent Claude model and the first hybrid reasoning model on the market. Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user. Shows particularly strong improvements in coding and front-end web development.\",\n  \"release_date\": \"2025-02-24\",\n  \"announcement_date\": \"2025-02-24\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.anthropic.com/en/docs/about-claude/models/all-models\",\n  \"source_playground\": \"https://claude.ai\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://www.anthropic.com/news/claude-3-7-sonnet\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.747775+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.747775+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-haiku-20240307/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 27,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"score\": 0.892,\n    \"normalized_score\": 0.892,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"25-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.137830+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.137830+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1085,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"score\": 0.737,\n    \"normalized_score\": 0.737,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.257814+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.257814+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 960,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"score\": 0.784,\n    \"normalized_score\": 0.784,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot, F1 score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.020609+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:13.020609+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 335,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"score\": 0.333,\n    \"normalized_score\": 0.333,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.731729+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.731729+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1009,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"score\": 0.889,\n    \"normalized_score\": 0.889,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.105970+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.105970+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 53,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"score\": 0.859,\n    \"normalized_score\": 0.859,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:11.195028+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.195028+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 803,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"score\": 0.759,\n    \"normalized_score\": 0.759,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.674804+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.674804+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 419,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"score\": 0.389,\n    \"normalized_score\": 0.389,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.889123+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.889123+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1294,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"score\": 0.751,\n    \"normalized_score\": 0.751,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.709200+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.709200+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 106,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"score\": 0.752,\n    \"normalized_score\": 0.752,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-haiku\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.299416+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.299416+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  }\n]"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-haiku-20240307/model.json",
    "content": "{\n  \"model_id\": \"claude-3-haiku-20240307\",\n  \"name\": \"Claude 3 Haiku\",\n  \"organization_id\": \"anthropic\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Claude 3 Haiku is the fastest and most compact model in the Claude 3 family, designed for near-instant responsiveness. It excels at answering simple queries and requests with unmatched speed, making it ideal for seamless AI experiences that mimic human interactions.\",\n  \"release_date\": \"2024-03-13\",\n  \"announcement_date\": \"2024-03-13\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.anthropic.com/claude\",\n  \"source_playground\": \"https://claude.ai\",\n  \"source_paper\": \"https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf\",\n  \"source_scorecard_blog_link\": \"https://www.anthropic.com/news/claude-3-haiku\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.755159+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.755159+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-opus-20240229/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 25,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"score\": 0.964,\n    \"normalized_score\": 0.964,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"25-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.134917+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.134917+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1082,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"score\": 0.868,\n    \"normalized_score\": 0.868,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.252820+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.252820+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 956,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"score\": 0.831,\n    \"normalized_score\": 0.831,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot, F1 Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.013702+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:13.013702+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 329,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"score\": 0.504,\n    \"normalized_score\": 0.504,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT - Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.722913+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.722913+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1006,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"score\": 0.95,\n    \"normalized_score\": 0.95,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.101310+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.101310+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 51,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"score\": 0.954,\n    \"normalized_score\": 0.954,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n  
  \"created_at\": \"2025-07-19T19:56:11.190975+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.190975+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 799,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"score\": 0.849,\n    \"normalized_score\": 0.849,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.668395+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.668395+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 415,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"score\": 0.601,\n    \"normalized_score\": 0.601,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.882261+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.882261+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1290,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"score\": 0.907,\n    \"normalized_score\": 0.907,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.701952+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.701952+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 103,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"score\": 0.868,\n    \"normalized_score\": 0.868,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.294591+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.294591+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 208,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"score\": 0.685,\n    \"normalized_score\": 0.685,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2406.01574\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.496438+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.496438+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  }\n]"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-opus-20240229/model.json",
    "content": "{\n  \"model_id\": \"claude-3-opus-20240229\",\n  \"name\": \"Claude 3 Opus\",\n  \"organization_id\": \"anthropic\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Claude 3 Opus is Anthropic's most intelligent model, with best-in-market performance on highly complex tasks. It can navigate open-ended prompts and sight-unseen scenarios with remarkable fluency and human-like understanding, showing the outer limits of what's possible with generative AI.\",\n  \"release_date\": \"2024-02-29\",\n  \"announcement_date\": \"2024-02-29\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.anthropic.com/claude\",\n  \"source_playground\": \"https://claude.ai\",\n  \"source_paper\": \"https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf\",\n  \"source_scorecard_blog_link\": \"https://www.anthropic.com/news/claude-3-family\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.738279+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.738279+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-sonnet-20240229/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 26,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"score\": 0.932,\n    \"normalized_score\": 0.932,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"25-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.136363+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.136363+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1083,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"score\": 0.829,\n    \"normalized_score\": 0.829,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.254531+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.254531+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 957,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"score\": 0.789,\n    \"normalized_score\": 0.789,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot, F1 score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.015601+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.015601+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 330,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"score\": 0.404,\n    \"normalized_score\": 0.404,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT - Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.724379+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.724379+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1007,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"score\": 0.923,\n    \"normalized_score\": 0.923,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.102758+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.102758+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 52,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"score\": 0.89,\n    \"normalized_score\": 0.89,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.193193+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.193193+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 800,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"score\": 0.73,\n    \"normalized_score\": 0.73,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.670119+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.670119+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 416,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"score\": 0.431,\n    \"normalized_score\": 0.431,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.884160+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.884160+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1291,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"score\": 0.835,\n    \"normalized_score\": 0.835,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.703593+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.703593+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 104,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"score\": 0.79,\n    \"normalized_score\": 0.79,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-3-family\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.296409+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.296409+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 209,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"score\": 0.568,\n    \"normalized_score\": 0.568,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2406.01574\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.498008+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.498008+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  }\n]"
  },
  {
    "path": "data/organizations/anthropic/models/claude-3-sonnet-20240229/model.json",
    "content": "{\n  \"model_id\": \"claude-3-sonnet-20240229\",\n  \"name\": \"Claude 3 Sonnet\",\n  \"organization_id\": \"anthropic\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Claude 3 Sonnet strikes the ideal balance between intelligence and speed\\u2014particularly for enterprise workloads. It delivers strong performance at a lower cost compared to its peers, and is engineered for high endurance in large-scale AI deployments.\",\n  \"release_date\": \"2024-02-29\",\n  \"announcement_date\": \"2024-02-29\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.anthropic.com/claude\",\n  \"source_playground\": \"https://claude.ai\",\n  \"source_paper\": \"https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf\",\n  \"source_scorecard_blog_link\": \"https://www.anthropic.com/news/claude-3-family\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.740647+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.740647+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/anthropic/models/claude-haiku-4-5-20251015/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 22228,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"score\": 0.733,\n    \"normalized_score\": 0.733,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-haiku-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 22229,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"score\": 0.41,\n    \"normalized_score\": 0.41,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-haiku-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Terminal-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 22230,\n    \"benchmark_id\": \"tau2-retail\",\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"score\": 0.832,\n    \"normalized_score\": 0.832,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-haiku-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 Retail\"\n  },\n  {\n    \"model_benchmark_id\": 22231,\n    \"benchmark_id\": \"tau2-airline\",\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"score\": 0.636,\n    \"normalized_score\": 0.636,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-haiku-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 Airline\"\n  },\n  {\n    \"model_benchmark_id\": 22232,\n    \"benchmark_id\": \"tau2-telecom\",\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"score\": 0.83,\n    \"normalized_score\": 0.83,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-haiku-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 Telecom\"\n  },\n  {\n    \"model_benchmark_id\": 22233,\n    \"benchmark_id\": \"osworld\",\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"score\": 0.507,\n    \"normalized_score\": 0.507,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-haiku-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"OSWorld\"\n  },\n  {\n    \"model_benchmark_id\": 22234,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"score\": 0.963,\n    \"normalized_score\": 0.963,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-haiku-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"python\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 22235,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"score\": 0.807,\n    \"normalized_score\": 0.807,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-haiku-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"no tools\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 22236,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"score\": 0.73,\n    \"normalized_score\": 0.73,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-haiku-4-5\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"Diamond subset\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 22237,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"score\": 0.83,\n    \"normalized_score\": 0.83,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-haiku-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 22238,\n    \"benchmark_id\": \"mmmu-(validation)\",\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"score\": 0.732,\n    \"normalized_score\": 0.732,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-haiku-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMMU (validation)\"\n  },\n  {\n    \"model_benchmark_id\": 22239,\n    \"benchmark_id\": \"cybersecurity-ctfs\",\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"score\": 0.46875,\n    \"normalized_score\": 0.46875,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://assets.anthropic.com/m/99128ddd009bdcb/original/Claude-Haiku-4-5-System-Card.pdf\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"32-challenge subset\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"15/32 challenges solved (pass@30)\",\n    \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Cybersecurity CTFs\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/anthropic/models/claude-haiku-4-5-20251015/model.json",
    "content": "{\n  \"model_id\": \"claude-haiku-4-5-20251015\",\n  \"name\": \"Claude Haiku 4.5\",\n  \"organization_id\": \"anthropic\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Claude Haiku 4.5 is Anthropic's fastest, most cost-efficient model, matching Sonnet 4's performance on coding, computer use, and agent tasks. It offers similar performance to Sonnet 4 at one-third the cost and more than twice the speed, making it ideal for high-volume, latency-sensitive applications and multi-agent orchestration.\",\n  \"release_date\": \"2025-10-15\",\n  \"announcement_date\": \"2025-10-15\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2025-02-01\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.anthropic.com/en/docs/about-claude/models\",\n  \"source_playground\": \"https://claude.ai\",\n  \"source_paper\": \"https://assets.anthropic.com/m/99128ddd009bdcb/original/Claude-Haiku-4-5-System-Card.pdf\",\n  \"source_scorecard_blog_link\": \"https://www.anthropic.com/news/claude-haiku-4-5\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/anthropic/models/claude-opus-4-1-20250805/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 2001,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"claude-opus-4-1-20250805\",\n    \"score\": 0.745,\n    \"normalized_score\": 0.745,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-opus-4-1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"No extended thinking. Simple scaffold with bash tool and file editing tool via string replacements. Scores reported out of full 500 problems.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 2002,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"claude-opus-4-1-20250805\",\n    \"score\": 0.433,\n    \"normalized_score\": 0.433,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-opus-4-1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"No extended thinking. 
Terminus 1 averaged over 5 trials.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Terminal-bench\"\n  },\n  {\n    \"model_benchmark_id\": 2003,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"claude-opus-4-1-20250805\",\n    \"score\": 0.809,\n    \"normalized_score\": 0.809,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-opus-4-1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond: Extended thinking (up to 64K tokens)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 2004,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"claude-opus-4-1-20250805\",\n    \"score\": 0.824,\n    \"normalized_score\": 0.824,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-opus-4-1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking with tool use (up to 64K tokens, prompt addendum, increased max steps from 30 to 100).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  },\n  {\n    \"model_benchmark_id\": 2005,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": 
\"claude-opus-4-1-20250805\",\n    \"score\": 0.56,\n    \"normalized_score\": 0.56,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-opus-4-1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking with tool use (up to 64K tokens, prompt addendum, increased max steps from 30 to 100).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 2006,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"claude-opus-4-1-20250805\",\n    \"score\": 0.895,\n    \"normalized_score\": 0.895,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-opus-4-1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking (up to 64K tokens). 
Average over 14 non-English languages.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 2007,\n    \"benchmark_id\": \"mmmu-(validation)\",\n    \"model_id\": \"claude-opus-4-1-20250805\",\n    \"score\": 0.771,\n    \"normalized_score\": 0.771,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-opus-4-1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking (up to 64K tokens)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMMU (validation)\"\n  },\n  {\n    \"model_benchmark_id\": 2008,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"claude-opus-4-1-20250805\",\n    \"score\": 0.78,\n    \"normalized_score\": 0.78,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-opus-4-1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking (up to 64K tokens). AIME 2025 using nucleus sampling with a top_p of 0.95.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/anthropic/models/claude-opus-4-1-20250805/model.json",
    "content": "{\n  \"model_id\": \"claude-opus-4-1-20250805\",\n  \"name\": \"Claude Opus 4.1\",\n  \"organization_id\": \"anthropic\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Claude Opus 4.1 is a hybrid reasoning model that pushes the frontier for coding and AI agents, featuring a 200K context window. It delivers superior performance and precision for real-world coding and agentic tasks, handling complex multi-step problems with rigor and attention to detail. With extended thinking capabilities, it offers instant responses or extended step-by-step thinking visible through user-friendly summaries. It advances state-of-the-art coding performance to 74.5% on SWE-bench Verified, excels at agentic search and research, and produces human-quality content with exceptional writing abilities. It supports 32K output tokens and adapts to specific coding styles while delivering exceptional quality for extensive generation and refactoring projects.\",\n  \"release_date\": \"2025-08-05\",\n  \"announcement_date\": \"2025-08-05\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.anthropic.com/en/docs/about-claude/models/all-models\",\n  \"source_playground\": \"https://claude.ai\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://www.anthropic.com/news/claude-opus-4-1\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/anthropic/models/claude-opus-4-20250514/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 702,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"score\": 0.755,\n    \"normalized_score\": 0.755,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking (up to 64K tokens) with parallel test-time compute (multiple attempts, internal scoring model selection). Nucleus sampling (top_p 0.95). Based on footnotes 4, 5 and blog appendix.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.468994+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.468994+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1388,\n    \"benchmark_id\": \"arc-agi-v2\",\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"score\": 0.086,\n    \"normalized_score\": 0.086,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.923803+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.923803+00:00\",\n    \"benchmark_name\": \"ARC-AGI v2\"\n  },\n  {\n    \"model_benchmark_id\": 337,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"score\": 0.796,\n    \"normalized_score\": 0.796,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond: Extended thinking (up to 64K tokens) with parallel 
test-time compute (multiple attempts, internal scoring model selection). Based on footnote 5 and blog appendix.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.734764+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.734764+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1480,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"score\": 0.888,\n    \"normalized_score\": 0.888,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking (up to 64K tokens). Average over 14 non-English languages. Based on blog appendix and footnote 3.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.155829+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.155829+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1815,\n    \"benchmark_id\": \"mmmu-(validation)\",\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"score\": 0.765,\n    \"normalized_score\": 0.765,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking (up to 64K tokens). 
Based on blog appendix.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.120938+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.120938+00:00\",\n    \"benchmark_name\": \"MMMU (validation)\"\n  },\n  {\n    \"model_benchmark_id\": 1351,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"score\": 0.725,\n    \"normalized_score\": 0.725,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Parallel test-time compute (multiple attempts, internal scoring model selection). No extended thinking. Based on footnote 5 and SWE-bench methodology for high compute.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.843719+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.843719+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1775,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"score\": 0.596,\n    \"normalized_score\": 0.596,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking with tool use (up to 64K tokens, prompt addendum, increased max steps). 
Based on blog appendix and TAU-bench methodology.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.005622+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.005622+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1761,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"score\": 0.814,\n    \"normalized_score\": 0.814,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking with tool use (up to 64K tokens, prompt addendum, increased max steps). Based on blog appendix and TAU-bench methodology.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.977090+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.977090+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  },\n  {\n    \"model_benchmark_id\": 655,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"score\": 0.392,\n    \"normalized_score\": 0.392,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Parallel test-time compute (multiple attempts, internal scoring model selection). No extended thinking. Claude Code as agent framework. 
Based on footnotes 2 and 5.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.354970+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.354970+00:00\",\n    \"benchmark_name\": \"Terminal-bench\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/anthropic/models/claude-opus-4-20250514/model.json",
    "content": "{\n  \"model_id\": \"claude-opus-4-20250514\",\n  \"name\": \"Claude Opus 4\",\n  \"organization_id\": \"anthropic\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Claude Opus 4 is Anthropic's most powerful model and the world's best coding model, part of the Claude 4 family. It delivers sustained performance on complex, long-running tasks and agent workflows. Opus 4 excels at coding, advanced reasoning, and can use tools (like web search) during extended thinking. It supports parallel tool execution and has improved memory capabilities.\",\n  \"release_date\": \"2025-05-22\",\n  \"announcement_date\": \"2025-05-22\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.anthropic.com/en/docs/about-claude/models/all-models\",\n  \"source_playground\": \"https://claude.ai\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://www.anthropic.com/news/claude-4\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.760983+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.760983+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/anthropic/models/claude-sonnet-4-20250514/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 701,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"claude-sonnet-4-20250514\",\n    \"score\": 0.705,\n    \"normalized_score\": 0.705,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking (up to 64K tokens) with parallel test-time compute (multiple attempts, internal scoring model selection). Nucleus sampling (top_p 0.95). Based on footnotes 4, 5 and blog appendix.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.466833+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.466833+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 333,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"claude-sonnet-4-20250514\",\n    \"score\": 0.754,\n    \"normalized_score\": 0.754,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond: Extended thinking (up to 64K tokens) with parallel test-time compute (multiple attempts, internal scoring model selection). 
Based on footnote 5 and blog appendix.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.728759+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.728759+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1479,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"claude-sonnet-4-20250514\",\n    \"score\": 0.865,\n    \"normalized_score\": 0.865,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking (up to 64K tokens). Average over 14 non-English languages. Based on blog appendix and footnote 3.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.154357+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.154357+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 583,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"claude-sonnet-4-20250514\",\n    \"score\": 0.744,\n    \"normalized_score\": 0.744,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking (up to 64K tokens). 
Based on blog appendix.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.199608+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.199608+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1349,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"claude-sonnet-4-20250514\",\n    \"score\": 0.727,\n    \"normalized_score\": 0.727,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Parallel test-time compute (multiple attempts, internal scoring model selection). No extended thinking. Based on footnote 5 and SWE-bench methodology for high compute.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.840540+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.840540+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1773,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"claude-sonnet-4-20250514\",\n    \"score\": 0.6,\n    \"normalized_score\": 0.6,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking with tool use (up to 64K tokens, prompt addendum, increased max steps). 
Based on blog appendix and TAU-bench methodology.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.002282+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.002282+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1759,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"claude-sonnet-4-20250514\",\n    \"score\": 0.805,\n    \"normalized_score\": 0.805,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Extended thinking with tool use (up to 64K tokens, prompt addendum, increased max steps). Based on blog appendix and TAU-bench methodology.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.973668+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.973668+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  },\n  {\n    \"model_benchmark_id\": 654,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"claude-sonnet-4-20250514\",\n    \"score\": 0.355,\n    \"normalized_score\": 0.355,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Parallel test-time compute (multiple attempts, internal scoring model selection). No extended thinking. Claude Code as agent framework. 
Based on footnotes 2 and 5.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.353338+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.353338+00:00\",\n    \"benchmark_name\": \"Terminal-bench\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/anthropic/models/claude-sonnet-4-20250514/model.json",
    "content": "{\n  \"model_id\": \"claude-sonnet-4-20250514\",\n  \"name\": \"Claude Sonnet 4\",\n  \"organization_id\": \"anthropic\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Claude Sonnet 4, part of the Claude 4 family, is a significant upgrade to Claude Sonnet 3.7. It excels in coding (72.7% on SWE-bench) and reasoning, responding more precisely to instructions. Sonnet 4 offers an optimal mix of capability and practicality, with enhanced steerability, and supports extended thinking with tool use.\",\n  \"release_date\": \"2025-05-22\",\n  \"announcement_date\": \"2025-05-22\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.anthropic.com/en/docs/about-claude/models/all-models\",\n  \"source_playground\": \"https://claude.ai\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://www.anthropic.com/news/claude-4\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.750182+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.750182+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/anthropic/models/claude-sonnet-4-5-20250929/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 701,\n    \"benchmark_id\": \"swe-bench-verified-(agentic-coding)\",\n    \"model_id\": \"claude-sonnet-4-5-20250929\",\n    \"score\": 0.772,\n    \"normalized_score\": 0.772,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-sonnet-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic coding\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"updated_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"benchmark_name\": \"SWE-bench Verified (Agentic Coding)\"\n  },\n  {\n    \"model_benchmark_id\": 702,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"claude-sonnet-4-5-20250929\",\n    \"score\": 0.5,\n    \"normalized_score\": 0.5,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-sonnet-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic terminal coding\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"updated_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"benchmark_name\": \"Terminal-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 703,\n    \"benchmark_id\": \"osworld\",\n    \"model_id\": \"claude-sonnet-4-5-20250929\",\n    \"score\": 0.614,\n    \"normalized_score\": 0.614,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-sonnet-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Computer use\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"updated_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"benchmark_name\": \"OSWorld\"\n  },\n  {\n    \"model_benchmark_id\": 704,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"claude-sonnet-4-5-20250929\",\n    \"score\": 0.87,\n    \"normalized_score\": 0.87,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-sonnet-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"High school math competition\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"updated_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 705,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"claude-sonnet-4-5-20250929\",\n    \"score\": 0.834,\n    \"normalized_score\": 0.834,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-sonnet-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Graduate-level reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"updated_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 706,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"claude-sonnet-4-5-20250929\",\n    \"score\": 0.891,\n    \"normalized_score\": 0.891,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-sonnet-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Multilingual 
Q&A\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"updated_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 707,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"claude-sonnet-4-5-20250929\",\n    \"score\": 0.862,\n    \"normalized_score\": 0.862,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-sonnet-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic tool use\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"updated_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  },\n  {\n    \"model_benchmark_id\": 708,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"claude-sonnet-4-5-20250929\",\n    \"score\": 0.7,\n    \"normalized_score\": 0.7,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.anthropic.com/news/claude-sonnet-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic tool use\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"updated_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 710,\n    \"benchmark_id\": \"mmmuval\",\n    \"model_id\": \"claude-sonnet-4-5-20250929\",\n    \"score\": 0.778,\n    \"normalized_score\": 0.778,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://www.anthropic.com/news/claude-sonnet-4-5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Visual reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"updated_at\": \"2025-09-29T19:56:12.466833+00:00\",\n    \"benchmark_name\": \"MMMUval\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/anthropic/models/claude-sonnet-4-5-20250929/model.json",
    "content": "{\n  \"model_id\": \"claude-sonnet-4-5-20250929\",\n  \"name\": \"Claude Sonnet 4.5\",\n  \"organization_id\": \"anthropic\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Claude Sonnet 4.5 is the best coding model in the world. It's the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains in reasoning and math. Highest intelligence across most tasks with exceptional agent and coding capabilities.\",\n  \"release_date\": \"2025-09-29\",\n  \"announcement_date\": \"2025-09-29\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2025-01-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.anthropic.com/en/docs/about-claude/models/all-models\",\n  \"source_playground\": \"https://claude.ai\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://www.anthropic.com/news/claude-sonnet-4-5\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.750182+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.750182+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/anthropic/organization.json",
    "content": "{\n  \"organization_id\": \"anthropic\",\n  \"name\": \"Anthropic\",\n  \"website\": \"https://anthropic.com\",\n  \"description\": \"AI safety company\",\n  \"country\": \"US\",\n  \"created_at\": \"2025-07-19T19:49:05.736520+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.736520+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/cohere/models/command-r-plus-04-2024/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"command-r-plus-04-2024\",\n    \"score\": 0.7099,\n    \"normalized_score\": 0.7099,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/CohereForAI/c4ai-command-r-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standardized Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.062949+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.062949+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 157,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"command-r-plus-04-2024\",\n    \"score\": 0.707,\n    \"normalized_score\": 0.707,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/CohereForAI/c4ai-command-r-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standardized Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.401017+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.401017+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 32,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"command-r-plus-04-2024\",\n    \"score\": 0.886,\n    \"normalized_score\": 0.886,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/CohereForAI/c4ai-command-r-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standardized Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:11.149067+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.149067+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 56,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"command-r-plus-04-2024\",\n    \"score\": 0.757,\n    \"normalized_score\": 0.757,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/CohereForAI/c4ai-command-r-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standardized Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.202939+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.202939+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 131,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"command-r-plus-04-2024\",\n    \"score\": 0.563,\n    \"normalized_score\": 0.563,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/CohereForAI/c4ai-command-r-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standardized Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.341733+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.341733+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 147,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"command-r-plus-04-2024\",\n    \"score\": 0.854,\n    \"normalized_score\": 0.854,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/CohereForAI/c4ai-command-r-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standardized Evaluation\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.378573+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.378573+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/cohere/models/command-r-plus-04-2024/model.json",
    "content": "{\n  \"model_id\": \"command-r-plus-04-2024\",\n  \"name\": \"Command R+\",\n  \"organization_id\": \"cohere\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"C4AI Command R+ is a 104 billion parameter model with advanced capabilities, including Retrieval Augmented Generation (RAG) and multi-step tool use, optimized for multilingual tasks.\",\n  \"release_date\": \"2024-08-30\",\n  \"announcement_date\": \"2024-08-30\",\n  \"license_id\": \"cc_by_nc\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 104000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.cohere.com/v2/docs/command-r-plus\",\n  \"source_playground\": \"https://huggingface.co/CohereForAI/c4ai-command-r-plus\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://huggingface.co/CohereForAI/c4ai-command-r-plus\",\n  \"source_weights_link\": \"\",\n  \"created_at\": \"2025-07-19T19:49:05.415748+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.415748+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/cohere/organization.json",
    "content": "{\n  \"organization_id\": \"cohere\",\n  \"name\": \"Cohere\",\n  \"website\": \"https://cohere.ai\",\n  \"description\": \"Enterprise AI company\",\n  \"country\": \"CA\",\n  \"created_at\": \"2025-07-19T19:49:05.404836+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.404836+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1/benchmarks.json",
    "content": "[]\n"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1/model.json",
    "content": "{\n  \"model_id\": \"deepseek-r1\",\n  \"name\": \"DeepSeek-R1\",\n  \"organization_id\": \"deepseek\",\n  \"model_family_id\": null,\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"DeepSeek-R1 is a reasoning-focused language model from DeepSeek that features advanced thinking capabilities. It serves as the foundation for DeepSeek's reasoning model family and pioneered their thinking mode approach for complex problem-solving tasks.\",\n  \"release_date\": \"2025-01-20\",\n  \"announcement_date\": \"2025-01-20\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 671000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": false,\n  \"source_api_ref\": \"https://api.deepseek.com/docs\",\n  \"source_playground\": \"https://chat.deepseek.com/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://www.deepseek.com/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1\",\n  \"created_at\": \"2025-01-20T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-0528/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 9601,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.934,\n    \"normalized_score\": 0.934,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 9602,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.85,\n    \"normalized_score\": 0.85,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 9603,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.81,\n    \"normalized_score\": 0.81,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    
\"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9604,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.177,\n    \"normalized_score\": 0.177,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Thinking mode, text-only subset\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Text-only subset evaluation\",\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 9605,\n    \"benchmark_id\": \"browsecomp\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.089,\n    \"normalized_score\": 0.089,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Search agent with pre-defined workflow\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Evaluated with pre-defined workflow\",\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp\"\n  },\n  {\n    \"model_benchmark_id\": 9606,\n    \"benchmark_id\": \"browsecomp-zh\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.357,\n    \"normalized_score\": 0.357,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Search agent 
with pre-defined workflow\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Evaluated with pre-defined workflow\",\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp-zh\"\n  },\n  {\n    \"model_benchmark_id\": 9607,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.923,\n    \"normalized_score\": 0.923,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Search agent evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 9608,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.733,\n    \"normalized_score\": 0.733,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, 2408-2505, Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 9609,\n    \"benchmark_id\": \"codeforces\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.6433,\n    \"normalized_score\": 0.6433,\n    
\"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Div1 Rating, Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Codeforces\"\n  },\n  {\n    \"model_benchmark_id\": 9610,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.716,\n    \"normalized_score\": 0.716,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 9611,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.446,\n    \"normalized_score\": 0.446,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agent mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Evaluated with internal code agent framework\",\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    
\"model_benchmark_id\": 9612,\n    \"benchmark_id\": \"swe-bench-multilingual\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.305,\n    \"normalized_score\": 0.305,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agent mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Evaluated with internal code agent framework\",\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-Bench Multilingual\"\n  },\n  {\n    \"model_benchmark_id\": 9613,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.057,\n    \"normalized_score\": 0.057,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Terminus 1 framework\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Terminal-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 9614,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.914,\n    \"normalized_score\": 0.914,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n 
   \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 9615,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.875,\n    \"normalized_score\": 0.875,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 9616,\n    \"benchmark_id\": \"hmmt-2025\",\n    \"model_id\": \"deepseek-r1-0528\",\n    \"score\": 0.794,\n    \"normalized_score\": 0.794,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"HMMT 2025\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-0528/model.json",
    "content": "{\n  \"model_id\": \"deepseek-r1-0528\",\n  \"name\": \"DeepSeek-R1-0528\",\n  \"organization_id\": \"deepseek\",\n  \"model_family_id\": null,\n  \"fine_tuned_from_model_id\": \"deepseek-r1\",\n  \"description\": \"DeepSeek-R1-0528 is the May 28, 2025 version of DeepSeek's reasoning model. It features advanced thinking capabilities and serves as a benchmark comparison for newer models like DeepSeek-V3.1. This model excels in complex reasoning tasks, mathematical problem-solving, and code generation through its thinking mode approach.\",\n  \"release_date\": \"2025-05-28\",\n  \"announcement_date\": \"2025-05-28\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 671000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": false,\n  \"source_api_ref\": \"https://api.deepseek.com/docs\",\n  \"source_playground\": \"https://chat.deepseek.com/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://www.deepseek.com/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1\",\n  \"created_at\": \"2025-05-28T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-distill-llama-70b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 467,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"deepseek-r1-distill-llama-70b\",\n    \"score\": 0.867,\n    \"normalized_score\": 0.867,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Cons@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.987242+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.989505+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 315,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"deepseek-r1-distill-llama-70b\",\n    \"score\": 0.652,\n    \"normalized_score\": 0.652,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond, Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.700874+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.700874+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1135,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"deepseek-r1-distill-llama-70b\",\n    \"score\": 0.575,\n    \"normalized_score\": 0.575,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:13.386337+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.386337+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 503,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"deepseek-r1-distill-llama-70b\",\n    \"score\": 0.945,\n    \"normalized_score\": 0.945,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.048302+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.048302+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-distill-llama-70b/model.json",
    "content": "{\n  \"model_id\": \"deepseek-r1-distill-llama-70b\",\n  \"name\": \"DeepSeek R1 Distill Llama 70B\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"DeepSeek-R1 is the first-generation reasoning model built atop DeepSeek-V3 (671B total parameters, 37B activated per token). It incorporates large-scale reinforcement learning (RL) to enhance its chain-of-thought and reasoning capabilities, delivering strong performance in math, code, and multi-step reasoning tasks.\",\n  \"release_date\": \"2025-01-20\",\n  \"announcement_date\": \"2025-01-20\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 70600000000,\n  \"training_tokens\": 14800000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://api-docs.deepseek.com/news/news250120\",\n  \"source_playground\": \"https://chat.deepseek.com\",\n  \"source_paper\": \"https://arxiv.org/pdf/2501.12948\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-R1\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B\",\n  \"created_at\": \"2025-07-19T19:49:05.685839+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.685839+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-distill-llama-8b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 465,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"deepseek-r1-distill-llama-8b\",\n    \"score\": 0.8,\n    \"normalized_score\": 0.8,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Cons@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.984093+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.985582+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 314,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"deepseek-r1-distill-llama-8b\",\n    \"score\": 0.49,\n    \"normalized_score\": 0.49,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond, Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.699365+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.699365+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1134,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"deepseek-r1-distill-llama-8b\",\n    \"score\": 0.396,\n    \"normalized_score\": 0.396,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:13.384499+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.384499+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 502,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"deepseek-r1-distill-llama-8b\",\n    \"score\": 0.891,\n    \"normalized_score\": 0.891,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.046427+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.046427+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-distill-llama-8b/model.json",
    "content": "{\n  \"model_id\": \"deepseek-r1-distill-llama-8b\",\n  \"name\": \"DeepSeek R1 Distill Llama 8B\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"DeepSeek-R1 is the first-generation reasoning model built atop DeepSeek-V3 (671B total parameters, 37B activated per token). It incorporates large-scale reinforcement learning (RL) to enhance its chain-of-thought and reasoning capabilities, delivering strong performance in math, code, and multi-step reasoning tasks.\",\n  \"release_date\": \"2025-01-20\",\n  \"announcement_date\": \"2025-01-20\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 8030000000,\n  \"training_tokens\": 14800000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://api-docs.deepseek.com/news/news250120\",\n  \"source_playground\": \"https://chat.deepseek.com\",\n  \"source_paper\": \"https://arxiv.org/pdf/2501.12948\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-R1\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B\",\n  \"created_at\": \"2025-07-19T19:49:05.683265+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.683265+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-distill-qwen-1.5b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 461,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"deepseek-r1-distill-qwen-1.5b\",\n    \"score\": 0.527,\n    \"normalized_score\": 0.527,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Cons@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.976978+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.978475+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 311,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"deepseek-r1-distill-qwen-1.5b\",\n    \"score\": 0.338,\n    \"normalized_score\": 0.338,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond, Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.694071+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.694071+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1130,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"deepseek-r1-distill-qwen-1.5b\",\n    \"score\": 0.169,\n    \"normalized_score\": 0.169,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:13.362673+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.362673+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 499,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"deepseek-r1-distill-qwen-1.5b\",\n    \"score\": 0.839,\n    \"normalized_score\": 0.839,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.041592+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.041592+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-distill-qwen-1.5b/model.json",
    "content": "{\n  \"model_id\": \"deepseek-r1-distill-qwen-1.5b\",\n  \"name\": \"DeepSeek R1 Distill Qwen 1.5B\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"DeepSeek-R1 is the first-generation reasoning model built atop DeepSeek-V3 (671B total parameters, 37B activated per token). It incorporates large-scale reinforcement learning (RL) to enhance its chain-of-thought and reasoning capabilities, delivering strong performance in math, code, and multi-step reasoning tasks.\",\n  \"release_date\": \"2025-01-20\",\n  \"announcement_date\": \"2025-01-20\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 1780000000,\n  \"training_tokens\": 14800000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://api-docs.deepseek.com/news/news250120\",\n  \"source_playground\": \"https://chat.deepseek.com\",\n  \"source_paper\": \"https://arxiv.org/pdf/2501.12948\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-R1\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\",\n  \"created_at\": \"2025-07-19T19:49:05.672853+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.672853+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-distill-qwen-14b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 469,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"deepseek-r1-distill-qwen-14b\",\n    \"score\": 0.8,\n    \"normalized_score\": 0.8,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Cons@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.991646+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.993518+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 316,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"deepseek-r1-distill-qwen-14b\",\n    \"score\": 0.591,\n    \"normalized_score\": 0.591,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond, Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.702334+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.702334+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1136,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"deepseek-r1-distill-qwen-14b\",\n    \"score\": 0.531,\n    \"normalized_score\": 0.531,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:13.387993+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.387993+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 504,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"deepseek-r1-distill-qwen-14b\",\n    \"score\": 0.939,\n    \"normalized_score\": 0.939,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.050287+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.050287+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-distill-qwen-14b/model.json",
    "content": "{\n  \"model_id\": \"deepseek-r1-distill-qwen-14b\",\n  \"name\": \"DeepSeek R1 Distill Qwen 14B\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"DeepSeek-R1 is the first-generation reasoning model built atop DeepSeek-V3 (671B total parameters, 37B activated per token). It incorporates large-scale reinforcement learning (RL) to enhance its chain-of-thought and reasoning capabilities, delivering strong performance in math, code, and multi-step reasoning tasks.\",\n  \"release_date\": \"2025-01-20\",\n  \"announcement_date\": \"2025-01-20\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 14800000000,\n  \"training_tokens\": 14800000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://api-docs.deepseek.com/news/news250120\",\n  \"source_playground\": \"https://chat.deepseek.com\",\n  \"source_paper\": \"https://arxiv.org/pdf/2501.12948\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-R1\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B\",\n  \"created_at\": \"2025-07-19T19:49:05.688267+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.688267+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-distill-qwen-32b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 471,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"deepseek-r1-distill-qwen-32b\",\n    \"score\": 0.833,\n    \"normalized_score\": 0.833,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Cons@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.995645+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.997517+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 317,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"deepseek-r1-distill-qwen-32b\",\n    \"score\": 0.621,\n    \"normalized_score\": 0.621,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond, Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.703902+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.703902+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1137,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"deepseek-r1-distill-qwen-32b\",\n    \"score\": 0.572,\n    \"normalized_score\": 0.572,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n   
 \"created_at\": \"2025-07-19T19:56:13.389729+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.389729+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 505,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"deepseek-r1-distill-qwen-32b\",\n    \"score\": 0.943,\n    \"normalized_score\": 0.943,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.051744+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.051744+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-distill-qwen-32b/model.json",
    "content": "{\n  \"model_id\": \"deepseek-r1-distill-qwen-32b\",\n  \"name\": \"DeepSeek R1 Distill Qwen 32B\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"DeepSeek-R1 is the first-generation reasoning model built atop DeepSeek-V3 (671B total parameters, 37B activated per token). It incorporates large-scale reinforcement learning (RL) to enhance its chain-of-thought and reasoning capabilities, delivering strong performance in math, code, and multi-step reasoning tasks.\",\n  \"release_date\": \"2025-01-20\",\n  \"announcement_date\": \"2025-01-20\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 32800000000,\n  \"training_tokens\": 14800000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://api-docs.deepseek.com/news/news250120\",\n  \"source_playground\": \"https://chat.deepseek.com\",\n  \"source_paper\": \"https://arxiv.org/pdf/2501.12948\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-R1\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B\",\n  \"created_at\": \"2025-07-19T19:49:05.690560+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.690560+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-distill-qwen-7b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 459,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"deepseek-r1-distill-qwen-7b\",\n    \"score\": 0.833,\n    \"normalized_score\": 0.833,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Cons@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.973870+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.975371+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 310,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"deepseek-r1-distill-qwen-7b\",\n    \"score\": 0.491,\n    \"normalized_score\": 0.491,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond, Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.692702+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.692702+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1129,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"deepseek-r1-distill-qwen-7b\",\n    \"score\": 0.376,\n    \"normalized_score\": 0.376,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:13.360567+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.360567+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 498,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"deepseek-r1-distill-qwen-7b\",\n    \"score\": 0.928,\n    \"normalized_score\": 0.928,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.039853+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.039853+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-distill-qwen-7b/model.json",
    "content": "{\n  \"model_id\": \"deepseek-r1-distill-qwen-7b\",\n  \"name\": \"DeepSeek R1 Distill Qwen 7B\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"DeepSeek-R1 is the first-generation reasoning model built atop DeepSeek-V3 (671B total parameters, 37B activated per token). It incorporates large-scale reinforcement learning (RL) to enhance its chain-of-thought and reasoning capabilities, delivering strong performance in math, code, and multi-step reasoning tasks.\",\n  \"release_date\": \"2025-01-20\",\n  \"announcement_date\": \"2025-01-20\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 7620000000,\n  \"training_tokens\": 14800000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://api-docs.deepseek.com/news/news250120\",\n  \"source_playground\": \"https://chat.deepseek.com\",\n  \"source_paper\": \"https://arxiv.org/pdf/2501.12948\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-R1\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\",\n  \"created_at\": \"2025-07-19T19:49:05.669926+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.669926+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-zero/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 457,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"deepseek-r1-zero\",\n    \"score\": 0.867,\n    \"normalized_score\": 0.867,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2501.12948\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Cons@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.970600+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.972162+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 309,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"deepseek-r1-zero\",\n    \"score\": 0.733,\n    \"normalized_score\": 0.733,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2501.12948\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1 Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.691175+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.691175+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1128,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"deepseek-r1-zero\",\n    \"score\": 0.5,\n    \"normalized_score\": 0.5,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2501.12948\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.357962+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.357962+00:00\",\n    
\"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 497,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"deepseek-r1-zero\",\n    \"score\": 0.959,\n    \"normalized_score\": 0.959,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2501.12948\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.038172+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.038172+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-r1-zero/model.json",
    "content": "{\n  \"model_id\": \"deepseek-r1-zero\",\n  \"name\": \"DeepSeek R1 Zero\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": \"deepseek-v3\",\n  \"description\": \"DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.\",\n  \"release_date\": \"2025-01-20\",\n  \"announcement_date\": \"2025-01-20\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 671000000000,\n  \"training_tokens\": 14800000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://api-docs.deepseek.com/news/news250120\",\n  \"source_playground\": \"https://chat.deepseek.com\",\n  \"source_paper\": \"https://arxiv.org/abs/2501.12948\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-R1\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-R1\",\n  \"created_at\": \"2025-07-19T19:49:05.902496+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.902496+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-v2.5/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1627,\n    \"benchmark_id\": \"aider\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.722,\n    \"normalized_score\": 0.722,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V2.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.574890+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.574890+00:00\",\n    \"benchmark_name\": \"Aider\"\n  },\n  {\n    \"model_benchmark_id\": 1619,\n    \"benchmark_id\": \"alignbench\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.804,\n    \"normalized_score\": 0.804,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.deepseek.com/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.550691+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.550691+00:00\",\n    \"benchmark_name\": \"AlignBench\"\n  },\n  {\n    \"model_benchmark_id\": 1790,\n    \"benchmark_id\": \"alpacaeval-2.0\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.505,\n    \"normalized_score\": 0.505,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V2.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.041535+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.041535+00:00\",\n    
\"benchmark_name\": \"AlpacaEval 2.0\"\n  },\n  {\n    \"model_benchmark_id\": 1456,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.762,\n    \"normalized_score\": 0.762,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V2.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.104170+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.104170+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 974,\n    \"benchmark_id\": \"bbh\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.843,\n    \"normalized_score\": 0.843,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.deepseek.com/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.046694+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.046694+00:00\",\n    \"benchmark_name\": \"BBH\"\n  },\n  {\n    \"model_benchmark_id\": 1797,\n    \"benchmark_id\": \"ds-arena-code\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.631,\n    \"normalized_score\": 0.631,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V2.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.060324+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:15.060324+00:00\",\n    \"benchmark_name\": \"DS-Arena-Code\"\n  },\n  {\n    \"model_benchmark_id\": 1796,\n    \"benchmark_id\": \"ds-fim-eval\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.783,\n    \"normalized_score\": 0.783,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V2.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.056487+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.056487+00:00\",\n    \"benchmark_name\": \"DS-FIM-Eval\"\n  },\n  {\n    \"model_benchmark_id\": 1000,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.951,\n    \"normalized_score\": 0.951,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.deepseek.com/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.091340+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.091340+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 792,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.89,\n    \"normalized_score\": 0.89,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.deepseek.com/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.656959+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.656959+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1789,\n    \"benchmark_id\": \"humaneval-mul\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.738,\n    \"normalized_score\": 0.738,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V2.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.037209+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.037209+00:00\",\n    \"benchmark_name\": \"HumanEval-Mul\"\n  },\n  {\n    \"model_benchmark_id\": 1795,\n    \"benchmark_id\": \"livecodebench(01-09)\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.418,\n    \"normalized_score\": 0.418,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V2.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.052983+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.052983+00:00\",\n    \"benchmark_name\": \"LiveCodeBench(01-09)\"\n  },\n  {\n    \"model_benchmark_id\": 411,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.747,\n    \"normalized_score\": 0.747,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.deepseek.com/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:11.874944+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.874944+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 94,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.804,\n    \"normalized_score\": 0.804,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.deepseek.com/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.277903+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.277903+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1608,\n    \"benchmark_id\": \"mt-bench\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.902,\n    \"normalized_score\": 0.902,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.deepseek.com/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.525856+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.525856+00:00\",\n    \"benchmark_name\": \"MT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 1345,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"deepseek-v2.5\",\n    \"score\": 0.168,\n    \"normalized_score\": 0.168,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V2.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.830793+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.830793+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-v2.5/model.json",
    "content": "{\n  \"model_id\": \"deepseek-v2.5\",\n  \"name\": \"DeepSeek-V2.5\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"DeepSeek-V2.5 is an upgraded version that combines DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct, integrating general and coding abilities. It better aligns with human preferences and has been optimized in various aspects, including writing and instruction following.\",\n  \"release_date\": \"2024-05-08\",\n  \"announcement_date\": \"2024-05-08\",\n  \"license_id\": \"deepseek\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 236000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.deepseek.com/\",\n  \"source_playground\": \"https://huggingface.co/deepseek-ai/DeepSeek-V2.5\",\n  \"source_paper\": \"https://arxiv.org/abs/2405.04434\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V2.5\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V2.5\",\n  \"created_at\": \"2025-07-19T19:49:05.680851+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.680851+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-v3/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 663,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.496,\n    \"normalized_score\": 0.496,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.374175+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.374175+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 1330,\n    \"benchmark_id\": \"aider-polyglot-edit\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.797,\n    \"normalized_score\": 0.797,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.796886+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.796886+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot Edit\"\n  },\n  {\n    \"model_benchmark_id\": 463,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.392,\n    \"normalized_score\": 0.392,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.980196+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.980196+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 438,\n    \"benchmark_id\": \"c-eval\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.865,\n    \"normalized_score\": 0.865,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Exact Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.928060+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.928060+00:00\",\n    \"benchmark_name\": \"C-Eval\"\n  },\n  {\n    \"model_benchmark_id\": 600,\n    \"benchmark_id\": \"cluewsc\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.909,\n    \"normalized_score\": 0.909,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Exact Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.237991+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.237991+00:00\",\n    \"benchmark_name\": \"CLUEWSC\"\n  },\n  {\n    \"model_benchmark_id\": 711,\n    \"benchmark_id\": \"cnmo-2024\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.432,\n    \"normalized_score\": 0.432,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.493124+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.493124+00:00\",\n    \"benchmark_name\": \"CNMO 2024\"\n  },\n  {\n    \"model_benchmark_id\": 442,\n    \"benchmark_id\": \"csimpleqa\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.648,\n    \"normalized_score\": 0.648,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Correct\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.937598+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.937598+00:00\",\n    \"benchmark_name\": \"CSimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 951,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.916,\n    \"normalized_score\": 0.916,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot F1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.005931+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.005931+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 1753,\n    \"benchmark_id\": \"frames\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.733,\n    \"normalized_score\": 0.733,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:14.958906+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.958906+00:00\",\n    \"benchmark_name\": \"FRAMES\"\n  },\n  {\n    \"model_benchmark_id\": 312,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.591,\n    \"normalized_score\": 0.591,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.695757+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.695757+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1788,\n    \"benchmark_id\": \"humaneval-mul\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.826,\n    \"normalized_score\": 0.826,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.035409+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.035409+00:00\",\n    \"benchmark_name\": \"HumanEval-Mul\"\n  },\n  {\n    \"model_benchmark_id\": 622,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.861,\n    \"normalized_score\": 0.861,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Prompt Strict\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.280659+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.280659+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1131,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.376,\n    \"normalized_score\": 0.376,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.364940+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.372242+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 1787,\n    \"benchmark_id\": \"longbench-v2\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.487,\n    \"normalized_score\": 0.487,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.031520+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.031520+00:00\",\n    \"benchmark_name\": \"LongBench v2\"\n  },\n  {\n    \"model_benchmark_id\": 500,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.902,\n    \"normalized_score\": 0.902,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Exact Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": 
null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.043125+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.043125+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    \"model_benchmark_id\": 93,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.885,\n    \"normalized_score\": 0.885,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Exact Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.275957+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.275957+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 202,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.759,\n    \"normalized_score\": 0.759,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Exact Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.485394+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.485394+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 737,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.891,\n    \"normalized_score\": 0.891,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Exact Match\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.548864+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.548864+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 235,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.249,\n    \"normalized_score\": 0.249,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Correct\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.549943+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.549943+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 1344,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"deepseek-v3\",\n    \"score\": 0.42,\n    \"normalized_score\": 0.42,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Resolved\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.828562+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.828562+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-v3/model.json",
    "content": "{\n  \"model_id\": \"deepseek-v3\",\n  \"name\": \"DeepSeek-V3\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A powerful Mixture-of-Experts (MoE) language model with 671B total parameters (37B activated per token). Features Multi-head Latent Attention (MLA), auxiliary-loss-free load balancing, and multi-token prediction training. Pre-trained on 14.8T tokens with strong performance in reasoning, math, and code tasks.\",\n  \"release_date\": \"2024-12-25\",\n  \"announcement_date\": \"2024-12-25\",\n  \"license_id\": \"mit_+_model_license_(commercial_use_allowed)\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 671000000000,\n  \"training_tokens\": 14800000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.deepseek.com\",\n  \"source_playground\": \"https://chat.deepseek.com\",\n  \"source_paper\": \"https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3\",\n  \"created_at\": \"2025-07-19T19:49:05.677307+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.677307+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-v3-0324/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 473,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"deepseek-v3-0324\",\n    \"score\": 0.594,\n    \"normalized_score\": 0.594,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://api-docs.deepseek.com/news/news250325\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.999879+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.999879+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 318,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"deepseek-v3-0324\",\n    \"score\": 0.684,\n    \"normalized_score\": 0.684,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://api-docs.deepseek.com/news/news250325\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.705537+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.705537+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1138,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"deepseek-v3-0324\",\n    \"score\": 0.492,\n    \"normalized_score\": 0.492,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://api-docs.deepseek.com/news/news250325\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.392232+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:13.392232+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 506,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"deepseek-v3-0324\",\n    \"score\": 0.94,\n    \"normalized_score\": 0.94,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://api-docs.deepseek.com/news/news250325\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.053333+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.053333+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    \"model_benchmark_id\": 204,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"deepseek-v3-0324\",\n    \"score\": 0.812,\n    \"normalized_score\": 0.812,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://api-docs.deepseek.com/news/news250325\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Exact Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.488686+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.488686+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-v3-0324/model.json",
    "content": "{\n  \"model_id\": \"deepseek-v3-0324\",\n  \"name\": \"DeepSeek-V3 0324\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A powerful Mixture-of-Experts (MoE) language model with 671B total parameters (37B activated per token). Features Multi-head Latent Attention (MLA), auxiliary-loss-free load balancing, and multi-token prediction training. Pre-trained on 14.8T tokens with strong performance in reasoning, math, and code tasks.\",\n  \"release_date\": \"2025-03-25\",\n  \"announcement_date\": \"2025-03-25\",\n  \"license_id\": \"mit_+_model_license_(commercial_use_allowed)\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 671000000000,\n  \"training_tokens\": 14800000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.deepseek.com\",\n  \"source_playground\": \"https://chat.deepseek.com\",\n  \"source_paper\": \"https://arxiv.org/abs/2412.19437\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3-0324\",\n  \"created_at\": \"2025-07-19T19:49:05.693499+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.693499+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-v3.1/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 9501,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.918,\n    \"normalized_score\": 0.918,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Non-Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Non-thinking: 91.8%, Thinking: 93.7%\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 9502,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.837,\n    \"normalized_score\": 0.837,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Non-Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Non-thinking: 83.7%, Thinking: 84.8%\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 9503,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.749,\n    \"normalized_score\": 0.749,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Non-Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": \"Non-thinking: 74.9%, Thinking: 80.1%\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9504,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.159,\n    \"normalized_score\": 0.159,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Thinking mode, text-only subset\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Thinking mode only, text-only subset\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 9505,\n    \"benchmark_id\": \"browsecomp\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.3,\n    \"normalized_score\": 0.3,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode with search agent\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Search agent with commercial API + webpage filter + 128K context\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp\"\n  },\n  {\n    \"model_benchmark_id\": 9506,\n    \"benchmark_id\": \"browsecomp-zh\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.492,\n    \"normalized_score\": 0.492,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode with search agent\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Search agent with commercial API + webpage filter + 128K context\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp-zh\"\n  },\n  {\n    \"model_benchmark_id\": 9507,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.934,\n    \"normalized_score\": 0.934,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode with search agent\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Search agent evaluation\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 9508,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.564,\n    \"normalized_score\": 0.564,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, 2408-2505, Non-Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Non-thinking: 56.4%, Thinking: 74.8%\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": 
\"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 9509,\n    \"benchmark_id\": \"codeforces\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.697,\n    \"normalized_score\": 0.697,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Div1 Rating, Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Codeforces Div1 rating in thinking mode\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Codeforces\"\n  },\n  {\n    \"model_benchmark_id\": 9510,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.684,\n    \"normalized_score\": 0.684,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Non-Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Non-thinking: 68.4%, Thinking: 76.3%\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 9511,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.66,\n    \"normalized_score\": 0.66,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agent mode, Non-Thinking\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Evaluated with internal code agent framework\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 9512,\n    \"benchmark_id\": \"swe-bench-multilingual\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.545,\n    \"normalized_score\": 0.545,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agent mode, Non-Thinking\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Evaluated with internal code agent framework\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-Bench Multilingual\"\n  },\n  {\n    \"model_benchmark_id\": 9513,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.313,\n    \"normalized_score\": 0.313,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Terminus 1 framework, Non-Thinking\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Terminal-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 9514,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.663,\n    
\"normalized_score\": 0.663,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Non-Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Non-thinking: 66.3%, Thinking: 93.1%\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 9515,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.498,\n    \"normalized_score\": 0.498,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Non-Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Non-thinking: 49.8%, Thinking: 88.4%\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 9516,\n    \"benchmark_id\": \"hmmt-2025\",\n    \"model_id\": \"deepseek-v3.1\",\n    \"score\": 0.335,\n    \"normalized_score\": 0.335,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Non-Thinking mode\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Non-thinking: 33.5%, Thinking: 84.2%\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": 
\"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"HMMT 2025\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-v3.1/model.json",
    "content": "{\n  \"model_id\": \"deepseek-v3.1\",\n  \"name\": \"DeepSeek-V3.1\",\n  \"organization_id\": \"deepseek\",\n  \"model_family_id\": null,\n  \"fine_tuned_from_model_id\": \"deepseek-v3\",\n  \"description\": \"DeepSeek-V3.1 is a hybrid model supporting both thinking and non-thinking modes through different chat templates. Built on DeepSeek-V3.1-Base with a two-phase long context extension (32K phase: 630B tokens, 128K phase: 209B tokens), it features 671B total parameters with 37B activated. Key improvements include smarter tool calling through post-training optimization, higher thinking efficiency achieving comparable quality to DeepSeek-R1-0528 while responding more quickly, and UE8M0 FP8 scale data format for model weights and activations. The model excels in both reasoning tasks (thinking mode) and practical applications (non-thinking mode), with particularly strong performance in code agent tasks, math competitions, and search-based problem solving.\",\n  \"release_date\": \"2025-01-10\",\n  \"announcement_date\": \"2025-01-10\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 671000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://api.deepseek.com/docs\",\n  \"source_playground\": \"https://chat.deepseek.com/\",\n  \"source_paper\": \"https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek-V3.pdf\",\n  \"source_scorecard_blog_link\": \"https://www.deepseek.com/news/deepseek-v3-1\",\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-V3\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.1\",\n  \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-v3.2-exp/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 9521,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.85,\n    \"normalized_score\": 0.85,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Reasoning Mode (w/o Tool Use)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 9522,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.799,\n    \"normalized_score\": 0.799,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Reasoning Mode (w/o Tool Use)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9523,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.198,\n    \"normalized_score\": 0.198,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Reasoning Mode (w/o Tool Use)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Text-only 
subset where applicable\",\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 9524,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.741,\n    \"normalized_score\": 0.741,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1 (Reasoning Mode w/o Tool Use)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 9525,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.893,\n    \"normalized_score\": 0.893,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1 (Reasoning Mode w/o Tool Use)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 9526,\n    \"benchmark_id\": \"hmmt-2025\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.836,\n    \"normalized_score\": 0.836,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"Pass@1 (Reasoning Mode w/o Tool Use)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"HMMT 2025\"\n  },\n  {\n    \"model_benchmark_id\": 9527,\n    \"benchmark_id\": \"codeforces\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.707,\n    \"normalized_score\": 0.707,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Div1 rating (Reasoning Mode)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Raw rating ≈ 2121; normalized by 3000 max\",\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Codeforces\"\n  },\n  {\n    \"model_benchmark_id\": 9528,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.745,\n    \"normalized_score\": 0.745,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Reasoning Mode (w/o Tool Use)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 9529,\n    \"benchmark_id\": \"browsecomp\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.401,\n 
   \"normalized_score\": 0.401,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic Tool Use\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp\"\n  },\n  {\n    \"model_benchmark_id\": 9530,\n    \"benchmark_id\": \"browsecomp-zh\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.479,\n    \"normalized_score\": 0.479,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic Tool Use\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp-zh\"\n  },\n  {\n    \"model_benchmark_id\": 9531,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.971,\n    \"normalized_score\": 0.971,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic Tool Use\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 
9532,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.678,\n    \"normalized_score\": 0.678,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic Tool Use\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 9533,\n    \"benchmark_id\": \"swe-bench-multilingual\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.579,\n    \"normalized_score\": 0.579,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic Tool Use\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-Bench Multilingual\"\n  },\n  {\n    \"model_benchmark_id\": 9534,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"score\": 0.377,\n    \"normalized_score\": 0.377,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic Tool Use\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Terminal-Bench\"\n  }\n]\n\n"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-v3.2-exp/model.json",
    "content": "{\n  \"model_id\": \"deepseek-v3.2-exp\",\n  \"name\": \"DeepSeek-V3.2-Exp\",\n  \"organization_id\": \"deepseek\",\n  \"model_family_id\": null,\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"DeepSeek-V3.2-Exp is an experimental iteration introducing DeepSeek Sparse Attention (DSA) to improve long-context training and inference efficiency while keeping output quality on par with V3.1. It explores fine-grained sparse attention for extended sequence processing.\",\n  \"release_date\": \"2025-09-29\",\n  \"announcement_date\": \"2025-09-29\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 685000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://api.deepseek.com/docs\",\n  \"source_playground\": \"https://chat.deepseek.com/\",\n  \"source_paper\": \"https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-V3.2-Exp\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp\",\n  \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-vl2/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1256,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.814,\n    \"normalized_score\": 0.814,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.636398+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.636398+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 868,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.86,\n    \"normalized_score\": 0.86,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.812840+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.812840+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 890,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.933,\n    \"normalized_score\": 0.933,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.852402+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.852402+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    
\"model_benchmark_id\": 1244,\n    \"benchmark_id\": \"infovqa\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.781,\n    \"normalized_score\": 0.781,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.614094+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.614094+00:00\",\n    \"benchmark_name\": \"InfoVQA\"\n  },\n  {\n    \"model_benchmark_id\": 528,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.628,\n    \"normalized_score\": 0.628,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"testmini\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.096047+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.096047+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1513,\n    \"benchmark_id\": \"mmbench\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.796,\n    \"normalized_score\": 0.796,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"en test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.245378+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.247008+00:00\",\n    \"benchmark_name\": \"MMBench\"\n  },\n  {\n    
\"model_benchmark_id\": 1727,\n    \"benchmark_id\": \"mmbench-v1.1\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.792,\n    \"normalized_score\": 0.792,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"cn test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.873346+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.873346+00:00\",\n    \"benchmark_name\": \"MMBench-V1.1\"\n  },\n  {\n    \"model_benchmark_id\": 1784,\n    \"benchmark_id\": \"mme\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.2253,\n    \"normalized_score\": 0.2253,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.025040+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.025040+00:00\",\n    \"benchmark_name\": \"MME\"\n  },\n  {\n    \"model_benchmark_id\": 574,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.511,\n    \"normalized_score\": 0.511,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"val\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.181251+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.181251+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    
\"model_benchmark_id\": 1663,\n    \"benchmark_id\": \"mmstar\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.613,\n    \"normalized_score\": 0.613,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.669907+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.669907+00:00\",\n    \"benchmark_name\": \"MMStar\"\n  },\n  {\n    \"model_benchmark_id\": 1667,\n    \"benchmark_id\": \"mmt-bench\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.636,\n    \"normalized_score\": 0.636,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.678247+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.678247+00:00\",\n    \"benchmark_name\": \"MMT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 1542,\n    \"benchmark_id\": \"ocrbench\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.811,\n    \"normalized_score\": 0.811,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.320020+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.320020+00:00\",\n    \"benchmark_name\": 
\"OCRBench\"\n  },\n  {\n    \"model_benchmark_id\": 1635,\n    \"benchmark_id\": \"realworldqa\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.684,\n    \"normalized_score\": 0.684,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.601290+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.601290+00:00\",\n    \"benchmark_name\": \"RealWorldQA\"\n  },\n  {\n    \"model_benchmark_id\": 912,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"deepseek-vl2\",\n    \"score\": 0.842,\n    \"normalized_score\": 0.842,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"val\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.902069+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.902069+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-vl2/model.json",
    "content": "{\n  \"model_id\": \"deepseek-vl2\",\n  \"name\": \"DeepSeek VL2\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"An advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL. DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding.\",\n  \"release_date\": \"2024-12-13\",\n  \"announcement_date\": \"2024-12-13\",\n  \"license_id\": \"deepseek\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 27000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.deepseek.com/\",\n  \"source_playground\": \"https://huggingface.co/deepseek-ai/deepseek-vl2\",\n  \"source_paper\": \"https://arxiv.org/pdf/2412.10302\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-VL2\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/deepseek-vl2\",\n  \"created_at\": \"2025-07-19T19:49:05.658016+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.658016+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-vl2-small/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1258,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.8,\n    \"normalized_score\": 0.8,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.640145+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.640145+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 870,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.845,\n    \"normalized_score\": 0.845,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.816278+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.816278+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 892,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.923,\n    \"normalized_score\": 0.923,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.857733+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.857733+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  
},\n  {\n    \"model_benchmark_id\": 1246,\n    \"benchmark_id\": \"infovqa\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.758,\n    \"normalized_score\": 0.758,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.617970+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.617970+00:00\",\n    \"benchmark_name\": \"InfoVQA\"\n  },\n  {\n    \"model_benchmark_id\": 530,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.607,\n    \"normalized_score\": 0.607,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"testmini\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.100314+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.100314+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1517,\n    \"benchmark_id\": \"mmbench\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.803,\n    \"normalized_score\": 0.803,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"en test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.252930+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.254459+00:00\",\n    \"benchmark_name\": 
\"MMBench\"\n  },\n  {\n    \"model_benchmark_id\": 1729,\n    \"benchmark_id\": \"mmbench-v1.1\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.793,\n    \"normalized_score\": 0.793,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"cn test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.876824+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.876824+00:00\",\n    \"benchmark_name\": \"MMBench-V1.1\"\n  },\n  {\n    \"model_benchmark_id\": 1786,\n    \"benchmark_id\": \"mme\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.2123,\n    \"normalized_score\": 0.2123,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.028315+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.028315+00:00\",\n    \"benchmark_name\": \"MME\"\n  },\n  {\n    \"model_benchmark_id\": 576,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.48,\n    \"normalized_score\": 0.48,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"val\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.184966+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.184966+00:00\",\n    
\"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1665,\n    \"benchmark_id\": \"mmstar\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.57,\n    \"normalized_score\": 0.57,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.672978+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.672978+00:00\",\n    \"benchmark_name\": \"MMStar\"\n  },\n  {\n    \"model_benchmark_id\": 1669,\n    \"benchmark_id\": \"mmt-bench\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.629,\n    \"normalized_score\": 0.629,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.683443+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.683443+00:00\",\n    \"benchmark_name\": \"MMT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 1544,\n    \"benchmark_id\": \"ocrbench\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.834,\n    \"normalized_score\": 0.834,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.324965+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:14.324965+00:00\",\n    \"benchmark_name\": \"OCRBench\"\n  },\n  {\n    \"model_benchmark_id\": 1637,\n    \"benchmark_id\": \"realworldqa\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.654,\n    \"normalized_score\": 0.654,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.604508+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.604508+00:00\",\n    \"benchmark_name\": \"RealWorldQA\"\n  },\n  {\n    \"model_benchmark_id\": 914,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"deepseek-vl2-small\",\n    \"score\": 0.834,\n    \"normalized_score\": 0.834,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"val\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.906237+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.906237+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-vl2-small/model.json",
    "content": "{\n  \"model_id\": \"deepseek-vl2-small\",\n  \"name\": \"DeepSeek VL2 Small\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"An advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL. DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding.\",\n  \"release_date\": \"2024-12-13\",\n  \"announcement_date\": \"2024-12-13\",\n  \"license_id\": \"deepseek\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 16000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.deepseek.com/\",\n  \"source_playground\": \"https://huggingface.co/deepseek-ai/deepseek-vl2-small\",\n  \"source_paper\": \"https://arxiv.org/pdf/2412.10302\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-VL2\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/deepseek-vl2-small\",\n  \"created_at\": \"2025-07-19T19:49:05.666424+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.666424+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-vl2-tiny/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1257,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.716,\n    \"normalized_score\": 0.716,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.638556+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.638556+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 869,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.81,\n    \"normalized_score\": 0.81,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.814592+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.814592+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 891,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.889,\n    \"normalized_score\": 0.889,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.854588+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.854588+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  
},\n  {\n    \"model_benchmark_id\": 1245,\n    \"benchmark_id\": \"infovqa\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.661,\n    \"normalized_score\": 0.661,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.616113+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.616113+00:00\",\n    \"benchmark_name\": \"InfoVQA\"\n  },\n  {\n    \"model_benchmark_id\": 529,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.536,\n    \"normalized_score\": 0.536,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"testmini\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.098477+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.098477+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1515,\n    \"benchmark_id\": \"mmbench\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.692,\n    \"normalized_score\": 0.692,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"en test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.249349+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.251060+00:00\",\n    \"benchmark_name\": 
\"MMBench\"\n  },\n  {\n    \"model_benchmark_id\": 1728,\n    \"benchmark_id\": \"mmbench-v1.1\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.683,\n    \"normalized_score\": 0.683,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"cn test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.875207+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.875207+00:00\",\n    \"benchmark_name\": \"MMBench-V1.1\"\n  },\n  {\n    \"model_benchmark_id\": 1785,\n    \"benchmark_id\": \"mme\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.1915,\n    \"normalized_score\": 0.1915,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.026734+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.026734+00:00\",\n    \"benchmark_name\": \"MME\"\n  },\n  {\n    \"model_benchmark_id\": 575,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.407,\n    \"normalized_score\": 0.407,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"val\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.183016+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.183016+00:00\",\n    
\"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1664,\n    \"benchmark_id\": \"mmstar\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.459,\n    \"normalized_score\": 0.459,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.671412+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.671412+00:00\",\n    \"benchmark_name\": \"MMStar\"\n  },\n  {\n    \"model_benchmark_id\": 1668,\n    \"benchmark_id\": \"mmt-bench\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.532,\n    \"normalized_score\": 0.532,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.681683+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.681683+00:00\",\n    \"benchmark_name\": \"MMT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 1543,\n    \"benchmark_id\": \"ocrbench\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.809,\n    \"normalized_score\": 0.809,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.321888+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:14.321888+00:00\",\n    \"benchmark_name\": \"OCRBench\"\n  },\n  {\n    \"model_benchmark_id\": 1636,\n    \"benchmark_id\": \"realworldqa\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.642,\n    \"normalized_score\": 0.642,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.602948+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.602948+00:00\",\n    \"benchmark_name\": \"RealWorldQA\"\n  },\n  {\n    \"model_benchmark_id\": 913,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"deepseek-vl2-tiny\",\n    \"score\": 0.807,\n    \"normalized_score\": 0.807,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.10302\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"val\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.904238+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.904238+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/deepseek/models/deepseek-vl2-tiny/model.json",
    "content": "{\n  \"model_id\": \"deepseek-vl2-tiny\",\n  \"name\": \"DeepSeek VL2 Tiny\",\n  \"organization_id\": \"deepseek\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"An advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL. DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding.\",\n  \"release_date\": \"2024-12-13\",\n  \"announcement_date\": \"2024-12-13\",\n  \"license_id\": \"deepseek\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 3000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.deepseek.com/\",\n  \"source_playground\": \"https://huggingface.co/deepseek-ai/deepseek-vl2-tiny\",\n  \"source_paper\": \"https://arxiv.org/pdf/2412.10302\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/deepseek-ai/DeepSeek-VL2\",\n  \"source_weights_link\": \"https://huggingface.co/deepseek-ai/deepseek-vl2-tiny\",\n  \"created_at\": \"2025-07-19T19:49:05.662552+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.662552+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/deepseek/organization.json",
    "content": "{\n  \"organization_id\": \"deepseek\",\n  \"name\": \"DeepSeek\",\n  \"website\": \"https://deepseek.com\",\n  \"description\": \"Chinese AI company developing state-of-the-art large language models including the DeepSeek-V3 series with mixture-of-experts architecture and hybrid thinking/non-thinking capabilities\",\n  \"country\": \"CN\",\n  \"created_at\": \"2025-07-19T19:49:05.655332+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/google/models/gemini-1.0-pro/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1390,\n    \"benchmark_id\": \"big-bench\",\n    \"model_id\": \"gemini-1.0-pro\",\n    \"score\": 0.75,\n    \"normalized_score\": 0.75,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.928761+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.928761+00:00\",\n    \"benchmark_name\": \"BIG-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 920,\n    \"benchmark_id\": \"egoschema\",\n    \"model_id\": \"gemini-1.0-pro\",\n    \"score\": 0.557,\n    \"normalized_score\": 0.557,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.922622+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.922622+00:00\",\n    \"benchmark_name\": \"EgoSchema\"\n  },\n  {\n    \"model_benchmark_id\": 1397,\n    \"benchmark_id\": \"fleurs\",\n    \"model_id\": \"gemini-1.0-pro\",\n    \"score\": 0.064,\n    \"normalized_score\": 0.064,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.946039+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:13.946039+00:00\",\n    \"benchmark_name\": \"FLEURS\"\n  },\n  {\n    \"model_benchmark_id\": 264,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemini-1.0-pro\",\n    \"score\": 0.279,\n    \"normalized_score\": 0.279,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.607534+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.607534+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 378,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gemini-1.0-pro\",\n    \"score\": 0.326,\n    \"normalized_score\": 0.326,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.817378+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.817378+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 516,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"gemini-1.0-pro\",\n    \"score\": 0.466,\n    \"normalized_score\": 0.466,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.073663+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.073663+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 64,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gemini-1.0-pro\",\n    \"score\": 0.718,\n    \"normalized_score\": 0.718,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.221259+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.221259+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 553,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gemini-1.0-pro\",\n    \"score\": 0.479,\n    \"normalized_score\": 0.479,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.139083+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.139083+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1393,\n    \"benchmark_id\": \"wmt23\",\n    \"model_id\": \"gemini-1.0-pro\",\n    \"score\": 0.717,\n    \"normalized_score\": 0.717,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.937549+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:13.937549+00:00\",\n    \"benchmark_name\": \"WMT23\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemini-1.0-pro/model.json",
    "content": "{\n  \"model_id\": \"gemini-1.0-pro\",\n  \"name\": \"Gemini 1.0 Pro\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemini 1.0 Pro is a Natural Language Processing (NLP) model designed for tasks such as multi-turn text and code chat, and code generation. It supports text input and output, making it ideal for natural language tasks. The model is optimized for handling complex conversations and generating code snippets. It offers adjustable safety settings and supports function calling, but does not support JSON mode, JSON schema, or system instructions. The latest stable version is gemini-1.0-pro-001, and it was last updated in February 2024.\",\n  \"release_date\": \"2024-02-15\",\n  \"announcement_date\": \"2024-02-15\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2024-02-01\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://ai.google.dev/gemini-api/docs/models/gemini#gemini-1.0-pro\",\n  \"source_playground\": \"https://gemini.google/advanced/\",\n  \"source_paper\": \"https://arxiv.org/pdf/2312.11805\",\n  \"source_scorecard_blog_link\": \"https://blog.google/technology/ai/google-gemini-ai/#scalable-efficient\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.461784+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.461784+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemini-1.5-flash/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1417,\n    \"benchmark_id\": \"amc-2022-23\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.348,\n    \"normalized_score\": 0.348,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.maa.org/math-competitions/amc-1012\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy (4-shot)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.997413+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.997413+00:00\",\n    \"benchmark_name\": \"AMC_2022_23\"\n  },\n  {\n    \"model_benchmark_id\": 1072,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.855,\n    \"normalized_score\": 0.855,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2206.04615\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy (3-shot)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.235605+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.235605+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1399,\n    \"benchmark_id\": \"fleurs\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.096,\n    \"normalized_score\": 0.096,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Word Error Rate\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.949679+00:00\",\n 
   \"updated_at\": \"2025-07-19T19:56:13.949679+00:00\",\n    \"benchmark_name\": \"FLEURS\"\n  },\n  {\n    \"model_benchmark_id\": 1415,\n    \"benchmark_id\": \"functionalmath\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.536,\n    \"normalized_score\": 0.536,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2201.04723\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy (0-shot)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.991969+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.991969+00:00\",\n    \"benchmark_name\": \"FunctionalMATH\"\n  },\n  {\n    \"model_benchmark_id\": 272,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.51,\n    \"normalized_score\": 0.51,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.622361+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.622361+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 981,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.862,\n    \"normalized_score\": 0.862,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2110.14168\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy (11-shot)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:13.060014+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.060014+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 40,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.865,\n    \"normalized_score\": 0.865,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/1905.07830\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy (10-shot)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.168455+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.168455+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 1158,\n    \"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.472,\n    \"normalized_score\": 0.472,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.436585+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.436585+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 768,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.743,\n    \"normalized_score\": 0.743,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass Rate\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.617215+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.617215+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 383,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.779,\n    \"normalized_score\": 0.779,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.826586+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.826586+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 518,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.658,\n    \"normalized_score\": 0.658,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.077492+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.077492+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1276,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.826,\n    \"normalized_score\": 0.826,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2305.08916\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy (8-shot)\",\n    \"verification_provider_id\": 
null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.676395+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.676395+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 69,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.789,\n    \"normalized_score\": 0.789,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2403.05530\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.229674+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.229674+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 168,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.673,\n    \"normalized_score\": 0.673,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.426986+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.426986+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 560,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.623,\n    \"normalized_score\": 0.623,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.153019+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.153019+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1376,\n    \"benchmark_id\": \"mrcr\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.719,\n    \"normalized_score\": 0.719,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.896456+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.896456+00:00\",\n    \"benchmark_name\": \"MRCR\"\n  },\n  {\n    \"model_benchmark_id\": 1199,\n    \"benchmark_id\": \"natural2code\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.798,\n    \"normalized_score\": 0.798,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.525034+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.525034+00:00\",\n    \"benchmark_name\": \"Natural2Code\"\n  },\n  {\n    \"model_benchmark_id\": 1413,\n    \"benchmark_id\": \"physicsfinals\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.574,\n    \"normalized_score\": 0.574,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2303.16416\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"Accuracy (0-shot)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.986673+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.986673+00:00\",\n    \"benchmark_name\": \"PhysicsFinals\"\n  },\n  {\n    \"model_benchmark_id\": 1369,\n    \"benchmark_id\": \"vibe-eval\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.489,\n    \"normalized_score\": 0.489,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.882991+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.882991+00:00\",\n    \"benchmark_name\": \"Vibe-Eval\"\n  },\n  {\n    \"model_benchmark_id\": 1381,\n    \"benchmark_id\": \"video-mme\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.761,\n    \"normalized_score\": 0.761,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.908485+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.908485+00:00\",\n    \"benchmark_name\": \"Video-MME\"\n  },\n  {\n    \"model_benchmark_id\": 1395,\n    \"benchmark_id\": \"wmt23\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.741,\n    \"normalized_score\": 0.741,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.940965+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.940965+00:00\",\n    \"benchmark_name\": \"WMT23\"\n  },\n  {\n    \"model_benchmark_id\": 1419,\n    \"benchmark_id\": \"xstest\",\n    \"model_id\": \"gemini-1.5-flash\",\n    \"score\": 0.97,\n    \"normalized_score\": 0.97,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.004109+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.004109+00:00\",\n    \"benchmark_name\": \"XSTest\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemini-1.5-flash/model.json",
    "content": "{\n  \"model_id\": \"gemini-1.5-flash\",\n  \"name\": \"Gemini 1.5 Flash\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemini 1.5 Flash is a fast and versatile multimodal model for scaling across diverse tasks. It supports audio, images, video, and text input, and produces text output. The model is optimized for generating code, extracting data, editing text, and more, making it ideal for narrow, high-frequency tasks.\",\n  \"release_date\": \"2024-05-01\",\n  \"announcement_date\": \"2024-05-01\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2023-11-01\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://ai.google.dev/gemini-api/docs/models/gemini#gemini-1.5-flash\",\n  \"source_playground\": \"https://ai.google.dev/studio\",\n  \"source_paper\": \"https://arxiv.org/pdf/2403.05530\",\n  \"source_scorecard_blog_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.514569+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.514569+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemini-1.5-flash-8b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1400,\n    \"benchmark_id\": \"fleurs\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.864,\n    \"normalized_score\": 0.864,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Speech recognition accuracy (1 - WER)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.951665+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.951665+00:00\",\n    \"benchmark_name\": \"FLEURS\"\n  },\n  {\n    \"model_benchmark_id\": 277,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.384,\n    \"normalized_score\": 0.384,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy on expert-written science questions\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.635441+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.635441+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1163,\n    \"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.328,\n    \"normalized_score\": 0.328,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy on competition-level math problems\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.447290+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.447290+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 387,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.587,\n    \"normalized_score\": 0.587,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy on mathematical problem-solving tasks\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.834192+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.834192+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 519,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.547,\n    \"normalized_score\": 0.547,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Visual mathematical reasoning accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.078820+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.078820+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 173,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.587,\n    \"normalized_score\": 0.587,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"Multiple choice accuracy across enhanced MMLU dataset with higher difficulty tasks\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.436045+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.436045+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 561,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.537,\n    \"normalized_score\": 0.537,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Multimodal understanding accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.154594+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.154594+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1377,\n    \"benchmark_id\": \"mrcr\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.547,\n    \"normalized_score\": 0.547,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Long-context comprehension accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.898262+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.898262+00:00\",\n    \"benchmark_name\": \"MRCR\"\n  },\n  {\n    \"model_benchmark_id\": 1203,\n    \"benchmark_id\": \"natural2code\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.755,\n    
\"normalized_score\": 0.755,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass rate on code generation tasks across multiple programming languages\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.531432+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.531432+00:00\",\n    \"benchmark_name\": \"Natural2Code\"\n  },\n  {\n    \"model_benchmark_id\": 1370,\n    \"benchmark_id\": \"vibe-eval\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.409,\n    \"normalized_score\": 0.409,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Visual understanding evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.885058+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.885058+00:00\",\n    \"benchmark_name\": \"Vibe-Eval\"\n  },\n  {\n    \"model_benchmark_id\": 1382,\n    \"benchmark_id\": \"video-mme\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.662,\n    \"normalized_score\": 0.662,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Video analysis accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.910273+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.910273+00:00\",\n    
\"benchmark_name\": \"Video-MME\"\n  },\n  {\n    \"model_benchmark_id\": 1396,\n    \"benchmark_id\": \"wmt23\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.726,\n    \"normalized_score\": 0.726,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Translation quality score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.942779+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.942779+00:00\",\n    \"benchmark_name\": \"WMT23\"\n  },\n  {\n    \"model_benchmark_id\": 1420,\n    \"benchmark_id\": \"xstest\",\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"score\": 0.926,\n    \"normalized_score\": 0.926,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Safe request fulfillment rate\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.005888+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.005888+00:00\",\n    \"benchmark_name\": \"XSTest\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemini-1.5-flash-8b/model.json",
    "content": "{\n  \"model_id\": \"gemini-1.5-flash-8b\",\n  \"name\": \"Gemini 1.5 Flash 8B\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A multimodal model capable of processing audio, images, video, and text with high efficiency. Features JSON mode, function calling, code execution, and system instructions support. Optimized for fast inference with 8B parameters.\",\n  \"release_date\": \"2024-03-15\",\n  \"announcement_date\": \"2024-03-15\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-10-01\",\n  \"param_count\": 8000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://ai.google.dev/docs/gemini_1.5_flash\",\n  \"source_playground\": \"https://ai.google.dev/studio\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/google/generative-ai\",\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.530672+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.530672+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemini-1.5-pro/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1416,\n    \"benchmark_id\": \"amc-2022-23\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.464,\n    \"normalized_score\": 0.464,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2403.05530\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"4-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.995700+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.995700+00:00\",\n    \"benchmark_name\": \"AMC_2022_23\"\n  },\n  {\n    \"model_benchmark_id\": 1070,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.892,\n    \"normalized_score\": 0.892,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2403.05530\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.231702+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.231702+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 945,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.749,\n    \"normalized_score\": 0.749,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2403.05530\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Variable shots\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.994980+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.994980+00:00\",\n    
\"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 1398,\n    \"benchmark_id\": \"fleurs\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.067,\n    \"normalized_score\": 0.067,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Word Error Rate\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.947638+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.947638+00:00\",\n    \"benchmark_name\": \"FLEURS\"\n  },\n  {\n    \"model_benchmark_id\": 1414,\n    \"benchmark_id\": \"functionalmath\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.646,\n    \"normalized_score\": 0.646,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2403.05530\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.990248+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.990248+00:00\",\n    \"benchmark_name\": \"FunctionalMATH\"\n  },\n  {\n    \"model_benchmark_id\": 268,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.591,\n    \"normalized_score\": 0.591,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.614440+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.614440+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 979,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.908,\n    \"normalized_score\": 0.908,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2403.05530\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"11-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.055992+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.055992+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 37,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.933,\n    \"normalized_score\": 0.933,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2403.05530\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.158919+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.158919+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 1157,\n    \"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.52,\n    \"normalized_score\": 0.52,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.434888+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:13.434888+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 766,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.841,\n    \"normalized_score\": 0.841,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2403.05530\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.613548+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.613548+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 381,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.865,\n    \"normalized_score\": 0.865,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.822515+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.822515+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 517,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.681,\n    \"normalized_score\": 0.681,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.075702+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.075702+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1275,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.875,\n    \"normalized_score\": 0.875,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2403.05530\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"8-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.674684+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.674684+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 67,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.859,\n    \"normalized_score\": 0.859,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2403.05530\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.226593+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.226593+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 167,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.758,\n    \"normalized_score\": 0.758,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:11.425109+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.425109+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 556,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.659,\n    \"normalized_score\": 0.659,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.145100+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.145100+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1373,\n    \"benchmark_id\": \"mrcr\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.826,\n    \"normalized_score\": 0.826,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.891629+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.891629+00:00\",\n    \"benchmark_name\": \"MRCR\"\n  },\n  {\n    \"model_benchmark_id\": 1198,\n    \"benchmark_id\": \"natural2code\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.854,\n    \"normalized_score\": 0.854,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.523328+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.523328+00:00\",\n    \"benchmark_name\": \"Natural2Code\"\n  },\n  {\n    \"model_benchmark_id\": 1412,\n    \"benchmark_id\": \"physicsfinals\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.639,\n    \"normalized_score\": 0.639,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2403.05530\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.984883+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.984883+00:00\",\n    \"benchmark_name\": \"PhysicsFinals\"\n  },\n  {\n    \"model_benchmark_id\": 1366,\n    \"benchmark_id\": \"vibe-eval\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.539,\n    \"normalized_score\": 0.539,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.877591+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.877591+00:00\",\n    \"benchmark_name\": \"Vibe-Eval\"\n  },\n  {\n    \"model_benchmark_id\": 1380,\n    \"benchmark_id\": \"video-mme\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.786,\n    \"normalized_score\": 0.786,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.906552+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.906552+00:00\",\n    \"benchmark_name\": \"Video-MME\"\n  },\n  {\n    \"model_benchmark_id\": 1394,\n    \"benchmark_id\": \"wmt23\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.751,\n    \"normalized_score\": 0.751,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.939104+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.939104+00:00\",\n    \"benchmark_name\": \"WMT23\"\n  },\n  {\n    \"model_benchmark_id\": 1418,\n    \"benchmark_id\": \"xstest\",\n    \"model_id\": \"gemini-1.5-pro\",\n    \"score\": 0.988,\n    \"normalized_score\": 0.988,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Safety Compliance\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.002222+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.002222+00:00\",\n    \"benchmark_name\": \"XSTest\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemini-1.5-pro/model.json",
    "content": "{\n  \"model_id\": \"gemini-1.5-pro\",\n  \"name\": \"Gemini 1.5 Pro\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemini 1.5 Pro is a mid-size multimodal model optimized for a wide range of reasoning tasks. It can process large amounts of data at once, including 2 hours of video, 19 hours of audio, codebases with 60,000 lines of code, or 2,000 pages of text.\",\n  \"release_date\": \"2024-05-01\",\n  \"announcement_date\": \"2024-05-01\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2023-11-01\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://ai.google.dev/gemini-api/docs/models/gemini#gemini-1.5-pro\",\n  \"source_playground\": \"https://ai.google.dev/studio\",\n  \"source_paper\": \"https://arxiv.org/pdf/2403.05530\",\n  \"source_scorecard_blog_link\": \"https://deepmind.google/technologies/gemini/pro/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.481673+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.481673+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemini-2.0-flash/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1152,\n    \"benchmark_id\": \"bird-sql-(dev)\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.569,\n    \"normalized_score\": 0.569,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Natural language to SQL conversion evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.423568+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.423568+00:00\",\n    \"benchmark_name\": \"Bird-SQL (dev)\"\n  },\n  {\n    \"model_benchmark_id\": 1404,\n    \"benchmark_id\": \"covost2\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.392,\n    \"normalized_score\": 0.392,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Automatic speech translation (BLEU score) across 21 languages\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.962212+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.962212+00:00\",\n    \"benchmark_name\": \"CoVoST2\"\n  },\n  {\n    \"model_benchmark_id\": 922,\n    \"benchmark_id\": \"egoschema\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.715,\n    \"normalized_score\": 0.715,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Video analysis across multiple 
domains\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.926117+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.926117+00:00\",\n    \"benchmark_name\": \"EgoSchema\"\n  },\n  {\n    \"model_benchmark_id\": 1095,\n    \"benchmark_id\": \"facts-grounding\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.836,\n    \"normalized_score\": 0.836,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Ability to provide factuality correct responses given documents and diverse user requests\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.278460+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.278460+00:00\",\n    \"benchmark_name\": \"FACTS Grounding\"\n  },\n  {\n    \"model_benchmark_id\": 279,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.621,\n    \"normalized_score\": 0.621,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Challenging dataset of questions written by domain experts in biology, physics, and chemistry\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.639283+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.639283+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1164,\n    
\"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.63,\n    \"normalized_score\": 0.63,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Competition-level math problems, Held out dataset AIME/AMC-like\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.449979+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.449979+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 1111,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.351,\n    \"normalized_score\": 0.351,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Code generation in Python. 
Code Generation subset covering more recent examples: 06/01/2024 - 10/05/2024\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.317443+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.317443+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 388,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.897,\n    \"normalized_score\": 0.897,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Challenging math problems including algebra, geometry, pre-calculus, and others\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.835842+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.835842+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 174,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.764,\n    \"normalized_score\": 0.764,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Enhanced version of MMLU dataset evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.437540+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.437540+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 562,\n    \"benchmark_id\": 
\"mmmu\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.707,\n    \"normalized_score\": 0.707,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Multi-discipline college-level multimodal understanding and reasoning problems\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.156776+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.156776+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1378,\n    \"benchmark_id\": \"mrcr\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.692,\n    \"normalized_score\": 0.692,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Novel, diagnostic long-context understanding evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.900780+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.900780+00:00\",\n    \"benchmark_name\": \"MRCR\"\n  },\n  {\n    \"model_benchmark_id\": 1204,\n    \"benchmark_id\": \"natural2code\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.929,\n    \"normalized_score\": 0.929,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Code generation evaluation across multiple languages\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.533525+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.533525+00:00\",\n    \"benchmark_name\": \"Natural2Code\"\n  },\n  {\n    \"model_benchmark_id\": 1371,\n    \"benchmark_id\": \"vibe-eval\",\n    \"model_id\": \"gemini-2.0-flash\",\n    \"score\": 0.563,\n    \"normalized_score\": 0.563,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Visual understanding in chat models with challenging everyday examples\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.886575+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.886575+00:00\",\n    \"benchmark_name\": \"Vibe-Eval\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemini-2.0-flash/model.json",
    "content": "{\n  \"model_id\": \"gemini-2.0-flash\",\n  \"name\": \"Gemini 2.0 Flash\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Next-generation model featuring superior speed, native tool use, multimodal generation, and a 1M token context window. Supports audio, images, video, and text input with capabilities for structured outputs, function calling, code execution, search, and multimodal operations.\",\n  \"release_date\": \"2024-12-01\",\n  \"announcement_date\": \"2024-12-01\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-08-01\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://ai.google.dev/gemini-api/docs/models/gemini#gemini-2.0-flash\",\n  \"source_playground\": \"https://ai.google.dev/studio\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.538624+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.538624+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemini-2.0-flash-lite/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1148,\n    \"benchmark_id\": \"bird-sql-(dev)\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.574,\n    \"normalized_score\": 0.574,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"- evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.415349+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.415349+00:00\",\n    \"benchmark_name\": \"Bird-SQL (dev)\"\n  },\n  {\n    \"model_benchmark_id\": 1403,\n    \"benchmark_id\": \"covost2\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.384,\n    \"normalized_score\": 0.384,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Automatic speech translation (BLEU score) across 21 languages\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.960537+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.960537+00:00\",\n    \"benchmark_name\": \"CoVoST2\"\n  },\n  {\n    \"model_benchmark_id\": 921,\n    \"benchmark_id\": \"egoschema\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.672,\n    \"normalized_score\": 0.672,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Video analysis across multiple domains\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.924659+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.924659+00:00\",\n    \"benchmark_name\": \"EgoSchema\"\n  },\n  {\n    \"model_benchmark_id\": 1088,\n    \"benchmark_id\": \"facts-grounding\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.836,\n    \"normalized_score\": 0.836,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"- evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.264333+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.264333+00:00\",\n    \"benchmark_name\": \"FACTS Grounding\"\n  },\n  {\n    \"model_benchmark_id\": 1209,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.782,\n    \"normalized_score\": 0.782,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.543616+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.543616+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 266,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.515,\n    \"normalized_score\": 0.515,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.611234+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.611234+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1156,\n    \"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.553,\n    \"normalized_score\": 0.553,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.433332+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.433332+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 1320,\n    \"benchmark_id\": \"livecodebench-v5\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.289,\n    \"normalized_score\": 0.289,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.771288+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.771288+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v5\"\n  },\n  {\n    \"model_benchmark_id\": 379,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.868,\n    \"normalized_score\": 0.868,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models\",\n   
 \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.819524+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.819524+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 166,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.716,\n    \"normalized_score\": 0.716,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/gemini-2-family-expands/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Chain-of-Thought accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.423223+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.423223+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 554,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.68,\n    \"normalized_score\": 0.68,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Multi-discipline college-level multimodal understanding and reasoning problems\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.141505+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.141505+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1402,\n    \"benchmark_id\": \"mrcr-1m\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.58,\n    
\"normalized_score\": 0.58,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Long-context comprehension accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.956748+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.956748+00:00\",\n    \"benchmark_name\": \"MRCR 1M\"\n  },\n  {\n    \"model_benchmark_id\": 226,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"score\": 0.217,\n    \"normalized_score\": 0.217,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Factuality\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.535234+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.535234+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemini-2.0-flash-lite/model.json",
    "content": "{\n  \"model_id\": \"gemini-2.0-flash-lite\",\n  \"name\": \"Gemini 2.0 Flash-Lite\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A Gemini 2.0 Flash model optimized for cost efficiency and low latency\",\n  \"release_date\": \"2025-02-05\",\n  \"announcement_date\": \"2025-02-05\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-06-01\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://ai.google.dev/gemini-api/docs/models#gemini-2.0-flash-lite\",\n  \"source_playground\": \"https://aistudio.google.com/prompts/new_chat?model=gemini-2.0-flash-lite\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://developers.googleblog.com/en/gemini-2-family-expands\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.469548+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.469548+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemini-2.0-flash-thinking/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 448,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"gemini-2.0-flash-thinking\",\n    \"score\": 0.733,\n    \"normalized_score\": 0.733,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models/gemini#evaluation\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Enhanced reasoning on competition-level math prompts\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.952263+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.952263+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 271,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemini-2.0-flash-thinking\",\n    \"score\": 0.742,\n    \"normalized_score\": 0.742,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models/gemini#evaluation\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Challenging science questions requiring chain-of-thought reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.620752+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.620752+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 559,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gemini-2.0-flash-thinking\",\n    \"score\": 0.754,\n    \"normalized_score\": 0.754,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemini-api/docs/models/gemini#evaluation\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Image-text QA across various domains\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.151038+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.151038+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemini-2.0-flash-thinking/model.json",
    "content": "{\n  \"model_id\": \"gemini-2.0-flash-thinking\",\n  \"name\": \"Gemini 2.0 Flash Thinking\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemini 2.0 Flash Thinking is an enhanced reasoning model, capable of showing its thoughts to improve performance and explainability. Combining speed and performance, Gemini 2.0 Flash Thinking also excels in science and math, showing its thinking to solve complex problems.\",\n  \"release_date\": \"2025-01-21\",\n  \"announcement_date\": \"2025-01-21\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-08-01\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://ai.google.dev/gemini-api/docs/models/gemini#gemini-2.0-flash-thinking-experimental\",\n  \"source_playground\": \"https://ai.google.dev/studio\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.504495+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.504495+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemini-2.5-flash/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 661,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.619,\n    \"normalized_score\": 0.619,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"whole\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.370513+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.370513+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 1329,\n    \"benchmark_id\": \"aider-polyglot-edit\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.567,\n    \"normalized_score\": 0.567,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-updates-io-2025\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diff-Fenced\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.795058+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.795058+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot Edit\"\n  },\n  {\n    \"model_benchmark_id\": 447,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.88,\n    \"normalized_score\": 0.88,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.950448+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.950448+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 683,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.72,\n    \"normalized_score\": 0.72,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.428509+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.428509+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1091,\n    \"benchmark_id\": \"facts-grounding\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.853,\n    \"normalized_score\": 0.853,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.271323+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.271323+00:00\",\n    \"benchmark_name\": \"FACTS Grounding\"\n  },\n  {\n    \"model_benchmark_id\": 1212,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.884,\n    \"normalized_score\": 0.884,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.550549+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.550549+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 270,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.828,\n    \"normalized_score\": 0.828,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.619078+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.619078+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 720,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.11,\n    \"normalized_score\": 0.11,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.518055+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.518055+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 1321,\n    \"benchmark_id\": \"livecodebench-v5\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.639,\n    \"normalized_score\": 0.639,\n    \"is_self_reported\": 
true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.773194+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.773194+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v5\"\n  },\n  {\n    \"model_benchmark_id\": 558,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.797,\n    \"normalized_score\": 0.797,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.148985+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.148985+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1374,\n    \"benchmark_id\": \"mrcr\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.32,\n    \"normalized_score\": 0.32,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/technology/google-deepmind/google-gemini-updates-io-2025\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"1M-pointwise\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.893404+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.895016+00:00\",\n    \"benchmark_name\": \"MRCR\"\n  },\n  {\n    \"model_benchmark_id\": 229,\n    \"benchmark_id\": 
\"simpleqa\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.269,\n    \"normalized_score\": 0.269,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.540281+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.540281+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 1341,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.604,\n    \"normalized_score\": 0.604,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.822771+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.822771+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1368,\n    \"benchmark_id\": \"vibe-eval\",\n    \"model_id\": \"gemini-2.5-flash\",\n    \"score\": 0.654,\n    \"normalized_score\": 0.654,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.880772+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:13.880772+00:00\",\n    \"benchmark_name\": \"Vibe-Eval\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemini-2.5-flash/model.json",
    "content": "{\n  \"model_id\": \"gemini-2.5-flash\",\n  \"name\": \"Gemini 2.5 Flash\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A thinking model designed for a balance between price and performance. It builds upon Gemini 2.0 Flash with upgraded reasoning, hybrid thinking control, multimodal capabilities (text, image, video, audio input), and a 1M token input context window.\",\n  \"release_date\": \"2025-05-20\",\n  \"announcement_date\": \"2025-05-20\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2025-01-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://ai.google.dev/gemini-api/docs/models?hl=en#gemini-2.5-flash-preview-04-17\",\n  \"source_playground\": \"https://aistudio.google.com/?model=gemini-2.5-flash-preview-04-17\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://developers.googleblog.com/en/start-building-with-gemini-25-flash/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.500918+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.500918+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemini-2.5-flash-lite/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 659,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.267,\n    \"normalized_score\": 0.267,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/flash-lite/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Code editing\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.366506+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.366506+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 681,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.498,\n    \"normalized_score\": 0.498,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/flash-lite/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Mathematics\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.422347+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.422347+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1406,\n    \"benchmark_id\": \"arc\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.025,\n    \"normalized_score\": 0.025,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Default\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.969921+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.969921+00:00\",\n    \"benchmark_name\": \"Arc\"\n  },\n  {\n    \"model_benchmark_id\": 1089,\n    \"benchmark_id\": \"facts-grounding\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.841,\n    \"normalized_score\": 0.841,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/flash-lite/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Factuality\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.267251+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.267251+00:00\",\n    \"benchmark_name\": \"FACTS Grounding\"\n  },\n  {\n    \"model_benchmark_id\": 1210,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.811,\n    \"normalized_score\": 0.811,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/flash-lite/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Multilingual performance\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.546251+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.546251+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 267,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.646,\n    \"normalized_score\": 0.646,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/flash-lite/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.612808+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.612808+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 718,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.051,\n    \"normalized_score\": 0.051,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/flash-lite/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"No tools\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.514286+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.514286+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 1104,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.337,\n    \"normalized_score\": 0.337,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/flash-lite/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Code generation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.300809+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.300809+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 555,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.729,\n    \"normalized_score\": 0.729,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/flash-lite/\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"Visual reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.143254+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.143254+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1405,\n    \"benchmark_id\": \"mrcr-v2\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.166,\n    \"normalized_score\": 0.166,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/flash-lite/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Long context 128k average. 8 needle.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.966057+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.966057+00:00\",\n    \"benchmark_name\": \"MRCR v2\"\n  },\n  {\n    \"model_benchmark_id\": 227,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.107,\n    \"normalized_score\": 0.107,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/flash-lite/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Factuality\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.536893+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.536893+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 1339,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.316,\n    \"normalized_score\": 0.316,\n    
\"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/flash-lite/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic coding single attempt\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.819222+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.819222+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1365,\n    \"benchmark_id\": \"vibe-eval\",\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"score\": 0.513,\n    \"normalized_score\": 0.513,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/flash-lite/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Reka\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.875989+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.875989+00:00\",\n    \"benchmark_name\": \"Vibe-Eval\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemini-2.5-flash-lite/model.json",
    "content": "{\n  \"model_id\": \"gemini-2.5-flash-lite\",\n  \"name\": \"Gemini 2.5 Flash-Lite\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemini 2.5 Flash-Lite is a model developed by Google DeepMind, designed to handle various tasks including reasoning, science, mathematics, code generation, and more. It features advanced capabilities in multilingual performance and long context understanding. It is optimized for low latency use cases, supporting multimodal input with a 1 million-token context length.\",\n  \"release_date\": \"2025-06-17\",\n  \"announcement_date\": \"2025-06-17\",\n  \"license_id\": \"creative_commons_attribution_4_0_license\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2025-01-01\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite\",\n  \"source_playground\": \"https://ai.google.com/studio\",\n  \"source_paper\": \"https://arxiv.org/abs/2503.16534\",\n  \"source_scorecard_blog_link\": \"https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.473471+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.473471+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemini-2.5-pro/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 658,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.765,\n    \"normalized_score\": 0.765,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.364634+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.364634+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 1328,\n    \"benchmark_id\": \"aider-polyglot-edit\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.727,\n    \"normalized_score\": 0.727,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diff\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.793176+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.793176+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot Edit\"\n  },\n  {\n    \"model_benchmark_id\": 446,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.92,\n    \"normalized_score\": 0.92,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.948567+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.948567+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 679,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.83,\n    \"normalized_score\": 0.83,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.417055+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.417055+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1385,\n    \"benchmark_id\": \"arc-agi-v2\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.049,\n    \"normalized_score\": 0.049,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.918991+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.918991+00:00\",\n    \"benchmark_name\": \"ARC-AGI v2\"\n  },\n  {\n    \"model_benchmark_id\": 1207,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.886,\n    \"normalized_score\": 0.886,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.540318+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.540318+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 263,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.83,\n    \"normalized_score\": 0.83,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.605360+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.605360+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 717,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.178,\n    \"normalized_score\": 0.178,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.511856+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.511856+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 1318,\n    \"benchmark_id\": \"livecodebench-v5\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.756,\n    \"normalized_score\": 0.756,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": 
null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.763325+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.763325+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v5\"\n  },\n  {\n    \"model_benchmark_id\": 552,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.796,\n    \"normalized_score\": 0.796,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.137517+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.137517+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1372,\n    \"benchmark_id\": \"mrcr\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.93,\n    \"normalized_score\": 0.93,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"128k-average\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.889867+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.889867+00:00\",\n    \"benchmark_name\": \"MRCR\"\n  },\n  {\n    \"model_benchmark_id\": 1384,\n    \"benchmark_id\": \"mrcr-1m-(pointwise)\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.829,\n    \"normalized_score\": 0.829,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pointwise\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.915166+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.915166+00:00\",\n    \"benchmark_name\": \"MRCR 1M (pointwise)\"\n  },\n  {\n    \"model_benchmark_id\": 225,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.508,\n    \"normalized_score\": 0.508,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.532774+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.532774+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 1338,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.632,\n    \"normalized_score\": 0.632,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.816932+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.816932+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1364,\n    \"benchmark_id\": \"vibe-eval\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.656,\n    \"normalized_score\": 0.656,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.874453+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.874453+00:00\",\n    \"benchmark_name\": \"Vibe-Eval\"\n  },\n  {\n    \"model_benchmark_id\": 1379,\n    \"benchmark_id\": \"video-mme\",\n    \"model_id\": \"gemini-2.5-pro\",\n    \"score\": 0.848,\n    \"normalized_score\": 0.848,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini/pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.904547+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.904547+00:00\",\n    \"benchmark_name\": \"Video-MME\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemini-2.5-pro/model.json",
    "content": "{\n  \"model_id\": \"gemini-2.5-pro\",\n  \"name\": \"Gemini 2.5 Pro\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Our most intelligent AI model, built for the agentic era. Gemini 2.5 Pro leads on common benchmarks with enhanced reasoning, multimodal capabilities (text, image, video, audio input), and a 1M token context window.\",\n  \"release_date\": \"2025-05-20\",\n  \"announcement_date\": \"2025-05-20\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2025-01-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://ai.google.dev/gemini-api/docs/models?hl=en#gemini-2.5-pro-preview-03-25\",\n  \"source_playground\": \"https://aistudio.google.com/?model=gemini-2.5-pro-preview-03-25\",\n  \"source_paper\": \"https://storage.googleapis.com/model-cards/documents/gemini-2.5-pro-preview.pdf\",\n  \"source_scorecard_blog_link\": \"https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.458697+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.458697+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemini-2.5-pro-preview-06-05/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 660,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.822,\n    \"normalized_score\": 0.822,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diff-fenced\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.368655+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.368655+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 682,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.88,\n    \"normalized_score\": 0.88,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Single attempt\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.425843+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.425843+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1090,\n    \"benchmark_id\": \"facts-grounding\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.878,\n    \"normalized_score\": 0.878,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Factuality\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n  
  \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.269434+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.269434+00:00\",\n    \"benchmark_name\": \"FACTS Grounding\"\n  },\n  {\n    \"model_benchmark_id\": 1211,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.892,\n    \"normalized_score\": 0.892,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Multilingual performance\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.548453+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.548453+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 269,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.864,\n    \"normalized_score\": 0.864,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Single attempt Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.617404+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.617404+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 719,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.216,\n    \"normalized_score\": 0.216,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"No tools\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.516239+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.516239+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 1105,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.69,\n    \"normalized_score\": 0.69,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Single attempt (1/1/2025-5/1/2025)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.303010+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.303010+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 557,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.82,\n    \"normalized_score\": 0.82,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Single attempt\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.146880+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.146880+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1422,\n    \"benchmark_id\": 
\"mrcr-v2-(8-needle)\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.164,\n    \"normalized_score\": 0.164,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"1M pointwise\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.013534+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.016258+00:00\",\n    \"benchmark_name\": \"MRCR v2 (8-needle)\"\n  },\n  {\n    \"model_benchmark_id\": 228,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.54,\n    \"normalized_score\": 0.54,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Factuality\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.538432+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.538432+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 1340,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.672,\n    \"normalized_score\": 0.672,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Multiple attempts\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.820885+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.820885+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1367,\n    \"benchmark_id\": \"vibe-eval\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.672,\n    \"normalized_score\": 0.672,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Image understanding\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.879257+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.879257+00:00\",\n    \"benchmark_name\": \"Vibe-Eval\"\n  },\n  {\n    \"model_benchmark_id\": 1421,\n    \"benchmark_id\": \"videommmu\",\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"score\": 0.836,\n    \"normalized_score\": 0.836,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Video understanding\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.009959+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.009959+00:00\",\n    \"benchmark_name\": \"VideoMMMU\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemini-2.5-pro-preview-06-05/model.json",
    "content": "{\n  \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n  \"name\": \"Gemini 2.5 Pro Preview 06-05\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"The latest preview version of Google's most advanced reasoning Gemini model, capable of solving complex problems. Built for the agentic era with enhanced reasoning capabilities, multimodal understanding (text, image, video, audio), and a 1M token context window. Features thinking preview, code execution, grounding with Google Search, system instructions, function calling, and controlled generation. Supports up to 3,000 images per prompt, 45-60 minutes of video, and 8.4 hours of audio.\",\n  \"release_date\": \"2025-06-05\",\n  \"announcement_date\": \"2025-06-05\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2025-01-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro\",\n  \"source_playground\": \"https://aistudio.google.com\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.493595+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.493595+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemini-diffusion/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 685,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gemini-diffusion\",\n    \"score\": 0.233,\n    \"normalized_score\": 0.233,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini-diffusion/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass @1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.434861+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.434861+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1100,\n    \"benchmark_id\": \"big-bench-extra-hard\",\n    \"model_id\": \"gemini-diffusion\",\n    \"score\": 0.15,\n    \"normalized_score\": 0.15,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini-diffusion/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass @1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.291288+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.291288+00:00\",\n    \"benchmark_name\": \"BIG-Bench Extra Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1433,\n    \"benchmark_id\": \"bigcodebench\",\n    \"model_id\": \"gemini-diffusion\",\n    \"score\": 0.454,\n    \"normalized_score\": 0.454,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini-diffusion/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass @1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.050987+00:00\",\n  
  \"updated_at\": \"2025-07-19T19:56:14.050987+00:00\",\n    \"benchmark_name\": \"BigCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 1217,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemini-diffusion\",\n    \"score\": 0.691,\n    \"normalized_score\": 0.691,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini-diffusion/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass @1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.559014+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.559014+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 278,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemini-diffusion\",\n    \"score\": 0.404,\n    \"normalized_score\": 0.404,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini-diffusion/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass @1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.637311+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.637311+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 773,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemini-diffusion\",\n    \"score\": 0.896,\n    \"normalized_score\": 0.896,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini-diffusion/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass @1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:12.625233+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.625233+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1435,\n    \"benchmark_id\": \"lbpp-(v2)\",\n    \"model_id\": \"gemini-diffusion\",\n    \"score\": 0.568,\n    \"normalized_score\": 0.568,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini-diffusion/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass @1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.056060+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.056060+00:00\",\n    \"benchmark_name\": \"LBPP (v2)\"\n  },\n  {\n    \"model_benchmark_id\": 1110,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"gemini-diffusion\",\n    \"score\": 0.309,\n    \"normalized_score\": 0.309,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini-diffusion/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass @1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.314684+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.314684+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 1175,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"gemini-diffusion\",\n    \"score\": 0.76,\n    \"normalized_score\": 0.76,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini-diffusion/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass @1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": 
null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.475906+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.475906+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1342,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"gemini-diffusion\",\n    \"score\": 0.229,\n    \"normalized_score\": 0.229,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://deepmind.google/models/gemini-diffusion/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass @1, Non-agentic evaluation (single turn edit only), max prompt length of 32K\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.824708+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.824708+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemini-diffusion/model.json",
    "content": "{\n  \"model_id\": \"gemini-diffusion\",\n  \"name\": \"Gemini Diffusion\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemini Diffusion is a state-of-the-art, experimental text diffusion model from Google DeepMind. It explores a new kind of language model designed to provide users with greater control, creativity, and speed in text generation. Instead of predicting text token-by-token, it learns to generate outputs by refining noise step-by-step, allowing for rapid iteration and error correction during generation. Key capabilities include rapid response times (reportedly 1479 tokens/sec excluding overhead), generation of more coherent text by outputting entire blocks of tokens at once, and iterative refinement for consistent outputs. It excels at tasks like editing, including in math and code contexts.\",\n  \"release_date\": \"2025-05-20\",\n  \"announcement_date\": \"2025-05-20\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://deepmind.google/models/gemini-diffusion/\",\n  \"source_repo_link\": \"https://github.com/google\",\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.534835+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.534835+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemma-2-27b-it/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1408,\n    \"benchmark_id\": \"agieval\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.551,\n    \"normalized_score\": 0.551,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.975397+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.975397+00:00\",\n    \"benchmark_name\": \"AGIEval\"\n  },\n  {\n    \"model_benchmark_id\": 9,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.714,\n    \"normalized_score\": 0.714,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"25-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.099650+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.099650+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1055,\n    \"benchmark_id\": \"arc-e\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.886,\n    \"normalized_score\": 0.886,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.203403+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.203403+00:00\",\n    \"benchmark_name\": 
\"ARC-E\"\n  },\n  {\n    \"model_benchmark_id\": 1392,\n    \"benchmark_id\": \"big-bench\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.749,\n    \"normalized_score\": 0.749,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot, CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.932992+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.932992+00:00\",\n    \"benchmark_name\": \"BIG-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 1021,\n    \"benchmark_id\": \"boolq\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.848,\n    \"normalized_score\": 0.848,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.126514+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.126514+00:00\",\n    \"benchmark_name\": \"BoolQ\"\n  },\n  {\n    \"model_benchmark_id\": 980,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.74,\n    \"normalized_score\": 0.74,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot, maj@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.058102+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.058102+00:00\",\n    
\"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 38,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.864,\n    \"normalized_score\": 0.864,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.164247+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.164247+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 767,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.518,\n    \"normalized_score\": 0.518,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.615384+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.615384+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 382,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.423,\n    \"normalized_score\": 0.423,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"4-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.824501+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.824501+00:00\",\n    
\"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1170,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.626,\n    \"normalized_score\": 0.626,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.464425+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.464425+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 68,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.752,\n    \"normalized_score\": 0.752,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot, top-1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.228104+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.228104+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1048,\n    \"benchmark_id\": \"natural-questions\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.345,\n    \"normalized_score\": 0.345,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.188220+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.188220+00:00\",\n    
\"benchmark_name\": \"Natural Questions\"\n  },\n  {\n    \"model_benchmark_id\": 1030,\n    \"benchmark_id\": \"piqa\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.832,\n    \"normalized_score\": 0.832,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.145819+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.145819+00:00\",\n    \"benchmark_name\": \"PIQA\"\n  },\n  {\n    \"model_benchmark_id\": 1039,\n    \"benchmark_id\": \"social-iqa\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.537,\n    \"normalized_score\": 0.537,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.168648+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.168648+00:00\",\n    \"benchmark_name\": \"Social IQa\"\n  },\n  {\n    \"model_benchmark_id\": 248,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.837,\n    \"normalized_score\": 0.837,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.574247+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.574247+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  },\n  {\n    \"model_benchmark_id\": 1060,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"gemma-2-27b-it\",\n    \"score\": 0.837,\n    \"normalized_score\": 0.837,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.212219+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.212219+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemma-2-27b-it/model.json",
    "content": "{\n  \"model_id\": \"gemma-2-27b-it\",\n  \"name\": \"Gemma 2 27B\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemma 2 27B IT is an instruction-tuned version of Google's state-of-the-art open language model. Built from the same research and technology as Gemini, it's optimized for dialogue applications through supervised fine-tuning, distillation from larger models, and RLHF. The model excels at text generation tasks including question answering, summarization, and reasoning.\",\n  \"release_date\": \"2024-06-27\",\n  \"announcement_date\": \"2024-06-27\",\n  \"license_id\": \"gemma\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 27200000000,\n  \"training_tokens\": 13000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/google/gemma-2-27b-it\",\n  \"source_playground\": \"https://huggingface.co/chat/models/google/gemma-2-27b-it\",\n  \"source_paper\": \"https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf\",\n  \"source_scorecard_blog_link\": \"https://huggingface.co/blog/gemma2\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/google/gemma-2-27b-it\",\n  \"created_at\": \"2025-07-19T19:49:05.485572+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.485572+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemma-2-9b-it/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1407,\n    \"benchmark_id\": \"agieval\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.528,\n    \"normalized_score\": 0.528,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-5-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.973652+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.973652+00:00\",\n    \"benchmark_name\": \"AGIEval\"\n  },\n  {\n    \"model_benchmark_id\": 8,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.684,\n    \"normalized_score\": 0.684,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"25-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.097779+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.097779+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1054,\n    \"benchmark_id\": \"arc-e\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.88,\n    \"normalized_score\": 0.88,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.201834+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.201834+00:00\",\n    
\"benchmark_name\": \"ARC-E\"\n  },\n  {\n    \"model_benchmark_id\": 1391,\n    \"benchmark_id\": \"big-bench\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.682,\n    \"normalized_score\": 0.682,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.930966+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.930966+00:00\",\n    \"benchmark_name\": \"BIG-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 1020,\n    \"benchmark_id\": \"boolq\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.842,\n    \"normalized_score\": 0.842,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.124981+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.124981+00:00\",\n    \"benchmark_name\": \"BoolQ\"\n  },\n  {\n    \"model_benchmark_id\": 978,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.686,\n    \"normalized_score\": 0.686,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot majority@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.053844+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:13.053844+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 36,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.819,\n    \"normalized_score\": 0.819,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.157090+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.157090+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 765,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.402,\n    \"normalized_score\": 0.402,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.611318+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.611318+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 380,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.366,\n    \"normalized_score\": 0.366,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"4-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.821125+00:00\",\n   
 \"updated_at\": \"2025-07-19T19:56:11.821125+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1169,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.524,\n    \"normalized_score\": 0.524,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.462564+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.462564+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 66,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.713,\n    \"normalized_score\": 0.713,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.224994+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.224994+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1047,\n    \"benchmark_id\": \"natural-questions\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.292,\n    \"normalized_score\": 0.292,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.186631+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.186631+00:00\",\n    \"benchmark_name\": \"Natural Questions\"\n  },\n  {\n    \"model_benchmark_id\": 1029,\n    \"benchmark_id\": \"piqa\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.817,\n    \"normalized_score\": 0.817,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.144012+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.144012+00:00\",\n    \"benchmark_name\": \"PIQA\"\n  },\n  {\n    \"model_benchmark_id\": 1038,\n    \"benchmark_id\": \"social-iqa\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.534,\n    \"normalized_score\": 0.534,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.166311+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.166311+00:00\",\n    \"benchmark_name\": \"Social IQa\"\n  },\n  {\n    \"model_benchmark_id\": 247,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.766,\n    \"normalized_score\": 0.766,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.572657+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.572657+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  },\n  {\n    \"model_benchmark_id\": 148,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"gemma-2-9b-it\",\n    \"score\": 0.806,\n    \"normalized_score\": 0.806,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/blog/gemma2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"partial score evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.380497+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.380497+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemma-2-9b-it/model.json",
    "content": "{\n  \"model_id\": \"gemma-2-9b-it\",\n  \"name\": \"Gemma 2 9B\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemma 2 9B IT is an instruction-tuned version of Google's Gemma 2 9B base model. It was trained on 8 trillion tokens of web data, code, and math content. The model features sliding window attention, logit soft-capping, and knowledge distillation techniques. It's optimized for dialogue applications through supervised fine-tuning, distillation, RLHF, and model merging using WARP.\",\n  \"release_date\": \"2024-06-27\",\n  \"announcement_date\": \"2024-06-27\",\n  \"license_id\": \"gemma\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 9240000000,\n  \"training_tokens\": 8000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/google/gemma-2-9b-it\",\n  \"source_playground\": \"https://huggingface.co/chat/models/google/gemma-2-9b-it\",\n  \"source_paper\": \"https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf\",\n  \"source_scorecard_blog_link\": \"https://huggingface.co/blog/gemma2\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/google/gemma-2-9b-it\",\n  \"created_at\": \"2025-07-19T19:49:05.477806+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.477806+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemma-3-12b-it/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1247,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.842,\n    \"normalized_score\": 0.842,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.621225+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.621225+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 1096,\n    \"benchmark_id\": \"big-bench-extra-hard\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.163,\n    \"normalized_score\": 0.163,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.282747+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.282747+00:00\",\n    \"benchmark_name\": \"BIG-Bench Extra Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1067,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.857,\n    \"normalized_score\": 0.857,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.226924+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.226924+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1147,\n    \"benchmark_id\": \"bird-sql-(dev)\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.479,\n    \"normalized_score\": 0.479,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"- evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.413629+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.413629+00:00\",\n    \"benchmark_name\": \"Bird-SQL (dev)\"\n  },\n  {\n    \"model_benchmark_id\": 855,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.757,\n    \"normalized_score\": 0.757,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.789962+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.789962+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 878,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.871,\n    \"normalized_score\": 0.871,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.830839+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.830839+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1219,\n    \"benchmark_id\": \"eclektic\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.103,\n    \"normalized_score\": 0.103,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.563615+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.563615+00:00\",\n    \"benchmark_name\": \"ECLeKTic\"\n  },\n  {\n    \"model_benchmark_id\": 1087,\n    \"benchmark_id\": \"facts-grounding\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.758,\n    \"normalized_score\": 0.758,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"- evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.262640+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.262640+00:00\",\n    \"benchmark_name\": \"FACTS Grounding\"\n  },\n  {\n    \"model_benchmark_id\": 1205,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.695,\n    \"normalized_score\": 0.695,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.537058+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.537058+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 261,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.409,\n    \"normalized_score\": 0.409,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.600334+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.600334+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 977,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.944,\n    \"normalized_score\": 0.944,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.052379+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.052379+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 1153,\n    \"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.545,\n    \"normalized_score\": 0.545,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.427708+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.427708+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 762,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.854,\n    \"normalized_score\": 0.854,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.606840+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.606840+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 607,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.889,\n    \"normalized_score\": 0.889,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.254325+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.254325+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1238,\n    \"benchmark_id\": \"infovqa\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.649,\n    
\"normalized_score\": 0.649,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.604072+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.604072+00:00\",\n    \"benchmark_name\": \"InfoVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1101,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.246,\n    \"normalized_score\": 0.246,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.294686+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.294686+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 377,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.838,\n    \"normalized_score\": 0.838,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.815597+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.815597+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1266,\n    
\"benchmark_id\": \"mathvista-mini\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.629,\n    \"normalized_score\": 0.629,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.657019+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.657019+00:00\",\n    \"benchmark_name\": \"MathVista-Mini\"\n  },\n  {\n    \"model_benchmark_id\": 1166,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.73,\n    \"normalized_score\": 0.73,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.456223+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.456223+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 163,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.606,\n    \"normalized_score\": 0.606,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.415028+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.415028+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1235,\n    \"benchmark_id\": \"mmmu-(val)\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.596,\n    \"normalized_score\": 0.596,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.595790+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.595790+00:00\",\n    \"benchmark_name\": \"MMMU (val)\"\n  },\n  {\n    \"model_benchmark_id\": 1197,\n    \"benchmark_id\": \"natural2code\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.807,\n    \"normalized_score\": 0.807,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.521277+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.521277+00:00\",\n    \"benchmark_name\": \"Natural2Code\"\n  },\n  {\n    \"model_benchmark_id\": 224,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.063,\n    \"normalized_score\": 0.063,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.528858+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.528858+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 903,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.677,\n    \"normalized_score\": 0.677,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.882990+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.882990+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1263,\n    \"benchmark_id\": \"vqav2-(val)\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.716,\n    \"normalized_score\": 0.716,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.650557+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.650557+00:00\",\n    \"benchmark_name\": \"VQAv2 (val)\"\n  },\n  {\n    \"model_benchmark_id\": 1227,\n    \"benchmark_id\": \"wmt24++\",\n    \"model_id\": \"gemma-3-12b-it\",\n    \"score\": 0.516,\n    \"normalized_score\": 0.516,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.578915+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.578915+00:00\",\n    \"benchmark_name\": \"WMT24++\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemma-3-12b-it/model.json",
    "content": "{\n  \"model_id\": \"gemma-3-12b-it\",\n  \"name\": \"Gemma 3 12B\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemma 3 12B is a 12-billion-parameter vision-language model from Google that accepts text and image input and generates text output. It features a 128K context window, multilingual support, and open weights. It is suitable for question answering, summarization, reasoning, and image understanding tasks.\",\n  \"release_date\": \"2025-03-12\",\n  \"announcement_date\": \"2025-03-12\",\n  \"license_id\": \"gemma\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 12000000000,\n  \"training_tokens\": 12000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": null,\n  \"source_paper\": \"https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf\",\n  \"source_scorecard_blog_link\": \"https://huggingface.co/blog/gemma3\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/google/gemma-3-12b-it\",\n  \"created_at\": \"2025-07-19T19:49:05.444134+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.444134+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemma-3-1b-it/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1099,\n    \"benchmark_id\": \"big-bench-extra-hard\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.072,\n    \"normalized_score\": 0.072,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.289054+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.289054+00:00\",\n    \"benchmark_name\": \"BIG-Bench Extra Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1075,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.391,\n    \"normalized_score\": 0.391,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.240587+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.240587+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1151,\n    \"benchmark_id\": \"bird-sql-(dev)\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.064,\n    \"normalized_score\": 0.064,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"- evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.421336+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.421336+00:00\",\n    \"benchmark_name\": \"Bird-SQL (dev)\"\n  },\n  {\n    \"model_benchmark_id\": 1225,\n    \"benchmark_id\": \"eclektic\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.014,\n    \"normalized_score\": 0.014,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.574307+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.574307+00:00\",\n    \"benchmark_name\": \"ECLeKTic\"\n  },\n  {\n    \"model_benchmark_id\": 1094,\n    \"benchmark_id\": \"facts-grounding\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.364,\n    \"normalized_score\": 0.364,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"- evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.276605+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.276605+00:00\",\n    \"benchmark_name\": \"FACTS Grounding\"\n  },\n  {\n    \"model_benchmark_id\": 1216,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.342,\n    \"normalized_score\": 0.342,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.557306+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.557306+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 276,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.192,\n    \"normalized_score\": 0.192,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.633668+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.633668+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 984,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.628,\n    \"normalized_score\": 0.628,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.064705+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.064705+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 1162,\n    \"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.158,\n    \"normalized_score\": 0.158,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.445125+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.445125+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 772,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.415,\n    \"normalized_score\": 0.415,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.623656+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.623656+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 610,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.802,\n    \"normalized_score\": 0.802,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.260062+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.260062+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1109,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.019,\n    \"normalized_score\": 0.019,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.311408+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.311408+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 386,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.48,\n    \"normalized_score\": 0.48,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.832121+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.832121+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1174,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.352,\n    \"normalized_score\": 0.352,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.474036+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.474036+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 172,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.147,\n    \"normalized_score\": 0.147,\n 
   \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.434242+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.434242+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1202,\n    \"benchmark_id\": \"natural2code\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.56,\n    \"normalized_score\": 0.56,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.529701+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.529701+00:00\",\n    \"benchmark_name\": \"Natural2Code\"\n  },\n  {\n    \"model_benchmark_id\": 232,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.022,\n    \"normalized_score\": 0.022,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.544931+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.544931+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 1233,\n    \"benchmark_id\": \"wmt24++\",\n    
\"model_id\": \"gemma-3-1b-it\",\n    \"score\": 0.359,\n    \"normalized_score\": 0.359,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.590063+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.590063+00:00\",\n    \"benchmark_name\": \"WMT24++\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemma-3-1b-it/model.json",
    "content": "{\n  \"model_id\": \"gemma-3-1b-it\",\n  \"name\": \"Gemma 3 1B\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"The Gemma 3 1B model is a lightweight, 1-billion-parameter language model by Google, optimized for efficiency on resource-limited devices. At 529MB, it processes text at 2,585 tokens/second with a context window of 32K tokens. It supports 35+ languages but handles text-only input, unlike larger multimodal Gemma models. This balance of speed and efficiency makes it ideal for fast text processing on mobile and low-power devices.\",\n  \"release_date\": \"2025-03-12\",\n  \"announcement_date\": \"2025-03-12\",\n  \"license_id\": \"gemma\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 1000000000,\n  \"training_tokens\": 2000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/google/gemma-3-1b-it\",\n  \"source_playground\": \"https://huggingface.co/chat/models/google/gemma-3-1b-it\",\n  \"source_paper\": \"https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf\",\n  \"source_scorecard_blog_link\": \"https://huggingface.co/blog/gemma3\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/google/gemma-3-1b-it\",\n  \"created_at\": \"2025-07-19T19:49:05.527185+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.527185+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemma-3-27b-it/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1249,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.845,\n    \"normalized_score\": 0.845,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.624921+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.624921+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 1098,\n    \"benchmark_id\": \"big-bench-extra-hard\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.193,\n    \"normalized_score\": 0.193,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.286991+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.286991+00:00\",\n    \"benchmark_name\": \"BIG-Bench Extra Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1074,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.876,\n    \"normalized_score\": 0.876,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.238868+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.238868+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1150,\n    \"benchmark_id\": \"bird-sql-(dev)\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.544,\n    \"normalized_score\": 0.544,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"- evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.418526+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.418526+00:00\",\n    \"benchmark_name\": \"Bird-SQL (dev)\"\n  },\n  {\n    \"model_benchmark_id\": 857,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.78,\n    \"normalized_score\": 0.78,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.793657+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.793657+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 880,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.866,\n    \"normalized_score\": 0.866,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.834284+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.834284+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1224,\n    \"benchmark_id\": \"eclektic\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.167,\n    \"normalized_score\": 0.167,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.572334+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.572334+00:00\",\n    \"benchmark_name\": \"ECLeKTic\"\n  },\n  {\n    \"model_benchmark_id\": 1093,\n    \"benchmark_id\": \"facts-grounding\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.749,\n    \"normalized_score\": 0.749,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"- evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.275050+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.275050+00:00\",\n    \"benchmark_name\": \"FACTS Grounding\"\n  },\n  {\n    \"model_benchmark_id\": 1215,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.751,\n    \"normalized_score\": 0.751,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.555532+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.555532+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 275,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.424,\n    \"normalized_score\": 0.424,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.628803+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.628803+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 983,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.959,\n    \"normalized_score\": 0.959,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.063038+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.063038+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 1161,\n    \"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.603,\n    \"normalized_score\": 0.603,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.443231+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.443231+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 771,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.878,\n    \"normalized_score\": 0.878,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.621954+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.621954+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 609,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.904,\n    \"normalized_score\": 0.904,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.258406+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.258406+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1240,\n    \"benchmark_id\": \"infovqa\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.706,\n    
\"normalized_score\": 0.706,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.607541+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.607541+00:00\",\n    \"benchmark_name\": \"InfoVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1108,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.297,\n    \"normalized_score\": 0.297,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.308517+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.308517+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 385,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.89,\n    \"normalized_score\": 0.89,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.830123+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.830123+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1268,\n    
\"benchmark_id\": \"mathvista-mini\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.676,\n    \"normalized_score\": 0.676,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.660624+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.660624+00:00\",\n    \"benchmark_name\": \"MathVista-Mini\"\n  },\n  {\n    \"model_benchmark_id\": 1173,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.744,\n    \"normalized_score\": 0.744,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.472259+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.472259+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 171,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.675,\n    \"normalized_score\": 0.675,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.432013+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.432013+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1237,\n    \"benchmark_id\": \"mmmu-(val)\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.649,\n    \"normalized_score\": 0.649,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.599826+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.599826+00:00\",\n    \"benchmark_name\": \"MMMU (val)\"\n  },\n  {\n    \"model_benchmark_id\": 1201,\n    \"benchmark_id\": \"natural2code\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.845,\n    \"normalized_score\": 0.845,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.528235+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.528235+00:00\",\n    \"benchmark_name\": \"Natural2Code\"\n  },\n  {\n    \"model_benchmark_id\": 231,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.1,\n    \"normalized_score\": 0.1,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.543428+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.543428+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 905,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.651,\n    \"normalized_score\": 0.651,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.886992+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.886992+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1265,\n    \"benchmark_id\": \"vqav2-(val)\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.71,\n    \"normalized_score\": 0.71,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.653584+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.653584+00:00\",\n    \"benchmark_name\": \"VQAv2 (val)\"\n  },\n  {\n    \"model_benchmark_id\": 1232,\n    \"benchmark_id\": \"wmt24++\",\n    \"model_id\": \"gemma-3-27b-it\",\n    \"score\": 0.534,\n    \"normalized_score\": 0.534,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.587542+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.587542+00:00\",\n    \"benchmark_name\": \"WMT24++\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemma-3-27b-it/model.json",
    "content": "{\n  \"model_id\": \"gemma-3-27b-it\",\n  \"name\": \"Gemma 3 27B\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemma 3 27B is a 27-billion-parameter vision-language model from Google, handling text and image input and generating text output. It features a 128K context window, multilingual support, and open weights. Suitable for complex question answering, summarization, reasoning, and image understanding tasks.\",\n  \"release_date\": \"2025-03-12\",\n  \"announcement_date\": \"2025-03-12\",\n  \"license_id\": \"gemma\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 27000000000,\n  \"training_tokens\": 14000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": null,\n  \"source_paper\": \"https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf\",\n  \"source_scorecard_blog_link\": \"https://huggingface.co/blog/gemma3\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/google/gemma-3-27b-it\",\n  \"created_at\": \"2025-07-19T19:49:05.523800+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.523800+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemma-3-4b-it/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1248,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.748,\n    \"normalized_score\": 0.748,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.622871+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.622871+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 1097,\n    \"benchmark_id\": \"big-bench-extra-hard\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.11,\n    \"normalized_score\": 0.11,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.285056+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.285056+00:00\",\n    \"benchmark_name\": \"BIG-Bench Extra Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1073,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.722,\n    \"normalized_score\": 0.722,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.237255+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.237255+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1149,\n    \"benchmark_id\": \"bird-sql-(dev)\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.363,\n    \"normalized_score\": 0.363,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"- evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.417046+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.417046+00:00\",\n    \"benchmark_name\": \"Bird-SQL (dev)\"\n  },\n  {\n    \"model_benchmark_id\": 856,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.688,\n    \"normalized_score\": 0.688,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.791952+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.791952+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 879,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.758,\n    \"normalized_score\": 0.758,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.832468+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.832468+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1223,\n    \"benchmark_id\": \"eclektic\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.046,\n    \"normalized_score\": 0.046,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.570776+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.570776+00:00\",\n    \"benchmark_name\": \"ECLeKTic\"\n  },\n  {\n    \"model_benchmark_id\": 1092,\n    \"benchmark_id\": \"facts-grounding\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.701,\n    \"normalized_score\": 0.701,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"- evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.273464+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.273464+00:00\",\n    \"benchmark_name\": \"FACTS Grounding\"\n  },\n  {\n    \"model_benchmark_id\": 1214,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.545,\n    \"normalized_score\": 0.545,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.553690+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.553690+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 274,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.308,\n    \"normalized_score\": 0.308,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.625675+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.625675+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 982,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.892,\n    \"normalized_score\": 0.892,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.061601+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.061601+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 1160,\n    \"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.43,\n    \"normalized_score\": 0.43,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.440350+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.440350+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 770,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.713,\n    \"normalized_score\": 0.713,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.620468+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.620468+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 608,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.902,\n    \"normalized_score\": 0.902,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.256346+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.256346+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1239,\n    \"benchmark_id\": \"infovqa\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.5,\n    
\"normalized_score\": 0.5,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.605648+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.605648+00:00\",\n    \"benchmark_name\": \"InfoVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1107,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.126,\n    \"normalized_score\": 0.126,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.306674+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.306674+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 384,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.756,\n    \"normalized_score\": 0.756,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.828322+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.828322+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1267,\n    
\"benchmark_id\": \"mathvista-mini\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.5,\n    \"normalized_score\": 0.5,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.659077+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.659077+00:00\",\n    \"benchmark_name\": \"MathVista-Mini\"\n  },\n  {\n    \"model_benchmark_id\": 1172,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.632,\n    \"normalized_score\": 0.632,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.469983+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.469983+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 170,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.436,\n    \"normalized_score\": 0.436,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.430343+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.430343+00:00\",\n   
 \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1236,\n    \"benchmark_id\": \"mmmu-(val)\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.488,\n    \"normalized_score\": 0.488,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.597769+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.597769+00:00\",\n    \"benchmark_name\": \"MMMU (val)\"\n  },\n  {\n    \"model_benchmark_id\": 1200,\n    \"benchmark_id\": \"natural2code\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.703,\n    \"normalized_score\": 0.703,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.526663+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.526663+00:00\",\n    \"benchmark_name\": \"Natural2Code\"\n  },\n  {\n    \"model_benchmark_id\": 230,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.04,\n    \"normalized_score\": 0.04,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:11.542000+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.542000+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 904,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.578,\n    \"normalized_score\": 0.578,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.885190+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.885190+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1264,\n    \"benchmark_id\": \"vqav2-(val)\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.624,\n    \"normalized_score\": 0.624,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"multimodal evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.652122+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.652122+00:00\",\n    \"benchmark_name\": \"VQAv2 (val)\"\n  },\n  {\n    \"model_benchmark_id\": 1231,\n    \"benchmark_id\": \"wmt24++\",\n    \"model_id\": \"gemma-3-4b-it\",\n    \"score\": 0.468,\n    \"normalized_score\": 0.468,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.google.dev/gemma/docs/core/model_card_3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": 
null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.586157+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.586157+00:00\",\n    \"benchmark_name\": \"WMT24++\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemma-3-4b-it/model.json",
    "content": "{\n  \"model_id\": \"gemma-3-4b-it\",\n  \"name\": \"Gemma 3 4B\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemma 3 4B is a 4-billion-parameter vision-language model from Google, handling text and image input and generating text output. It features a 128K context window, multilingual support, and open weights. Suitable for question answering, summarization, reasoning, and image understanding tasks.\",\n  \"release_date\": \"2025-03-12\",\n  \"announcement_date\": \"2025-03-12\",\n  \"license_id\": \"gemma\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-08-01\",\n  \"param_count\": 4000000000,\n  \"training_tokens\": 4000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": null,\n  \"source_paper\": \"https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf\",\n  \"source_scorecard_blog_link\": \"https://huggingface.co/blog/gemma3\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/google/gemma-3-4b-it\",\n  \"created_at\": \"2025-07-19T19:49:05.520515+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.520515+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemma-3n-e2b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 10,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"gemma-3n-e2b\",\n    \"score\": 0.517,\n    \"normalized_score\": 0.517,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"25-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.102376+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.102376+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1056,\n    \"benchmark_id\": \"arc-e\",\n    \"model_id\": \"gemma-3n-e2b\",\n    \"score\": 0.758,\n    \"normalized_score\": 0.758,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.204955+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.204955+00:00\",\n    \"benchmark_name\": \"ARC-E\"\n  },\n  {\n    \"model_benchmark_id\": 1071,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"gemma-3n-e2b\",\n    \"score\": 0.443,\n    \"normalized_score\": 0.443,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"few-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.233872+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.233872+00:00\",\n    
\"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1022,\n    \"benchmark_id\": \"boolq\",\n    \"model_id\": \"gemma-3n-e2b\",\n    \"score\": 0.764,\n    \"normalized_score\": 0.764,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.127882+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.127882+00:00\",\n    \"benchmark_name\": \"BoolQ\"\n  },\n  {\n    \"model_benchmark_id\": 946,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"gemma-3n-e2b\",\n    \"score\": 0.539,\n    \"normalized_score\": 0.539,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Token F1 score. 
1-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.996776+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.996776+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 39,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"gemma-3n-e2b\",\n    \"score\": 0.722,\n    \"normalized_score\": 0.722,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.166470+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.166470+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 1049,\n    \"benchmark_id\": \"natural-questions\",\n    \"model_id\": \"gemma-3n-e2b\",\n    \"score\": 0.155,\n    \"normalized_score\": 0.155,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.190039+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.190039+00:00\",\n    \"benchmark_name\": \"Natural Questions\"\n  },\n  {\n    \"model_benchmark_id\": 1031,\n    \"benchmark_id\": \"piqa\",\n    \"model_id\": \"gemma-3n-e2b\",\n    \"score\": 0.789,\n    \"normalized_score\": 0.789,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.147878+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.147878+00:00\",\n    \"benchmark_name\": \"PIQA\"\n  },\n  {\n    \"model_benchmark_id\": 1040,\n    \"benchmark_id\": \"social-iqa\",\n    \"model_id\": \"gemma-3n-e2b\",\n    \"score\": 0.488,\n    \"normalized_score\": 0.488,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.170669+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.170669+00:00\",\n    \"benchmark_name\": \"Social IQa\"\n  },\n  {\n    \"model_benchmark_id\": 249,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"gemma-3n-e2b\",\n    \"score\": 0.608,\n    \"normalized_score\": 0.608,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.576196+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.576196+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  },\n  {\n    \"model_benchmark_id\": 1061,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"gemma-3n-e2b\",\n    \"score\": 0.668,\n    \"normalized_score\": 0.668,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.213740+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.213740+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemma-3n-e2b/model.json",
    "content": "{\n  \"model_id\": \"gemma-3n-e2b\",\n  \"name\": \"Gemma 3n E2B\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemma 3n is a multimodal model designed to run locally on hardware, supporting image, text, audio, and video inputs. It features a language decoder, audio encoder, and vision encoder, and is available in two sizes: E2B and E4B. The model is optimized for memory efficiency, allowing it to run on devices with limited GPU RAM. Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma models are well-suited for a variety of content understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone. Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for instruction-tuned variants. 
These models were trained on data in over 140 spoken languages.\",\n  \"release_date\": \"2025-06-26\",\n  \"announcement_date\": \"2025-06-26\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-06-01\",\n  \"param_count\": 8000000000,\n  \"training_tokens\": 11000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/blog/gemma3n\",\n  \"source_playground\": \"https://aistudio.google.com/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://ai.google.dev/gemma/docs/gemma-3n\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/google/gemma-3n-E2B\",\n  \"created_at\": \"2025-07-19T19:49:05.508070+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.508070+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemma-3n-e2b-it/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 686,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.067,\n    \"normalized_score\": 0.067,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.437675+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.437675+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1327,\n    \"benchmark_id\": \"codegolf-v2.2\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.11,\n    \"normalized_score\": 0.11,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1. 
0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.787794+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.787794+00:00\",\n    \"benchmark_name\": \"Codegolf v2.2\"\n  },\n  {\n    \"model_benchmark_id\": 1226,\n    \"benchmark_id\": \"eclektic\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.025,\n    \"normalized_score\": 0.025,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.575847+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.575847+00:00\",\n    \"benchmark_name\": \"ECLeKTic\"\n  },\n  {\n    \"model_benchmark_id\": 1316,\n    \"benchmark_id\": \"global-mmlu\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.551,\n    \"normalized_score\": 0.551,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.758455+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.758455+00:00\",\n    \"benchmark_name\": \"Global-MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1218,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.59,\n    \"normalized_score\": 0.59,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.560513+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.560513+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 280,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.248,\n    \"normalized_score\": 0.248,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond. 0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.641018+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.641018+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1165,\n    \"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.277,\n    \"normalized_score\": 0.277,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy. 
0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.451948+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.451948+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 774,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.665,\n    \"normalized_score\": 0.665,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.626596+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.626596+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1307,\n    \"benchmark_id\": \"include\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.386,\n    \"normalized_score\": 0.386,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.735634+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.735634+00:00\",\n    \"benchmark_name\": \"Include\"\n  },\n  {\n    \"model_benchmark_id\": 1112,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.132,\n    \"normalized_score\": 0.132,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.320311+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.320311+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 1323,\n    \"benchmark_id\": \"livecodebench-v5\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.186,\n    \"normalized_score\": 0.186,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.777049+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.777049+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v5\"\n  },\n  {\n    \"model_benchmark_id\": 1176,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.566,\n    \"normalized_score\": 0.566,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1. 
3-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.477545+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.477545+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1278,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.531,\n    \"normalized_score\": 0.531,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.679623+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.679623+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 71,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.601,\n    \"normalized_score\": 0.601,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy. 
0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.234595+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.234595+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 175,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.405,\n    \"normalized_score\": 0.405,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.439365+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.439365+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1312,\n    \"benchmark_id\": \"mmlu-prox\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.081,\n    \"normalized_score\": 0.081,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.746554+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.746554+00:00\",\n    \"benchmark_name\": \"MMLU-ProX\"\n  },\n  {\n    \"model_benchmark_id\": 1432,\n    \"benchmark_id\": \"openai-mmlu\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.223,\n    \"normalized_score\": 0.223,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.047435+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.047435+00:00\",\n    \"benchmark_name\": \"OpenAI MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1234,\n    \"benchmark_id\": \"wmt24++\",\n    \"model_id\": \"gemma-3n-e2b-it\",\n    \"score\": 0.427,\n    \"normalized_score\": 0.427,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Character-level F-score. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.592107+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.592107+00:00\",\n    \"benchmark_name\": \"WMT24++\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemma-3n-e2b-it/model.json",
    "content": "{\n  \"model_id\": \"gemma-3n-e2b-it\",\n  \"name\": \"Gemma 3n E2B Instructed\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemma 3n is a multimodal model designed to run locally on hardware, supporting image, text, audio, and video inputs. It features a language decoder, audio encoder, and vision encoder, and is available in two sizes: E2B and E4B. The model is optimized for memory efficiency, allowing it to run on devices with limited GPU RAM. Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma models are well-suited for a variety of content understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone. Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for instruction-tuned variants. 
These models were trained with data in over 140 spoken languages.\",\n  \"release_date\": \"2025-06-26\",\n  \"announcement_date\": \"2025-06-26\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-06-01\",\n  \"param_count\": 8000000000,\n  \"training_tokens\": 11000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/blog/gemma3n\",\n  \"source_playground\": \"https://aistudio.google.com/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://ai.google.dev/gemma/docs/gemma-3n\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/google/gemma-3n-E2B-it\",\n  \"created_at\": \"2025-07-19T19:49:05.541972+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.541972+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemma-3n-e2b-it-litert-preview/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 680,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.067,\n    \"normalized_score\": 0.067,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.419451+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.419451+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 7,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.517,\n    \"normalized_score\": 0.517,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"25-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.095909+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.095909+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1053,\n    \"benchmark_id\": \"arc-e\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.758,\n    \"normalized_score\": 0.758,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:13.199540+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.199540+00:00\",\n    \"benchmark_name\": \"ARC-E\"\n  },\n  {\n    \"model_benchmark_id\": 1069,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.443,\n    \"normalized_score\": 0.443,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"few-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.229977+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.229977+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1019,\n    \"benchmark_id\": \"boolq\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.764,\n    \"normalized_score\": 0.764,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.123278+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.123278+00:00\",\n    \"benchmark_name\": \"BoolQ\"\n  },\n  {\n    \"model_benchmark_id\": 1325,\n    \"benchmark_id\": \"codegolf-v2.2\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.11,\n    \"normalized_score\": 0.11,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"0-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.783685+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.783685+00:00\",\n    \"benchmark_name\": \"Codegolf v2.2\"\n  },\n  {\n    \"model_benchmark_id\": 944,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.539,\n    \"normalized_score\": 0.539,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"1-shot Token F1 score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.993202+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.993202+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 1221,\n    \"benchmark_id\": \"eclektic\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.025,\n    \"normalized_score\": 0.025,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot ECLeKTic score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.567241+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.567241+00:00\",\n    \"benchmark_name\": \"ECLeKTic\"\n  },\n  {\n    \"model_benchmark_id\": 1314,\n    \"benchmark_id\": \"global-mmlu\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.551,\n    \"normalized_score\": 0.551,\n    
\"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.754602+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.754602+00:00\",\n    \"benchmark_name\": \"Global-MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1208,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.59,\n    \"normalized_score\": 0.59,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.542151+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.542151+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 265,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.248,\n    \"normalized_score\": 0.248,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond, 0-shot RelaxedAccuracy/accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.609514+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.609514+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  
{\n    \"model_benchmark_id\": 35,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.722,\n    \"normalized_score\": 0.722,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.154889+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.154889+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 1155,\n    \"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.277,\n    \"normalized_score\": 0.277,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.431354+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.431354+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 764,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.665,\n    \"normalized_score\": 0.665,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:12.609959+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.609959+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1305,\n    \"benchmark_id\": \"include\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.386,\n    \"normalized_score\": 0.386,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.731041+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.731041+00:00\",\n    \"benchmark_name\": \"Include\"\n  },\n  {\n    \"model_benchmark_id\": 1103,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.132,\n    \"normalized_score\": 0.132,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.298197+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.298197+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 1319,\n    \"benchmark_id\": \"livecodebench-v5\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.186,\n    \"normalized_score\": 0.186,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": 
\"0-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.768006+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.768006+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v5\"\n  },\n  {\n    \"model_benchmark_id\": 1168,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.566,\n    \"normalized_score\": 0.566,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.460487+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.460487+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1274,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.531,\n    \"normalized_score\": 0.531,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.672774+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.672774+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 65,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.601,\n    \"normalized_score\": 0.601,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.222830+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.222830+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 165,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.405,\n    \"normalized_score\": 0.405,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.421645+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.421645+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1310,\n    \"benchmark_id\": \"mmlu-prox\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.081,\n    \"normalized_score\": 0.081,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.743201+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.743201+00:00\",\n    \"benchmark_name\": \"MMLU-ProX\"\n  },\n  {\n    \"model_benchmark_id\": 1046,\n    \"benchmark_id\": 
\"natural-questions\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.155,\n    \"normalized_score\": 0.155,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.184897+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.184897+00:00\",\n    \"benchmark_name\": \"Natural Questions\"\n  },\n  {\n    \"model_benchmark_id\": 1028,\n    \"benchmark_id\": \"piqa\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.789,\n    \"normalized_score\": 0.789,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.142086+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.142086+00:00\",\n    \"benchmark_name\": \"PIQA\"\n  },\n  {\n    \"model_benchmark_id\": 1037,\n    \"benchmark_id\": \"social-iqa\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.488,\n    \"normalized_score\": 0.488,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.164056+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.164056+00:00\",\n    \"benchmark_name\": \"Social IQa\"\n  },\n  {\n    \"model_benchmark_id\": 246,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.608,\n    \"normalized_score\": 0.608,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.571204+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.571204+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  },\n  {\n    \"model_benchmark_id\": 1059,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.668,\n    \"normalized_score\": 0.668,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.210650+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.210650+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  },\n  {\n    \"model_benchmark_id\": 1229,\n    \"benchmark_id\": \"wmt24++\",\n    \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n    \"score\": 0.427,\n    \"normalized_score\": 0.427,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"ChrF, 0-shot 
Character-level F-score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.582347+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.582347+00:00\",\n    \"benchmark_name\": \"WMT24++\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemma-3n-e2b-it-litert-preview/model.json",
    "content": "{\n  \"model_id\": \"gemma-3n-e2b-it-litert-preview\",\n  \"name\": \"Gemma 3n E2B Instructed LiteRT (Preview)\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemma 3n is a generative AI model optimized for use in everyday devices, such as phones, laptops, and tablets. It features innovations like Per-Layer Embedding (PLE) parameter caching and a MatFormer model architecture for reduced compute and memory. These models handle audio, text, and visual data, though this E4B preview currently supports text and vision input. Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models, and is licensed for responsible commercial use.\",\n  \"release_date\": \"2025-05-20\",\n  \"announcement_date\": \"2025-05-20\",\n  \"license_id\": \"gemma\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-06-01\",\n  \"param_count\": 1910000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://aistudio.google.com/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://ai.google.dev/gemma/docs/gemma-3n\",\n  \"source_repo_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n  \"source_weights_link\": \"https://huggingface.co/google/gemma-3n-E2B-it-litert-preview\",\n  \"created_at\": \"2025-07-19T19:49:05.466473+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.466473+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemma-3n-e4b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 5,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"gemma-3n-e4b\",\n    \"score\": 0.616,\n    \"normalized_score\": 0.616,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"25-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.091862+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.091862+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1051,\n    \"benchmark_id\": \"arc-e\",\n    \"model_id\": \"gemma-3n-e4b\",\n    \"score\": 0.816,\n    \"normalized_score\": 0.816,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.195091+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.195091+00:00\",\n    \"benchmark_name\": \"ARC-E\"\n  },\n  {\n    \"model_benchmark_id\": 1066,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"gemma-3n-e4b\",\n    \"score\": 0.529,\n    \"normalized_score\": 0.529,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"few-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.225269+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.225269+00:00\",\n    
\"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1017,\n    \"benchmark_id\": \"boolq\",\n    \"model_id\": \"gemma-3n-e4b\",\n    \"score\": 0.816,\n    \"normalized_score\": 0.816,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.120054+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.120054+00:00\",\n    \"benchmark_name\": \"BoolQ\"\n  },\n  {\n    \"model_benchmark_id\": 942,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"gemma-3n-e4b\",\n    \"score\": 0.608,\n    \"normalized_score\": 0.608,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Token F1 score. 
1-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.989555+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.989555+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 33,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"gemma-3n-e4b\",\n    \"score\": 0.786,\n    \"normalized_score\": 0.786,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.150880+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.150880+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 1044,\n    \"benchmark_id\": \"natural-questions\",\n    \"model_id\": \"gemma-3n-e4b\",\n    \"score\": 0.209,\n    \"normalized_score\": 0.209,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.181324+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.181324+00:00\",\n    \"benchmark_name\": \"Natural Questions\"\n  },\n  {\n    \"model_benchmark_id\": 1026,\n    \"benchmark_id\": \"piqa\",\n    \"model_id\": \"gemma-3n-e4b\",\n    \"score\": 0.81,\n    \"normalized_score\": 0.81,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.136080+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.136080+00:00\",\n    \"benchmark_name\": \"PIQA\"\n  },\n  {\n    \"model_benchmark_id\": 1035,\n    \"benchmark_id\": \"social-iqa\",\n    \"model_id\": \"gemma-3n-e4b\",\n    \"score\": 0.5,\n    \"normalized_score\": 0.5,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.159816+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.159816+00:00\",\n    \"benchmark_name\": \"Social IQa\"\n  },\n  {\n    \"model_benchmark_id\": 244,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"gemma-3n-e4b\",\n    \"score\": 0.702,\n    \"normalized_score\": 0.702,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.567693+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.567693+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  },\n  {\n    \"model_benchmark_id\": 1057,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"gemma-3n-e4b\",\n    \"score\": 0.717,\n    \"normalized_score\": 0.717,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.207598+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.207598+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemma-3n-e4b/model.json",
    "content": "{\n  \"model_id\": \"gemma-3n-e4b\",\n  \"name\": \"Gemma 3n E4B\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemma 3n is a multimodal model designed to run locally on hardware, supporting image, text, audio, and video inputs. It features a language decoder, audio encoder, and vision encoder, and is available in two sizes: E2B and E4B. The model is optimized for memory efficiency, allowing it to run on devices with limited GPU RAM. Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma models are well-suited for a variety of content understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone. Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for instruction-tuned variants. 
These models were trained with data in over 140 spoken languages.\",\n  \"release_date\": \"2025-06-26\",\n  \"announcement_date\": \"2025-06-26\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-06-01\",\n  \"param_count\": 8000000000,\n  \"training_tokens\": 11000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/blog/gemma3n\",\n  \"source_playground\": \"https://aistudio.google.com/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://ai.google.dev/gemma/docs/gemma-3n\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/google/gemma-3n-E4B\",\n  \"created_at\": \"2025-07-19T19:49:05.440084+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.440084+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemma-3n-e4b-it/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 684,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.116,\n    \"normalized_score\": 0.116,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.431148+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.431148+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1326,\n    \"benchmark_id\": \"codegolf-v2.2\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.168,\n    \"normalized_score\": 0.168,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1. 
0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.785856+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.785856+00:00\",\n    \"benchmark_name\": \"Codegolf v2.2\"\n  },\n  {\n    \"model_benchmark_id\": 1222,\n    \"benchmark_id\": \"eclektic\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.019,\n    \"normalized_score\": 0.019,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.569227+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.569227+00:00\",\n    \"benchmark_name\": \"ECLeKTic\"\n  },\n  {\n    \"model_benchmark_id\": 1315,\n    \"benchmark_id\": \"global-mmlu\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.603,\n    \"normalized_score\": 0.603,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.756363+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.756363+00:00\",\n    \"benchmark_name\": \"Global-MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1213,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.645,\n    \"normalized_score\": 0.645,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.552233+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.552233+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 273,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.237,\n    \"normalized_score\": 0.237,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond. 0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.624084+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.624084+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1159,\n    \"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.377,\n    \"normalized_score\": 0.377,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy. 
0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.438271+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.438271+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 769,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.75,\n    \"normalized_score\": 0.75,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.618954+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.618954+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1306,\n    \"benchmark_id\": \"include\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.572,\n    \"normalized_score\": 0.572,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.733461+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.733461+00:00\",\n    \"benchmark_name\": \"Include\"\n  },\n  {\n    \"model_benchmark_id\": 1106,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.132,\n    \"normalized_score\": 0.132,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.304919+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.304919+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 1322,\n    \"benchmark_id\": \"livecodebench-v5\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.257,\n    \"normalized_score\": 0.257,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.775429+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.775429+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v5\"\n  },\n  {\n    \"model_benchmark_id\": 1171,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.636,\n    \"normalized_score\": 0.636,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1. 
3-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.466832+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.466832+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1277,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.607,\n    \"normalized_score\": 0.607,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.678210+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.678210+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 70,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.649,\n    \"normalized_score\": 0.649,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy. 
0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.232243+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.232243+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 169,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.506,\n    \"normalized_score\": 0.506,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.428457+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.428457+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1311,\n    \"benchmark_id\": \"mmlu-prox\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.199,\n    \"normalized_score\": 0.199,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.744918+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.744918+00:00\",\n    \"benchmark_name\": \"MMLU-ProX\"\n  },\n  {\n    \"model_benchmark_id\": 1431,\n    \"benchmark_id\": \"openai-mmlu\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.356,\n    \"normalized_score\": 0.356,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.045887+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.045887+00:00\",\n    \"benchmark_name\": \"OpenAI MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1230,\n    \"benchmark_id\": \"wmt24++\",\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"score\": 0.501,\n    \"normalized_score\": 0.501,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Character-level F-score. 0-shot.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.584588+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.584588+00:00\",\n    \"benchmark_name\": \"WMT24++\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemma-3n-e4b-it/model.json",
    "content": "{\n  \"model_id\": \"gemma-3n-e4b-it\",\n  \"name\": \"Gemma 3n E4B Instructed\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemma 3n is a multimodal model designed to run locally on hardware, supporting image, text, audio, and video inputs. It features a language decoder, audio encoder, and vision encoder, and is available in two sizes: E2B and E4B. The model is optimized for memory efficiency, allowing it to run on devices with limited GPU RAM. Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma models are well-suited for a variety of content understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone. Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for instruction-tuned variants. 
These models were trained with data in over 140 spoken languages.\",\n  \"release_date\": \"2025-06-26\",\n  \"announcement_date\": \"2025-06-26\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-06-01\",\n  \"param_count\": 8000000000,\n  \"training_tokens\": 11000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/blog/gemma3n\",\n  \"source_playground\": \"https://aistudio.google.com/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://ai.google.dev/gemma/docs/gemma-3n\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/google/gemma-3n-E4B-it\",\n  \"created_at\": \"2025-07-19T19:49:05.517334+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.517334+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/gemma-3n-e4b-it-litert-preview/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 678,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.116,\n    \"normalized_score\": 0.116,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.414248+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.414248+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 6,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.616,\n    \"normalized_score\": 0.616,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"25-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.093723+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.093723+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1052,\n    \"benchmark_id\": \"arc-e\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.816,\n    \"normalized_score\": 0.816,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:13.196728+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.196728+00:00\",\n    \"benchmark_name\": \"ARC-E\"\n  },\n  {\n    \"model_benchmark_id\": 1068,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.529,\n    \"normalized_score\": 0.529,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"few-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.228349+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.228349+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1018,\n    \"benchmark_id\": \"boolq\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.816,\n    \"normalized_score\": 0.816,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.121696+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.121696+00:00\",\n    \"benchmark_name\": \"BoolQ\"\n  },\n  {\n    \"model_benchmark_id\": 1324,\n    \"benchmark_id\": \"codegolf-v2.2\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.168,\n    \"normalized_score\": 0.168,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"0-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.781222+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.781222+00:00\",\n    \"benchmark_name\": \"Codegolf v2.2\"\n  },\n  {\n    \"model_benchmark_id\": 943,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.608,\n    \"normalized_score\": 0.608,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"1-shot Token F1 score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.991359+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.991359+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 1220,\n    \"benchmark_id\": \"eclektic\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.019,\n    \"normalized_score\": 0.019,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot ECLeKTic score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.565422+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.565422+00:00\",\n    \"benchmark_name\": \"ECLeKTic\"\n  },\n  {\n    \"model_benchmark_id\": 1313,\n    \"benchmark_id\": \"global-mmlu\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.603,\n    \"normalized_score\": 0.603,\n    
\"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.752749+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.752749+00:00\",\n    \"benchmark_name\": \"Global-MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1206,\n    \"benchmark_id\": \"global-mmlu-lite\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.645,\n    \"normalized_score\": 0.645,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.538643+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.538643+00:00\",\n    \"benchmark_name\": \"Global-MMLU-Lite\"\n  },\n  {\n    \"model_benchmark_id\": 262,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.237,\n    \"normalized_score\": 0.237,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond, 0-shot RelaxedAccuracy/accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.602493+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.602493+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n 
 {\n    \"model_benchmark_id\": 34,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.786,\n    \"normalized_score\": 0.786,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.152761+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.152761+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 1154,\n    \"benchmark_id\": \"hiddenmath\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.377,\n    \"normalized_score\": 0.377,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.429415+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.429415+00:00\",\n    \"benchmark_name\": \"HiddenMath\"\n  },\n  {\n    \"model_benchmark_id\": 763,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.75,\n    \"normalized_score\": 0.75,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:12.608423+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.608423+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1304,\n    \"benchmark_id\": \"include\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.572,\n    \"normalized_score\": 0.572,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.729199+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.729199+00:00\",\n    \"benchmark_name\": \"Include\"\n  },\n  {\n    \"model_benchmark_id\": 1102,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.132,\n    \"normalized_score\": 0.132,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.296281+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.296281+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 1317,\n    \"benchmark_id\": \"livecodebench-v5\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.257,\n    \"normalized_score\": 0.257,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": 
\"0-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.761673+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.761673+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v5\"\n  },\n  {\n    \"model_benchmark_id\": 1167,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.636,\n    \"normalized_score\": 0.636,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.458570+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.458570+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1273,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.607,\n    \"normalized_score\": 0.607,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.671283+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.671283+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 63,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.649,\n    \"normalized_score\": 0.649,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.219372+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.219372+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 164,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.506,\n    \"normalized_score\": 0.506,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.420000+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.420000+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1309,\n    \"benchmark_id\": \"mmlu-prox\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.199,\n    \"normalized_score\": 0.199,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.741460+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.741460+00:00\",\n    \"benchmark_name\": \"MMLU-ProX\"\n  },\n  {\n    \"model_benchmark_id\": 1045,\n    \"benchmark_id\": 
\"natural-questions\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.209,\n    \"normalized_score\": 0.209,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.183031+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.183031+00:00\",\n    \"benchmark_name\": \"Natural Questions\"\n  },\n  {\n    \"model_benchmark_id\": 1027,\n    \"benchmark_id\": \"piqa\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.81,\n    \"normalized_score\": 0.81,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.137952+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.137952+00:00\",\n    \"benchmark_name\": \"PIQA\"\n  },\n  {\n    \"model_benchmark_id\": 1036,\n    \"benchmark_id\": \"social-iqa\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.5,\n    \"normalized_score\": 0.5,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.161822+00:00\",\n  
  \"updated_at\": \"2025-07-19T19:56:13.161822+00:00\",\n    \"benchmark_name\": \"Social IQa\"\n  },\n  {\n    \"model_benchmark_id\": 245,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.702,\n    \"normalized_score\": 0.702,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.569334+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.569334+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  },\n  {\n    \"model_benchmark_id\": 1058,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.717,\n    \"normalized_score\": 0.717,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.209229+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.209229+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  },\n  {\n    \"model_benchmark_id\": 1228,\n    \"benchmark_id\": \"wmt24++\",\n    \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n    \"score\": 0.501,\n    \"normalized_score\": 0.501,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"ChrF, 0-shot Character-level F-score\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.580409+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.580409+00:00\",\n    \"benchmark_name\": \"WMT24++\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/gemma-3n-e4b-it-litert-preview/model.json",
    "content": "{\n  \"model_id\": \"gemma-3n-e4b-it-litert-preview\",\n  \"name\": \"Gemma 3n E4B Instructed LiteRT Preview\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Gemma 3n is a generative AI model optimized for use in everyday devices, such as phones, laptops, and tablets. It features innovations like Per-Layer Embedding (PLE) parameter caching and a MatFormer model architecture for reduced compute and memory. These models handle audio, text, and visual data, though this E4B preview currently supports text and vision input. Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models, and is licensed for responsible commercial use.\",\n  \"release_date\": \"2025-05-20\",\n  \"announcement_date\": \"2025-05-20\",\n  \"license_id\": \"gemma\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-06-01\",\n  \"param_count\": 1910000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://aistudio.google.com/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://ai.google.dev/gemma/docs/gemma-3n\",\n  \"source_repo_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n  \"source_weights_link\": \"https://huggingface.co/google/gemma-3n-E4B-it-litert-preview\",\n  \"created_at\": \"2025-07-19T19:49:05.451978+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.451978+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/models/medgemma-4b-it/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1425,\n    \"benchmark_id\": \"chexpert-cxr\",\n    \"model_id\": \"medgemma-4b-it\",\n    \"score\": 0.481,\n    \"normalized_score\": 0.481,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/medgemma-4b-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Average F1 for top 5 conditions\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.023334+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.023334+00:00\",\n    \"benchmark_name\": \"CheXpert CXR\"\n  },\n  {\n    \"model_benchmark_id\": 1426,\n    \"benchmark_id\": \"dermmcqa\",\n    \"model_id\": \"medgemma-4b-it\",\n    \"score\": 0.718,\n    \"normalized_score\": 0.718,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/medgemma-4b-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.026812+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.026812+00:00\",\n    \"benchmark_name\": \"DermMCQA\"\n  },\n  {\n    \"model_benchmark_id\": 1430,\n    \"benchmark_id\": \"medxpertqa\",\n    \"model_id\": \"medgemma-4b-it\",\n    \"score\": 0.188,\n    \"normalized_score\": 0.188,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/medgemma-4b-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.042823+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:14.042823+00:00\",\n    \"benchmark_name\": \"MedXpertQA\"\n  },\n  {\n    \"model_benchmark_id\": 1424,\n    \"benchmark_id\": \"mimic-cxr\",\n    \"model_id\": \"medgemma-4b-it\",\n    \"score\": 0.889,\n    \"normalized_score\": 0.889,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/medgemma-4b-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Average F1 for top 5 conditions\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.019964+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.019964+00:00\",\n    \"benchmark_name\": \"MIMIC CXR\"\n  },\n  {\n    \"model_benchmark_id\": 1429,\n    \"benchmark_id\": \"pathmcqa\",\n    \"model_id\": \"medgemma-4b-it\",\n    \"score\": 0.698,\n    \"normalized_score\": 0.698,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/medgemma-4b-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.039089+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.039089+00:00\",\n    \"benchmark_name\": \"PathMCQA\"\n  },\n  {\n    \"model_benchmark_id\": 1427,\n    \"benchmark_id\": \"slakevqa\",\n    \"model_id\": \"medgemma-4b-it\",\n    \"score\": 0.623,\n    \"normalized_score\": 0.623,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/medgemma-4b-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Tokenized F1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.029835+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.029835+00:00\",\n    \"benchmark_name\": \"SlakeVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1428,\n    \"benchmark_id\": \"vqa-rad\",\n    \"model_id\": \"medgemma-4b-it\",\n    \"score\": 0.499,\n    \"normalized_score\": 0.499,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/google/medgemma-4b-it\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Tokenized F1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.035504+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.035504+00:00\",\n    \"benchmark_name\": \"VQA-Rad\"\n  }\n]"
  },
  {
    "path": "data/organizations/google/models/medgemma-4b-it/model.json",
    "content": "{\n  \"model_id\": \"medgemma-4b-it\",\n  \"name\": \"MedGemma 4B IT\",\n  \"organization_id\": \"google\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"MedGemma is a collection of Gemma 3 variants that are trained for performance on medical text and image comprehension. MedGemma 4B utilizes a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including chest X-rays, dermatology images, ophthalmology images, and histopathology slides. Its LLM component is trained on a diverse set of medical data, including radiology images, histopathology patches, ophthalmology images, and dermatology images. MedGemma is a multimodal model primarily evaluated on single-image tasks. It has not been evaluated for multi-turn applications and may be more sensitive to specific prompts than its predecessor, Gemma 3. Developers should consider bias in validation data and data contamination concerns when using MedGemma.\",\n  \"release_date\": \"2025-05-20\",\n  \"announcement_date\": \"2025-05-20\",\n  \"license_id\": \"health_ai_developer_foundations_terms_of_use\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 4300000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://developers.google.com/health-ai-developer-foundations/medgemma/get-started\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://developers.google.com/health-ai-developer-foundations/medgemma/model-card\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/google/medgemma-4b-it\",\n  \"created_at\": \"2025-07-19T19:49:05.511963+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.511963+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/google/organization.json",
    "content": "{\n  \"organization_id\": \"google\",\n  \"name\": \"Google\",\n  \"website\": \"https://google.com\",\n  \"description\": \"Technology giant with AI research\",\n  \"country\": \"US\",\n  \"created_at\": \"2025-07-19T19:49:05.437977+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.437977+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/ibm/models/granite-3.3-8b-base/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1409,\n    \"benchmark_id\": \"agieval\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.493,\n    \"normalized_score\": 0.493,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.976963+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.976963+00:00\",\n    \"benchmark_name\": \"AGIEval\"\n  },\n  {\n    \"model_benchmark_id\": 477,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.812,\n    \"normalized_score\": 0.812,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Not specified\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.006332+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.006332+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 1794,\n    \"benchmark_id\": \"alpacaeval-2.0\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.6268,\n    \"normalized_score\": 0.6268,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:15.048676+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.048676+00:00\",\n    \"benchmark_name\": \"AlpacaEval 2.0\"\n  },\n  {\n    \"model_benchmark_id\": 23,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.5084,\n    \"normalized_score\": 0.5084,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.131347+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.131347+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1460,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.5756,\n    \"normalized_score\": 0.5756,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Arena Hard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.111734+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.111734+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1807,\n    \"benchmark_id\": \"attaq\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.885,\n    \"normalized_score\": 0.885,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Not specified (OLMES)\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.087212+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.087212+00:00\",\n    \"benchmark_name\": \"AttaQ\"\n  },\n  {\n    \"model_benchmark_id\": 1081,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.6913,\n    \"normalized_score\": 0.6913,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OLMES (Added regex for more efficient answer extraction)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.251020+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.251020+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 955,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.3614,\n    \"normalized_score\": 0.3614,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.012196+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.012196+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 1004,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.59,\n    \"normalized_score\": 0.59,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.098078+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.098078+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 49,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.801,\n    \"normalized_score\": 0.801,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.186799+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.186799+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 798,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.8973,\n    \"normalized_score\": 0.8973,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OLMES\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.666882+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.666882+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1444,\n    \"benchmark_id\": \"humaneval+\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.8609,\n    
\"normalized_score\": 0.8609,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OLMES\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.078662+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.078662+00:00\",\n    \"benchmark_name\": \"HumanEval+\"\n  },\n  {\n    \"model_benchmark_id\": 626,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.7482,\n    \"normalized_score\": 0.7482,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OLMES\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.288064+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.288064+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 508,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.6902,\n    \"normalized_score\": 0.6902,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Not specified\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.056690+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.056690+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    \"model_benchmark_id\": 101,\n    \"benchmark_id\": 
\"mmlu\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.6389,\n    \"normalized_score\": 0.6389,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.290899+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.290899+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1808,\n    \"benchmark_id\": \"nq\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.365,\n    \"normalized_score\": 0.365,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.090844+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.090844+00:00\",\n    \"benchmark_name\": \"NQ\"\n  },\n  {\n    \"model_benchmark_id\": 1804,\n    \"benchmark_id\": \"popqa\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.2617,\n    \"normalized_score\": 0.2617,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.078883+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.078883+00:00\",\n    \"benchmark_name\": \"PopQA\"\n  },\n  {\n   
 \"model_benchmark_id\": 250,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.7818,\n    \"normalized_score\": 0.7818,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.577753+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.577753+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  },\n  {\n    \"model_benchmark_id\": 142,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.5215,\n    \"normalized_score\": 0.5215,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.362380+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.362380+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 152,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"granite-3.3-8b-base\",\n    \"score\": 0.744,\n    \"normalized_score\": 0.744,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.387990+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.387990+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/ibm/models/granite-3.3-8b-base/model.json",
    "content": "{\n  \"model_id\": \"granite-3.3-8b-base\",\n  \"name\": \"Granite 3.3 8B Base\",\n  \"organization_id\": \"ibm\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Granite-3.3-8B-Base is a decoder-only language model with a 128K token context window. It improves upon Granite-3.1-8B-Base by adding support for Fill-in-the-Middle (FIM) using specialized tokens, enabling the model to generate content conditioned on both prefix and suffix. This makes it well-suited for code completion tasks.\",\n  \"release_date\": \"2025-04-16\",\n  \"announcement_date\": \"2025-04-16\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-04-01\",\n  \"param_count\": 8170000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.ibm.com/granite/docs/\",\n  \"source_playground\": \"https://www.ibm.com/granite/playground/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n  \"source_repo_link\": \"https://github.com/ibm-granite/granite-3.3-language-models\",\n  \"source_weights_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-base\",\n  \"created_at\": \"2025-07-19T19:49:05.727013+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.727013+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/ibm/models/granite-3.3-8b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 476,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.812,\n    \"normalized_score\": 0.812,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Not specified\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.004852+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.004852+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 1793,\n    \"benchmark_id\": \"alpacaeval-2.0\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.6268,\n    \"normalized_score\": 0.6268,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.046908+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.046908+00:00\",\n    \"benchmark_name\": \"AlpacaEval 2.0\"\n  },\n  {\n    \"model_benchmark_id\": 1459,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.5756,\n    \"normalized_score\": 0.5756,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Arena Hard benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.110277+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.110277+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1806,\n    \"benchmark_id\": \"attaq\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.885,\n    \"normalized_score\": 0.885,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Not specified (OLMES)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.085492+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.085492+00:00\",\n    \"benchmark_name\": \"AttaQ\"\n  },\n  {\n    \"model_benchmark_id\": 1080,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.6913,\n    \"normalized_score\": 0.6913,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OLMES (Added regex for more efficient answer extraction)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.249459+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.249459+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 954,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.5936,\n    \"normalized_score\": 0.5936,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"OLMES (Modified implementation)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.010691+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.010691+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 1003,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.8089,\n    \"normalized_score\": 0.8089,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OLMES\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.095998+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.095998+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 797,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.8973,\n    \"normalized_score\": 0.8973,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OLMES\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.665403+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.665403+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1443,\n    \"benchmark_id\": \"humaneval+\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.8609,\n    \"normalized_score\": 0.8609,\n    
\"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OLMES\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.076877+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.076877+00:00\",\n    \"benchmark_name\": \"HumanEval+\"\n  },\n  {\n    \"model_benchmark_id\": 625,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.7482,\n    \"normalized_score\": 0.7482,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OLMES\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.286600+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.286600+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 507,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.6902,\n    \"normalized_score\": 0.6902,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Not specified\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.054762+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.054762+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    \"model_benchmark_id\": 100,\n    \"benchmark_id\": \"mmlu\",\n    
\"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.6554,\n    \"normalized_score\": 0.6554,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.288937+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.288937+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1803,\n    \"benchmark_id\": \"popqa\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.2617,\n    \"normalized_score\": 0.2617,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.077308+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.077308+00:00\",\n    \"benchmark_name\": \"PopQA\"\n  },\n  {\n    \"model_benchmark_id\": 141,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"granite-3.3-8b-instruct\",\n    \"score\": 0.6686,\n    \"normalized_score\": 0.6686,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.360858+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.360858+00:00\",\n    \"benchmark_name\": 
\"TruthfulQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/ibm/models/granite-3.3-8b-instruct/model.json",
    "content": "{\n  \"model_id\": \"granite-3.3-8b-instruct\",\n  \"name\": \"Granite 3.3 8B Instruct\",\n  \"organization_id\": \"ibm\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Granite 3.3 models feature enhanced reasoning capabilities and support for Fill-in-the-Middle (FIM) code completion. They are built on a foundation of open-source instruction datasets with permissive licenses, alongside internally curated synthetic datasets tailored for long-context problem-solving. These models preserve the key strengths of previous Granite versions, including support for a 128K context length, strong performance in retrieval-augmented generation (RAG) and function calling, and controls for response length and originality. Granite 3.3 also delivers competitive results across general, enterprise, and safety benchmarks. Released as open source, the models are available under the Apache 2.0 license.\",\n  \"release_date\": \"2025-04-16\",\n  \"announcement_date\": \"2025-04-16\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-04-01\",\n  \"param_count\": 8000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.ibm.com/granite/docs/\",\n  \"source_playground\": \"https://www.ibm.com/granite/playground/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n  \"source_repo_link\": \"https://github.com/ibm-granite/granite-3.3-language-models\",\n  \"source_weights_link\": \"https://huggingface.co/ibm-granite/granite-3.3-8b-instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.723958+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.723958+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/ibm/models/granite-4.0-tiny-preview/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1792,\n    \"benchmark_id\": \"alpacaeval-2.0\",\n    \"model_id\": \"granite-4.0-tiny-preview\",\n    \"score\": 0.3516,\n    \"normalized_score\": 0.3516,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-4.0-tiny-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.045290+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.045290+00:00\",\n    \"benchmark_name\": \"AlpacaEval 2.0\"\n  },\n  {\n    \"model_benchmark_id\": 1458,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"granite-4.0-tiny-preview\",\n    \"score\": 0.267,\n    \"normalized_score\": 0.267,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/ibm-granite/granite-4.0-tiny-preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.108397+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.108397+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1805,\n    \"benchmark_id\": \"attaq\",\n    \"model_id\": \"granite-4.0-tiny-preview\",\n    \"score\": 0.861,\n    \"normalized_score\": 0.861,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:15.083480+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.083480+00:00\",\n    \"benchmark_name\": \"AttaQ\"\n  },\n  {\n    \"model_benchmark_id\": 1079,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"granite-4.0-tiny-preview\",\n    \"score\": 0.557,\n    \"normalized_score\": 0.557,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.247228+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.247228+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 953,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"granite-4.0-tiny-preview\",\n    \"score\": 0.462,\n    \"normalized_score\": 0.462,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.009229+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.009229+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 1002,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"granite-4.0-tiny-preview\",\n    \"score\": 0.701,\n    \"normalized_score\": 0.701,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": 
\"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.094422+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.094422+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 796,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"granite-4.0-tiny-preview\",\n    \"score\": 0.824,\n    \"normalized_score\": 0.824,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.663900+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.663900+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1442,\n    \"benchmark_id\": \"humaneval+\",\n    \"model_id\": \"granite-4.0-tiny-preview\",\n    \"score\": 0.783,\n    \"normalized_score\": 0.783,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.074105+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.074105+00:00\",\n    \"benchmark_name\": \"HumanEval+\"\n  },\n  {\n    \"model_benchmark_id\": 624,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"granite-4.0-tiny-preview\",\n    \"score\": 0.63,\n    \"normalized_score\": 0.63,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.285068+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.285068+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 99,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"granite-4.0-tiny-preview\",\n    \"score\": 0.604,\n    \"normalized_score\": 0.604,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.287184+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.287184+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1802,\n    \"benchmark_id\": \"popqa\",\n    \"model_id\": \"granite-4.0-tiny-preview\",\n    \"score\": 0.229,\n    \"normalized_score\": 0.229,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.075622+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.075622+00:00\",\n    \"benchmark_name\": \"PopQA\"\n  },\n  {\n    \"model_benchmark_id\": 140,\n    \"benchmark_id\": 
\"truthfulqa\",\n    \"model_id\": \"granite-4.0-tiny-preview\",\n    \"score\": 0.581,\n    \"normalized_score\": 0.581,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.358910+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.358910+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/ibm/models/granite-4.0-tiny-preview/model.json",
    "content": "{\n  \"model_id\": \"granite-4.0-tiny-preview\",\n  \"name\": \"IBM Granite 4.0 Tiny Preview\",\n  \"organization_id\": \"ibm\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A preliminary version of the smallest model in the upcoming Granite 4.0 family, released May 2025. It utilizes a novel hybrid Mamba-2/Transformer, fine-grained mixture of experts (MoE) architecture (7B total parameters, 1B active at inference). This preview version is partially trained (2.5T tokens) but demonstrates significant memory efficiency and performance potential, validated for at least 128K context length without positional encoding.\",\n  \"release_date\": \"2025-05-02\",\n  \"announcement_date\": \"2025-05-02\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 7000000000,\n  \"training_tokens\": 2500000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.ibm.com/granite/docs/\",\n  \"source_playground\": \"https://www.ibm.com/granite/playground/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/ibm-granite/granite-4.0-tiny-preview\",\n  \"created_at\": \"2025-07-19T19:49:05.720766+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.720766+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/ibm/organization.json",
    "content": "{\n  \"organization_id\": \"ibm\",\n  \"name\": \"IBM\",\n  \"website\": \"https://ibm.com\",\n  \"description\": \"Technology and consulting company\",\n  \"country\": null,\n  \"created_at\": \"2025-07-19T19:49:05.719047+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.719047+00:00\"\n}"
  },
  {
    "path": "data/organizations/meta/models/llama-3.1-405b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1562,\n    \"benchmark_id\": \"api-bank\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.92,\n    \"normalized_score\": 0.92,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.382379+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.382379+00:00\",\n    \"benchmark_name\": \"API-Bank\"\n  },\n  {\n    \"model_benchmark_id\": 16,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.969,\n    \"normalized_score\": 0.969,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.118562+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.118562+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 848,\n    \"benchmark_id\": \"bfcl\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.885,\n    \"normalized_score\": 0.885,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.775431+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.775431+00:00\",\n    \"benchmark_name\": \"BFCL\"\n  },\n  {\n    \"model_benchmark_id\": 950,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.848,\n    \"normalized_score\": 0.848,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2407.21783\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.004517+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.004517+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 1565,\n    \"benchmark_id\": \"gorilla-benchmark-api-bench\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.353,\n    \"normalized_score\": 0.353,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.390263+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.390263+00:00\",\n    \"benchmark_name\": \"Gorilla Benchmark API Bench\"\n  },\n  {\n    \"model_benchmark_id\": 291,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.507,\n    \"normalized_score\": 0.507,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.662460+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.662460+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 988,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.968,\n    \"normalized_score\": 0.968,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"8-shot, CoT, em_maj1@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.071677+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.071677+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 780,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.89,\n    \"normalized_score\": 0.89,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.636480+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.636480+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 616,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.886,\n    \"normalized_score\": 0.886,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.270752+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.270752+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 394,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.738,\n    \"normalized_score\": 0.738,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, CoT, final_em\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.846056+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.846056+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1578,\n    \"benchmark_id\": \"mbpp-evalplus\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.886,\n    \"normalized_score\": 0.886,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, base, pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.428183+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.428183+00:00\",\n    \"benchmark_name\": \"MBPP EvalPlus\"\n  },\n  {\n    \"model_benchmark_id\": 79,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": 
\"llama-3.1-405b-instruct\",\n    \"score\": 0.873,\n    \"normalized_score\": 0.873,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot, macro_avg/acc\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.249582+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.249582+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1548,\n    \"benchmark_id\": \"mmlu-(cot)\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.886,\n    \"normalized_score\": 0.886,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, macro_avg/acc\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.339647+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.339647+00:00\",\n    \"benchmark_name\": \"MMLU (CoT)\"\n  },\n  {\n    \"model_benchmark_id\": 186,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.733,\n    \"normalized_score\": 0.733,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot, CoT, micro_avg/acc_char\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.458814+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.458814+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1572,\n    \"benchmark_id\": \"multilingual-mgsm-(cot)\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.916,\n    \"normalized_score\": 0.916,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, CoT, em\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.409472+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.409472+00:00\",\n    \"benchmark_name\": \"Multilingual MGSM (CoT)\"\n  },\n  {\n    \"model_benchmark_id\": 1552,\n    \"benchmark_id\": \"multipl-e-humaneval\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.752,\n    \"normalized_score\": 0.752,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.352505+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.352505+00:00\",\n    \"benchmark_name\": \"Multipl-E HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1555,\n    \"benchmark_id\": \"multipl-e-mbpp\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.657,\n    \"normalized_score\": 0.657,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, pass@1\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.359473+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.359473+00:00\",\n    \"benchmark_name\": \"Multipl-E MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1568,\n    \"benchmark_id\": \"nexus\",\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"score\": 0.587,\n    \"normalized_score\": 0.587,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, macro_avg/acc\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.398966+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.398966+00:00\",\n    \"benchmark_name\": \"Nexus\"\n  }\n]"
  },
  {
    "path": "data/organizations/meta/models/llama-3.1-405b-instruct/model.json",
    "content": "{\n  \"model_id\": \"llama-3.1-405b-instruct\",\n  \"name\": \"Llama 3.1 405B Instruct\",\n  \"organization_id\": \"meta\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Llama 3.1 405B Instruct is a large language model optimized for multilingual dialogue use cases. It outperforms many available open source and closed chat models on common industry benchmarks. The model supports 8 languages and has a 128K token context length.\",\n  \"release_date\": \"2024-07-23\",\n  \"announcement_date\": \"2024-07-23\",\n  \"license_id\": \"llama_3_1_community_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 405000000000,\n  \"training_tokens\": 15000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://github.com/meta-llama/llama-models\",\n  \"source_playground\": \"https://llama.meta.com/llama-downloads\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://ai.meta.com/blog/meta-llama-3-1/\",\n  \"source_repo_link\": \"https://github.com/meta-llama/llama-models\",\n  \"source_weights_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.585389+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.585389+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/meta/models/llama-3.1-70b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1560,\n    \"benchmark_id\": \"api-bank\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.9,\n    \"normalized_score\": 0.9,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.378301+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.378301+00:00\",\n    \"benchmark_name\": \"API-Bank\"\n  },\n  {\n    \"model_benchmark_id\": 14,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.948,\n    \"normalized_score\": 0.948,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.113697+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.113697+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 846,\n    \"benchmark_id\": \"bfcl\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.848,\n    \"normalized_score\": 0.848,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.771784+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.771784+00:00\",\n    \"benchmark_name\": \"BFCL\"\n  },\n  {\n    \"model_benchmark_id\": 948,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.796,\n    \"normalized_score\": 0.796,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2407.21783\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.001514+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.001514+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 1563,\n    \"benchmark_id\": \"gorilla-benchmark-api-bench\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.297,\n    \"normalized_score\": 0.297,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.386457+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.386457+00:00\",\n    \"benchmark_name\": \"Gorilla Benchmark API Bench\"\n  },\n  {\n    \"model_benchmark_id\": 288,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.417,\n    \"normalized_score\": 0.417,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.657221+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.657221+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1556,\n    \"benchmark_id\": \"gsm-8k-(cot)\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.951,\n    \"normalized_score\": 0.951,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"8-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.362878+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.362878+00:00\",\n    \"benchmark_name\": \"GSM-8K (CoT)\"\n  },\n  {\n    \"model_benchmark_id\": 778,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.805,\n    \"normalized_score\": 0.805,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.632931+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.632931+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 614,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.875,\n    \"normalized_score\": 0.875,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.266791+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.266791+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1558,\n    \"benchmark_id\": \"math-(cot)\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.68,\n    \"normalized_score\": 0.68,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.371489+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.371489+00:00\",\n    \"benchmark_name\": \"MATH (CoT)\"\n  },\n  {\n    \"model_benchmark_id\": 1549,\n    \"benchmark_id\": \"mbpp-++-base-version\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.86,\n    \"normalized_score\": 0.86,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.344061+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.344061+00:00\",\n    \"benchmark_name\": \"MBPP ++ base version\"\n  },\n  {\n    \"model_benchmark_id\": 76,\n    \"benchmark_id\": \"mmlu\",\n    
\"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.836,\n    \"normalized_score\": 0.836,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.243294+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.243294+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1546,\n    \"benchmark_id\": \"mmlu-(cot)\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.86,\n    \"normalized_score\": 0.86,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.334507+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.334507+00:00\",\n    \"benchmark_name\": \"MMLU (CoT)\"\n  },\n  {\n    \"model_benchmark_id\": 184,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.664,\n    \"normalized_score\": 0.664,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.455089+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.455089+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1570,\n    \"benchmark_id\": \"multilingual-mgsm-(cot)\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.869,\n    \"normalized_score\": 0.869,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain-of-Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.405488+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.405488+00:00\",\n    \"benchmark_name\": \"Multilingual MGSM (CoT)\"\n  },\n  {\n    \"model_benchmark_id\": 1550,\n    \"benchmark_id\": \"multipl-e-humaneval\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.655,\n    \"normalized_score\": 0.655,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.347431+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.347431+00:00\",\n    \"benchmark_name\": \"Multipl-E HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1553,\n    \"benchmark_id\": \"multipl-e-mbpp\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.62,\n    \"normalized_score\": 0.62,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.356043+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.356043+00:00\",\n    \"benchmark_name\": \"Multipl-E MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1566,\n    \"benchmark_id\": \"nexus\",\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"score\": 0.567,\n    \"normalized_score\": 0.567,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.394299+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.394299+00:00\",\n    \"benchmark_name\": \"Nexus\"\n  }\n]"
  },
  {
    "path": "data/organizations/meta/models/llama-3.1-70b-instruct/model.json",
    "content": "{\n  \"model_id\": \"llama-3.1-70b-instruct\",\n  \"name\": \"Llama 3.1 70B Instruct\",\n  \"organization_id\": \"meta\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Llama 3.1 70B Instruct is a large language model optimized for multilingual dialogue use cases. It outperforms many available open source and closed chat models on common industry benchmarks.\",\n  \"release_date\": \"2024-07-23\",\n  \"announcement_date\": \"2024-07-23\",\n  \"license_id\": \"llama_3_1_community_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 70000000000,\n  \"training_tokens\": 15000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://ai.meta.com/llama/\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://ai.meta.com/research/publications/llama-3-open-foundation-and-fine-tuned-chat-models/\",\n  \"source_scorecard_blog_link\": \"https://ai.meta.com/blog/meta-llama-3-1/\",\n  \"source_repo_link\": \"https://github.com/meta-llama/llama-models\",\n  \"source_weights_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.575761+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.575761+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/meta/models/llama-3.1-8b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1561,\n    \"benchmark_id\": \"api-bank\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.826,\n    \"normalized_score\": 0.826,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.380088+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.380088+00:00\",\n    \"benchmark_name\": \"API-Bank\"\n  },\n  {\n    \"model_benchmark_id\": 15,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.834,\n    \"normalized_score\": 0.834,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.115810+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.115810+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 847,\n    \"benchmark_id\": \"bfcl\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.761,\n    \"normalized_score\": 0.761,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.773659+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.773659+00:00\",\n    \"benchmark_name\": \"BFCL\"\n  },\n  {\n    \"model_benchmark_id\": 949,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.595,\n    \"normalized_score\": 0.595,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2407.21783\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.003032+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.003032+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 1564,\n    \"benchmark_id\": \"gorilla-benchmark-api-bench\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.082,\n    \"normalized_score\": 0.082,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.388429+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.388429+00:00\",\n    \"benchmark_name\": \"Gorilla Benchmark API Bench\"\n  },\n  {\n    \"model_benchmark_id\": 290,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.304,\n    \"normalized_score\": 0.304,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": 
null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.660952+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.660952+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1557,\n    \"benchmark_id\": \"gsm-8k-(cot)\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.845,\n    \"normalized_score\": 0.845,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"8-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.364382+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.364382+00:00\",\n    \"benchmark_name\": \"GSM-8K (CoT)\"\n  },\n  {\n    \"model_benchmark_id\": 779,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.726,\n    \"normalized_score\": 0.726,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.634981+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.634981+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 615,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.804,\n    \"normalized_score\": 0.804,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"unspecified\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.268709+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.268709+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1559,\n    \"benchmark_id\": \"math-(cot)\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.519,\n    \"normalized_score\": 0.519,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.373274+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.373274+00:00\",\n    \"benchmark_name\": \"MATH (CoT)\"\n  },\n  {\n    \"model_benchmark_id\": 1577,\n    \"benchmark_id\": \"mbpp-evalplus-(base)\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.728,\n    \"normalized_score\": 0.728,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.424442+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.424442+00:00\",\n    \"benchmark_name\": \"MBPP EvalPlus (base)\"\n  },\n  {\n    \"model_benchmark_id\": 78,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.694,\n    \"normalized_score\": 0.694,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.247675+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.247675+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1547,\n    \"benchmark_id\": \"mmlu-(cot)\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.73,\n    \"normalized_score\": 0.73,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.337443+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.337443+00:00\",\n    \"benchmark_name\": \"MMLU (CoT)\"\n  },\n  {\n    \"model_benchmark_id\": 185,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.483,\n    \"normalized_score\": 0.483,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.457212+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.457212+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1571,\n    \"benchmark_id\": \"multilingual-mgsm-(cot)\",\n    \"model_id\": 
\"llama-3.1-8b-instruct\",\n    \"score\": 0.689,\n    \"normalized_score\": 0.689,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.407707+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.407707+00:00\",\n    \"benchmark_name\": \"Multilingual MGSM (CoT)\"\n  },\n  {\n    \"model_benchmark_id\": 1551,\n    \"benchmark_id\": \"multipl-e-humaneval\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.508,\n    \"normalized_score\": 0.508,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.350301+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.350301+00:00\",\n    \"benchmark_name\": \"Multipl-E HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1554,\n    \"benchmark_id\": \"multipl-e-mbpp\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.524,\n    \"normalized_score\": 0.524,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.357886+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:14.357886+00:00\",\n    \"benchmark_name\": \"Multipl-E MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1567,\n    \"benchmark_id\": \"nexus\",\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"score\": 0.385,\n    \"normalized_score\": 0.385,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.396611+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.396611+00:00\",\n    \"benchmark_name\": \"Nexus\"\n  }\n]"
  },
  {
    "path": "data/organizations/meta/models/llama-3.1-8b-instruct/model.json",
    "content": "{\n  \"model_id\": \"llama-3.1-8b-instruct\",\n  \"name\": \"Llama 3.1 8B Instruct\",\n  \"organization_id\": \"meta\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Llama 3.1 8B Instruct is a multilingual large language model optimized for dialogue use cases. It features a 128K context length, state-of-the-art tool use, and strong reasoning capabilities.\",\n  \"release_date\": \"2024-07-23\",\n  \"announcement_date\": \"2024-07-23\",\n  \"license_id\": \"llama_3_1_community_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2023-12-31\",\n  \"param_count\": 8000000000,\n  \"training_tokens\": 15000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.llama.com/\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://ai.meta.com/blog/meta-llama-3-1/\",\n  \"source_repo_link\": \"https://github.com/meta-llama/llama-models\",\n  \"source_weights_link\": \"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.582878+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.582878+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/meta/models/llama-3.2-11b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1253,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"score\": 0.911,\n    \"normalized_score\": 0.911,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Test accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.631448+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.631448+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 861,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"score\": 0.834,\n    \"normalized_score\": 0.834,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Test, 0-shot CoT relaxed accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.801741+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.801741+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 883,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"score\": 0.884,\n    \"normalized_score\": 0.884,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Test ANLS\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:12.839416+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.839416+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 292,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"score\": 0.328,\n    \"normalized_score\": 0.328,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.663962+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.663962+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 395,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"score\": 0.519,\n    \"normalized_score\": 0.519,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.847598+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.847598+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 523,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"score\": 0.515,\n    \"normalized_score\": 0.515,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Test accuracy\",\n    \"verification_provider_id\": 
null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.086640+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.086640+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1284,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"score\": 0.689,\n    \"normalized_score\": 0.689,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.690958+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.690958+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 80,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"score\": 0.73,\n    \"normalized_score\": 0.73,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Macro average accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.251362+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.251362+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 566,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"score\": 0.507,\n    \"normalized_score\": 0.507,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Val, 0-shot CoT, micro avg accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.164872+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.164872+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1530,\n    \"benchmark_id\": \"mmmu-pro\",\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"score\": 0.33,\n    \"normalized_score\": 0.33,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Test accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.288730+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.288730+00:00\",\n    \"benchmark_name\": \"MMMU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1580,\n    \"benchmark_id\": \"vqav2-(test)\",\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"score\": 0.752,\n    \"normalized_score\": 0.752,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.434081+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.434081+00:00\",\n    \"benchmark_name\": \"VQAv2 (test)\"\n  }\n]"
  },
  {
    "path": "data/organizations/meta/models/llama-3.2-11b-instruct/model.json",
    "content": "{\n  \"model_id\": \"llama-3.2-11b-instruct\",\n  \"name\": \"Llama 3.2 11B Instruct\",\n  \"organization_id\": \"meta\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Llama 3.2 11B Vision Instruct is an instruction-tuned multimodal large language model optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. It accepts text and images as input and generates text as output.\",\n  \"release_date\": \"2024-09-25\",\n  \"announcement_date\": \"2024-09-25\",\n  \"license_id\": \"llama_3_2_community_license\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2023-12-31\",\n  \"param_count\": 10600000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n  \"source_repo_link\": \"https://github.com/facebookresearch/llama\",\n  \"source_weights_link\": \"https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.588479+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.588479+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/meta/models/llama-3.2-3b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 17,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.786,\n    \"normalized_score\": 0.786,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, acc\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.120164+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.120164+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1583,\n    \"benchmark_id\": \"bfcl-v2\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.67,\n    \"normalized_score\": 0.67,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, acc\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.446368+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.446368+00:00\",\n    \"benchmark_name\": \"BFCL v2\"\n  },\n  {\n    \"model_benchmark_id\": 293,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.328,\n    \"normalized_score\": 0.328,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, acc\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:11.665423+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.665423+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 989,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.777,\n    \"normalized_score\": 0.777,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"8-shot, em_maj1@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.073210+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.073210+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 44,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.698,\n    \"normalized_score\": 0.698,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, acc\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.175473+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.175473+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 617,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.774,\n    \"normalized_score\": 0.774,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg(Prompt/Instruction acc Loose/Strict)\",\n    \"verification_provider_id\": null,\n   
 \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.272319+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.272319+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1589,\n    \"benchmark_id\": \"infinitebench-en.mc\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.633,\n    \"normalized_score\": 0.633,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, longbook_choice/acc\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.464298+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.464298+00:00\",\n    \"benchmark_name\": \"InfiniteBench/En.MC\"\n  },\n  {\n    \"model_benchmark_id\": 1588,\n    \"benchmark_id\": \"infinitebench-en.qa\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.198,\n    \"normalized_score\": 0.198,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, longbook_qa/f1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.460560+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.460560+00:00\",\n    \"benchmark_name\": \"InfiniteBench/En.QA\"\n  },\n  {\n    \"model_benchmark_id\": 396,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.48,\n    \"normalized_score\": 0.48,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, final_em\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.849582+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.849582+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1285,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.582,\n    \"normalized_score\": 0.582,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"CoT, em\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.692573+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.692573+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 81,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.634,\n    \"normalized_score\": 0.634,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot, macro_avg/acc\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.252797+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.252797+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1569,\n    \"benchmark_id\": \"nexus\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.343,\n    
\"normalized_score\": 0.343,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, macro_avg/acc\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.401027+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.401027+00:00\",\n    \"benchmark_name\": \"Nexus\"\n  },\n  {\n    \"model_benchmark_id\": 1590,\n    \"benchmark_id\": \"nih-multi-needle\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.847,\n    \"normalized_score\": 0.847,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, recall\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.469424+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.469424+00:00\",\n    \"benchmark_name\": \"NIH/Multi-needle\"\n  },\n  {\n    \"model_benchmark_id\": 1581,\n    \"benchmark_id\": \"open-rewrite\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.401,\n    \"normalized_score\": 0.401,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, micro_avg/rougeL\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.438526+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.438526+00:00\",\n    \"benchmark_name\": \"Open-rewrite\"\n  
},\n  {\n    \"model_benchmark_id\": 1582,\n    \"benchmark_id\": \"tldr9+-(test)\",\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"score\": 0.19,\n    \"normalized_score\": 0.19,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"1-shot, rougeL\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.443142+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.443142+00:00\",\n    \"benchmark_name\": \"TLDR9+ (test)\"\n  }\n]"
  },
  {
    "path": "data/organizations/meta/models/llama-3.2-3b-instruct/model.json",
    "content": "{\n  \"model_id\": \"llama-3.2-3b-instruct\",\n  \"name\": \"Llama 3.2 3B Instruct\",\n  \"organization_id\": \"meta\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Llama 3.2 3B Instruct is a large language model that supports a context length of 128K tokens and are state-of-the-art in their class for on-device use cases like summarization, instruction following, and rewriting tasks running locally at the edge.\",\n  \"release_date\": \"2024-09-25\",\n  \"announcement_date\": \"2024-09-25\",\n  \"license_id\": \"llama_3_2_community_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 3210000000,\n  \"training_tokens\": 9000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://github.com/meta-llama/llama-models\",\n  \"source_playground\": \"https://llama.meta.com/llama-downloads\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n  \"source_repo_link\": \"https://github.com/meta-llama/llama-models\",\n  \"source_weights_link\": \"https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.591372+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.591372+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/meta/models/llama-3.2-90b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1252,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"score\": 0.923,\n    \"normalized_score\": 0.923,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.629735+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.629735+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 860,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"score\": 0.855,\n    \"normalized_score\": 0.855,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.799861+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.799861+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 882,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"score\": 0.901,\n    \"normalized_score\": 0.901,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.837654+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.837654+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 289,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"score\": 0.467,\n    \"normalized_score\": 0.467,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.659193+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.659193+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1576,\n    \"benchmark_id\": \"infographicsqa\",\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"score\": 0.568,\n    \"normalized_score\": 0.568,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.420214+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.420214+00:00\",\n    \"benchmark_name\": \"InfographicsQA\"\n  },\n  {\n    \"model_benchmark_id\": 393,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"score\": 0.68,\n    \"normalized_score\": 0.68,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.844378+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.844378+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 522,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"score\": 0.573,\n    \"normalized_score\": 0.573,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.084321+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.084321+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1283,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"score\": 0.869,\n    \"normalized_score\": 0.869,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.688987+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.688987+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 77,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"score\": 0.86,\n    \"normalized_score\": 0.86,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.245688+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.245688+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 565,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"score\": 0.603,\n    \"normalized_score\": 0.603,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.162828+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.162828+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1529,\n    \"benchmark_id\": \"mmmu-pro\",\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"score\": 0.452,\n    \"normalized_score\": 0.452,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.287214+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.287214+00:00\",\n    \"benchmark_name\": \"MMMU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 908,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": 
\"llama-3.2-90b-instruct\",\n    \"score\": 0.735,\n    \"normalized_score\": 0.735,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.892927+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.892927+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1573,\n    \"benchmark_id\": \"vqav2\",\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"score\": 0.781,\n    \"normalized_score\": 0.781,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.412800+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.412800+00:00\",\n    \"benchmark_name\": \"VQAv2\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/meta/models/llama-3.2-90b-instruct/model.json",
    "content": "{\n  \"model_id\": \"llama-3.2-90b-instruct\",\n  \"name\": \"Llama 3.2 90B Instruct\",\n  \"organization_id\": \"meta\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Llama 3.2 90B is a large multimodal language model optimized for visual recognition, image reasoning, and captioning tasks. It supports a context length of 128,000 tokens and is designed for deployment on edge and mobile devices, offering state-of-the-art performance in image understanding and generative tasks.\",\n  \"release_date\": \"2024-09-25\",\n  \"announcement_date\": \"2024-09-25\",\n  \"license_id\": \"llama3_2\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 90000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/\",\n  \"source_repo_link\": \"https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct\",\n  \"source_weights_link\": \"https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.579590+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.579590+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/meta/models/llama-3.3-70b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1584,\n    \"benchmark_id\": \"bfcl-v2\",\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"score\": 0.773,\n    \"normalized_score\": 0.773,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.448863+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.448863+00:00\",\n    \"benchmark_name\": \"BFCL v2\"\n  },\n  {\n    \"model_benchmark_id\": 296,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"score\": 0.505,\n    \"normalized_score\": 0.505,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.669923+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.669923+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 781,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"score\": 0.884,\n    \"normalized_score\": 0.884,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.637990+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.637990+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 618,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"score\": 0.921,\n    \"normalized_score\": 0.921,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.274109+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.274109+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 399,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"score\": 0.77,\n    \"normalized_score\": 0.77,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.854268+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.854268+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1579,\n    \"benchmark_id\": \"mbpp-evalplus\",\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"score\": 0.876,\n    \"normalized_score\": 0.876,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.429699+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.429699+00:00\",\n    \"benchmark_name\": \"MBPP EvalPlus\"\n  },\n  {\n    \"model_benchmark_id\": 1288,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"score\": 0.911,\n    \"normalized_score\": 0.911,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.697414+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.697414+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 84,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"score\": 0.86,\n    \"normalized_score\": 0.86,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.259963+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.259963+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 189,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"score\": 0.689,\n    \"normalized_score\": 0.689,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.463251+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.463251+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/meta/models/llama-3.3-70b-instruct/model.json",
    "content": "{\n  \"model_id\": \"llama-3.3-70b-instruct\",\n  \"name\": \"Llama 3.3 70B Instruct\",\n  \"organization_id\": \"meta\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Llama 3.3 is a multilingual large language model optimized for dialogue use cases across multiple languages. It is a pretrained and instruction-tuned generative model with 70 billion parameters, outperforming many open-source and closed chat models on common industry benchmarks. Llama 3.3 supports a context length of 128,000 tokens and is designed for commercial and research use in multiple languages.\",\n  \"release_date\": \"2024-12-06\",\n  \"announcement_date\": \"2024-12-06\",\n  \"license_id\": \"llama_3_3_community_license_agreement\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 70000000000,\n  \"training_tokens\": 15000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct\",\n  \"source_playground\": \"https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md\",\n  \"source_weights_link\": \"https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.603412+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.603412+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/meta/models/llama-4-maverick/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 862,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.9,\n    \"normalized_score\": 0.9,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.803334+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.803334+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 884,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.944,\n    \"normalized_score\": 0.944,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.841331+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.841331+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 294,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.698,\n    \"normalized_score\": 0.698,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.666983+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:11.666983+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1115,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.434,\n    \"normalized_score\": 0.434,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.326624+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.326624+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 397,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.612,\n    \"normalized_score\": 0.612,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"4-shot em_maj1@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.851038+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.851038+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 524,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.737,\n    \"normalized_score\": 0.737,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": 
null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.088308+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.088308+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1179,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.776,\n    \"normalized_score\": 0.776,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.485323+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.485323+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1286,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.923,\n    \"normalized_score\": 0.923,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.694238+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.694238+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 82,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.855,\n    \"normalized_score\": 0.855,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot 
macro_avg/acc_char\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.254352+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.254352+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 187,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.805,\n    \"normalized_score\": 0.805,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.460210+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.460210+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 567,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.734,\n    \"normalized_score\": 0.734,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.167124+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.167124+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1531,\n    \"benchmark_id\": \"mmmu-pro\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.596,\n    \"normalized_score\": 0.596,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.290598+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.290598+00:00\",\n    \"benchmark_name\": \"MMMU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1591,\n    \"benchmark_id\": \"tydiqa\",\n    \"model_id\": \"llama-4-maverick\",\n    \"score\": 0.317,\n    \"normalized_score\": 0.317,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"1-shot average/f1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.475429+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.475429+00:00\",\n    \"benchmark_name\": \"TydiQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/meta/models/llama-4-maverick/model.json",
    "content": "{\n  \"model_id\": \"llama-4-maverick\",\n  \"name\": \"Llama 4 Maverick\",\n  \"organization_id\": \"meta\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Llama 4 Maverick is a natively multimodal model capable of processing both text and images. It features a 17 billion active parameter mixture-of-experts (MoE) architecture with 128 experts, supporting a wide range of multimodal tasks such as conversational interaction, image analysis, and code generation. The model includes a 1 million token context window.\",\n  \"release_date\": \"2025-04-05\",\n  \"announcement_date\": \"2025-04-05\",\n  \"license_id\": \"llama_4_community_license_agreement\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 400000000000,\n  \"training_tokens\": 22000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n  \"source_playground\": \"https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/meta-llama/llama-models/tree/main/models/llama4\",\n  \"source_weights_link\": \"https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.595636+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.595636+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/meta/models/llama-4-scout/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 863,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"llama-4-scout\",\n    \"score\": 0.888,\n    \"normalized_score\": 0.888,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.804916+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.804916+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 885,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"llama-4-scout\",\n    \"score\": 0.944,\n    \"normalized_score\": 0.944,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot (ANLS)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.842838+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.842838+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 295,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"llama-4-scout\",\n    \"score\": 0.572,\n    \"normalized_score\": 0.572,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot (accuracy)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:11.668436+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.668436+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1116,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"llama-4-scout\",\n    \"score\": 0.328,\n    \"normalized_score\": 0.328,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.328074+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.328074+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 398,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"llama-4-scout\",\n    \"score\": 0.503,\n    \"normalized_score\": 0.503,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"4-shot em_maj1@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.852669+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.852669+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 525,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"llama-4-scout\",\n    \"score\": 0.707,\n    \"normalized_score\": 0.707,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": 
null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.089981+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.089981+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1180,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"llama-4-scout\",\n    \"score\": 0.678,\n    \"normalized_score\": 0.678,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.487376+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.487376+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1287,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"llama-4-scout\",\n    \"score\": 0.906,\n    \"normalized_score\": 0.906,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot (average/em)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.695659+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.695659+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 83,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"llama-4-scout\",\n    \"score\": 0.796,\n    \"normalized_score\": 0.796,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"5-shot macro_avg/acc_char\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.258246+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.258246+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 188,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"llama-4-scout\",\n    \"score\": 0.743,\n    \"normalized_score\": 0.743,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot (macro_avg/acc)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.461726+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.461726+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 568,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"llama-4-scout\",\n    \"score\": 0.694,\n    \"normalized_score\": 0.694,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.169227+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.169227+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1592,\n    \"benchmark_id\": \"tydiqa\",\n    \"model_id\": \"llama-4-scout\",\n    \"score\": 0.315,\n    \"normalized_score\": 0.315,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"1-shot average/f1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.477364+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.477364+00:00\",\n    \"benchmark_name\": \"TydiQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/meta/models/llama-4-scout/model.json",
    "content": "{\n  \"model_id\": \"llama-4-scout\",\n  \"name\": \"Llama 4 Scout\",\n  \"organization_id\": \"meta\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Llama 4 Scout is a natively multimodal model capable of processing both text and images. It features a 17 billion activated parameter (109B total) mixture-of-experts (MoE) architecture with 16 experts, supporting a wide range of multimodal tasks such as conversational interaction, image analysis, and code generation. The model includes a 10 million token context window.\",\n  \"release_date\": \"2025-04-05\",\n  \"announcement_date\": \"2025-04-05\",\n  \"license_id\": \"llama_4_community_license_agreement\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 109000000000,\n  \"training_tokens\": 40000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://ai.meta.com/blog/llama-4-multimodal-intelligence/\",\n  \"source_playground\": \"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/meta-llama/llama-models/tree/main/models/llama4\",\n  \"source_weights_link\": \"https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.599841+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.599841+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/meta/organization.json",
    "content": "{\n  \"organization_id\": \"meta\",\n  \"name\": \"Meta\",\n  \"website\": \"https://meta.com\",\n  \"description\": \"Social media company with AI research\",\n  \"country\": \"US\",\n  \"created_at\": \"2025-07-19T19:49:05.572641+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.572641+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/microsoft/models/phi-3.5-mini-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 13,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.846,\n    \"normalized_score\": 0.846,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.111398+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.111398+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1448,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.37,\n    \"normalized_score\": 0.37,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.088299+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.088299+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1078,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.69,\n    \"normalized_score\": 0.69,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.245591+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.245591+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1025,\n    \"benchmark_id\": \"boolq\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.78,\n    \"normalized_score\": 0.78,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"2-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.132882+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.132882+00:00\",\n    \"benchmark_name\": \"BoolQ\"\n  },\n  {\n    \"model_benchmark_id\": 1504,\n    \"benchmark_id\": \"govreport\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.259,\n    \"normalized_score\": 0.259,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.222697+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.222697+00:00\",\n    \"benchmark_name\": \"GovReport\"\n  },\n  {\n    \"model_benchmark_id\": 285,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.304,\n    \"normalized_score\": 0.304,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.651230+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.651230+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 987,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.862,\n    \"normalized_score\": 0.862,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"8-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.070240+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.070240+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 43,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.694,\n    \"normalized_score\": 0.694,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.173447+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.173447+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 777,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.628,\n    \"normalized_score\": 0.628,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.631199+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.631199+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 392,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.485,\n    \"normalized_score\": 0.485,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.842901+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.842901+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1178,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.696,\n    \"normalized_score\": 0.696,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.481045+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.481045+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1494,\n    \"benchmark_id\": \"mega-mlqa\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.617,\n    \"normalized_score\": 0.617,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.191909+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.191909+00:00\",\n    \"benchmark_name\": \"MEGA MLQA\"\n  },\n  {\n    \"model_benchmark_id\": 1496,\n    \"benchmark_id\": \"mega-tydi-qa\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.622,\n    \"normalized_score\": 0.622,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.197084+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.197084+00:00\",\n    \"benchmark_name\": \"MEGA TyDi QA\"\n  },\n  {\n    \"model_benchmark_id\": 1498,\n    \"benchmark_id\": \"mega-udpos\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.465,\n    \"normalized_score\": 0.465,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.203616+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.203616+00:00\",\n    \"benchmark_name\": \"MEGA UDPOS\"\n  },\n  {\n    \"model_benchmark_id\": 1500,\n    \"benchmark_id\": \"mega-xcopa\",\n    \"model_id\": 
\"phi-3.5-mini-instruct\",\n    \"score\": 0.631,\n    \"normalized_score\": 0.631,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.210364+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.210364+00:00\",\n    \"benchmark_name\": \"MEGA XCOPA\"\n  },\n  {\n    \"model_benchmark_id\": 1502,\n    \"benchmark_id\": \"mega-xstorycloze\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.735,\n    \"normalized_score\": 0.735,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.217597+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.217597+00:00\",\n    \"benchmark_name\": \"MEGA XStoryCloze\"\n  },\n  {\n    \"model_benchmark_id\": 1282,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.479,\n    \"normalized_score\": 0.479,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.687534+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:13.687534+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 75,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.69,\n    \"normalized_score\": 0.69,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.240966+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.240966+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 180,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.474,\n    \"normalized_score\": 0.474,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.447960+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.450171+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1476,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.554,\n    \"normalized_score\": 0.554,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.148935+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.148935+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1471,\n    \"benchmark_id\": \"openbookqa\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.792,\n    \"normalized_score\": 0.792,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.136354+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.136354+00:00\",\n    \"benchmark_name\": \"OpenBookQA\"\n  },\n  {\n    \"model_benchmark_id\": 1034,\n    \"benchmark_id\": \"piqa\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.81,\n    \"normalized_score\": 0.81,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.154444+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.154444+00:00\",\n    \"benchmark_name\": \"PIQA\"\n  },\n  {\n    \"model_benchmark_id\": 1488,\n    \"benchmark_id\": \"qasper\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.419,\n    \"normalized_score\": 0.419,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.173290+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.173290+00:00\",\n    \"benchmark_name\": \"Qasper\"\n  },\n  {\n    \"model_benchmark_id\": 1506,\n    \"benchmark_id\": \"qmsum\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.213,\n    \"normalized_score\": 0.213,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.228389+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.228389+00:00\",\n    \"benchmark_name\": \"QMSum\"\n  },\n  {\n    \"model_benchmark_id\": 1492,\n    \"benchmark_id\": \"repoqa\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.77,\n    \"normalized_score\": 0.77,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"average\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.186426+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.186426+00:00\",\n    \"benchmark_name\": \"RepoQA\"\n  },\n  {\n    \"model_benchmark_id\": 1490,\n    \"benchmark_id\": \"ruler\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.841,\n    \"normalized_score\": 0.841,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"128k\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.179307+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.179307+00:00\",\n    \"benchmark_name\": \"RULER\"\n  },\n  {\n    \"model_benchmark_id\": 1043,\n    \"benchmark_id\": \"social-iqa\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.747,\n    \"normalized_score\": 0.747,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.177860+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.177860+00:00\",\n    \"benchmark_name\": \"Social IQa\"\n  },\n  {\n    \"model_benchmark_id\": 825,\n    \"benchmark_id\": \"squality\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.243,\n    \"normalized_score\": 0.243,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.722570+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.722570+00:00\",\n    \"benchmark_name\": \"SQuALITY\"\n  },\n  {\n    \"model_benchmark_id\": 1508,\n    \"benchmark_id\": \"summscreenfd\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.16,\n   
 \"normalized_score\": 0.16,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.234498+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.234498+00:00\",\n    \"benchmark_name\": \"SummScreenFD\"\n  },\n  {\n    \"model_benchmark_id\": 134,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.64,\n    \"normalized_score\": 0.64,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.346508+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.346508+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 1063,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"score\": 0.685,\n    \"normalized_score\": 0.685,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.217697+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.217697+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/microsoft/models/phi-3.5-mini-instruct/model.json",
    "content": "{\n  \"model_id\": \"phi-3.5-mini-instruct\",\n  \"name\": \"Phi-3.5-mini-instruct\",\n  \"organization_id\": \"microsoft\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Phi-3.5-mini-instruct is a 3.8B-parameter model that supports up to 128K context tokens, with improved multilingual capabilities across over 20 languages. It underwent additional training and safety post-training to enhance instruction-following, reasoning, math, and code generation. Ideal for environments with memory or latency constraints, it uses an MIT license.\",\n  \"release_date\": \"2024-08-23\",\n  \"announcement_date\": \"2024-08-23\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 3800000000,\n  \"training_tokens\": 3400000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/abs/2404.14219\",\n  \"source_scorecard_blog_link\": \"https://techcommunity.microsoft.com/blog/azure-ai-services-blog/discover-the-new-multi-lingual-high-quality-phi-3-5-slms/4225280\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/microsoft/Phi-3.5-mini-instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.559796+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.559796+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/microsoft/models/phi-3.5-moe-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 12,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.91,\n    \"normalized_score\": 0.91,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.108027+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.108027+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1447,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.379,\n    \"normalized_score\": 0.379,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.086453+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.086453+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1077,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.791,\n    \"normalized_score\": 0.791,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.244054+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.244054+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1024,\n    \"benchmark_id\": \"boolq\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.846,\n    \"normalized_score\": 0.846,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"2-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.130867+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.130867+00:00\",\n    \"benchmark_name\": \"BoolQ\"\n  },\n  {\n    \"model_benchmark_id\": 1503,\n    \"benchmark_id\": \"govreport\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.264,\n    \"normalized_score\": 0.264,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.221191+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.221191+00:00\",\n    \"benchmark_name\": \"GovReport\"\n  },\n  {\n    \"model_benchmark_id\": 284,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.368,\n    \"normalized_score\": 0.368,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.649286+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.649286+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 986,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.887,\n    \"normalized_score\": 0.887,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"8-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.068601+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.068601+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 42,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.838,\n    \"normalized_score\": 0.838,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.171621+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.171621+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 776,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.707,\n    \"normalized_score\": 0.707,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.629465+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.629465+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 391,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.595,\n    \"normalized_score\": 0.595,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.841295+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.841295+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1177,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.808,\n    \"normalized_score\": 0.808,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.479387+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.479387+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1493,\n    \"benchmark_id\": \"mega-mlqa\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.653,\n    \"normalized_score\": 0.653,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.190086+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.190086+00:00\",\n    \"benchmark_name\": \"MEGA MLQA\"\n  },\n  {\n    \"model_benchmark_id\": 1495,\n    \"benchmark_id\": \"mega-tydi-qa\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.671,\n    \"normalized_score\": 0.671,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.195123+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.195123+00:00\",\n    \"benchmark_name\": \"MEGA TyDi QA\"\n  },\n  {\n    \"model_benchmark_id\": 1497,\n    \"benchmark_id\": \"mega-udpos\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.604,\n    \"normalized_score\": 0.604,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.201497+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.201497+00:00\",\n    \"benchmark_name\": \"MEGA UDPOS\"\n  },\n  {\n    \"model_benchmark_id\": 1499,\n    \"benchmark_id\": \"mega-xcopa\",\n    \"model_id\": 
\"phi-3.5-moe-instruct\",\n    \"score\": 0.766,\n    \"normalized_score\": 0.766,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.208476+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.208476+00:00\",\n    \"benchmark_name\": \"MEGA XCOPA\"\n  },\n  {\n    \"model_benchmark_id\": 1501,\n    \"benchmark_id\": \"mega-xstorycloze\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.828,\n    \"normalized_score\": 0.828,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.214764+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.214764+00:00\",\n    \"benchmark_name\": \"MEGA XStoryCloze\"\n  },\n  {\n    \"model_benchmark_id\": 1281,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.587,\n    \"normalized_score\": 0.587,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot chain-of-thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.686017+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.686017+00:00\",\n   
 \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 74,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.789,\n    \"normalized_score\": 0.789,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.239087+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.239087+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 178,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.453,\n    \"normalized_score\": 0.453,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.444580+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.446076+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1475,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.699,\n    \"normalized_score\": 0.699,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:14.147234+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.147234+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1470,\n    \"benchmark_id\": \"openbookqa\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.896,\n    \"normalized_score\": 0.896,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.134275+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.134275+00:00\",\n    \"benchmark_name\": \"OpenBookQA\"\n  },\n  {\n    \"model_benchmark_id\": 1033,\n    \"benchmark_id\": \"piqa\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.886,\n    \"normalized_score\": 0.886,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.152199+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.152199+00:00\",\n    \"benchmark_name\": \"PIQA\"\n  },\n  {\n    \"model_benchmark_id\": 1487,\n    \"benchmark_id\": \"qasper\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.4,\n    \"normalized_score\": 0.4,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.171579+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.171579+00:00\",\n    \"benchmark_name\": \"Qasper\"\n  },\n  {\n    \"model_benchmark_id\": 1505,\n    \"benchmark_id\": \"qmsum\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.199,\n    \"normalized_score\": 0.199,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.226358+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.226358+00:00\",\n    \"benchmark_name\": \"QMSum\"\n  },\n  {\n    \"model_benchmark_id\": 1491,\n    \"benchmark_id\": \"repoqa\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.85,\n    \"normalized_score\": 0.85,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"average\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.184432+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.184432+00:00\",\n    \"benchmark_name\": \"RepoQA\"\n  },\n  {\n    \"model_benchmark_id\": 1489,\n    \"benchmark_id\": \"ruler\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.871,\n    \"normalized_score\": 0.871,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"long context 
(128K) evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.177557+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.177557+00:00\",\n    \"benchmark_name\": \"RULER\"\n  },\n  {\n    \"model_benchmark_id\": 1042,\n    \"benchmark_id\": \"social-iqa\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.78,\n    \"normalized_score\": 0.78,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.176106+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.176106+00:00\",\n    \"benchmark_name\": \"Social IQa\"\n  },\n  {\n    \"model_benchmark_id\": 824,\n    \"benchmark_id\": \"squality\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.241,\n    \"normalized_score\": 0.241,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.720914+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.720914+00:00\",\n    \"benchmark_name\": \"SQuALITY\"\n  },\n  {\n    \"model_benchmark_id\": 1507,\n    \"benchmark_id\": \"summscreenfd\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.169,\n    \"normalized_score\": 0.169,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.232655+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.232655+00:00\",\n    \"benchmark_name\": \"SummScreenFD\"\n  },\n  {\n    \"model_benchmark_id\": 133,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.775,\n    \"normalized_score\": 0.775,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.344788+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.344788+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 1062,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"phi-3.5-moe-instruct\",\n    \"score\": 0.813,\n    \"normalized_score\": 0.813,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.215763+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.215763+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/microsoft/models/phi-3.5-moe-instruct/model.json",
    "content": "{\n  \"model_id\": \"phi-3.5-moe-instruct\",\n  \"name\": \"Phi-3.5-MoE-instruct\",\n  \"organization_id\": \"microsoft\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Phi-3.5-MoE-instruct is a mixture-of-experts model with ~42B total parameters (6.6B active) and a 128K context window. It excels at reasoning, math, coding, and multilingual tasks, outperforming larger dense models in many benchmarks. It underwent a thorough safety post-training process (SFT + DPO) and is licensed under MIT. This model is ideal for scenarios where efficiency and high performance are both required, particularly in multi-lingual or reasoning-intensive tasks.\",\n  \"release_date\": \"2024-08-23\",\n  \"announcement_date\": \"2024-08-23\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 60000000000,\n  \"training_tokens\": 4900000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/abs/2404.14219\",\n  \"source_scorecard_blog_link\": \"https://techcommunity.microsoft.com/blog/azure-ai-services-blog/discover-the-new-multi-lingual-high-quality-phi-3-5-slms/4225280\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/microsoft/Phi-3.5-MoE-instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.555819+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.555819+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/microsoft/models/phi-3.5-vision-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1250,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"phi-3.5-vision-instruct\",\n    \"score\": 0.781,\n    \"normalized_score\": 0.781,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-vision-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.626694+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.626694+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 858,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"phi-3.5-vision-instruct\",\n    \"score\": 0.818,\n    \"normalized_score\": 0.818,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-vision-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.795942+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.795942+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 1520,\n    \"benchmark_id\": \"intergps\",\n    \"model_id\": \"phi-3.5-vision-instruct\",\n    \"score\": 0.363,\n    \"normalized_score\": 0.363,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-vision-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:14.261813+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.261813+00:00\",\n    \"benchmark_name\": \"InterGPS\"\n  },\n  {\n    \"model_benchmark_id\": 520,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"phi-3.5-vision-instruct\",\n    \"score\": 0.439,\n    \"normalized_score\": 0.439,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-vision-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.080462+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.080462+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1509,\n    \"benchmark_id\": \"mmbench\",\n    \"model_id\": \"phi-3.5-vision-instruct\",\n    \"score\": 0.819,\n    \"normalized_score\": 0.819,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-vision-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.238017+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.238017+00:00\",\n    \"benchmark_name\": \"MMBench\"\n  },\n  {\n    \"model_benchmark_id\": 563,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"phi-3.5-vision-instruct\",\n    \"score\": 0.43,\n    \"normalized_score\": 0.43,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-vision-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.158730+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.158730+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1522,\n    \"benchmark_id\": \"pope\",\n    \"model_id\": \"phi-3.5-vision-instruct\",\n    \"score\": 0.861,\n    \"normalized_score\": 0.861,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-vision-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.266959+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.266959+00:00\",\n    \"benchmark_name\": \"POPE\"\n  },\n  {\n    \"model_benchmark_id\": 1519,\n    \"benchmark_id\": \"scienceqa\",\n    \"model_id\": \"phi-3.5-vision-instruct\",\n    \"score\": 0.913,\n    \"normalized_score\": 0.913,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-3.5-vision-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.258220+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.258220+00:00\",\n    \"benchmark_name\": \"ScienceQA\"\n  },\n  {\n    \"model_benchmark_id\": 906,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"phi-3.5-vision-instruct\",\n    \"score\": 0.72,\n    \"normalized_score\": 0.72,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/microsoft/Phi-3.5-vision-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.888892+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.888892+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/microsoft/models/phi-3.5-vision-instruct/model.json",
    "content": "{\n  \"model_id\": \"phi-3.5-vision-instruct\",\n  \"name\": \"Phi-3.5-vision-instruct\",\n  \"organization_id\": \"microsoft\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Phi-3.5-vision-instruct is a 4.2B-parameter open multimodal model with up to 128K context tokens. It emphasizes multi-frame image understanding and reasoning, boosting performance on single-image benchmarks while enabling multi-image comparison, summarization, and even video analysis. The model underwent safety post-training for improved instruction-following, alignment, and robust handling of visual and text inputs, and is released under the MIT license.\",\n  \"release_date\": \"2024-08-23\",\n  \"announcement_date\": \"2024-08-23\",\n  \"license_id\": \"mit\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 4200000000,\n  \"training_tokens\": 500000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/microsoft/Phi-3.5-vision-instruct\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/abs/2404.14219\",\n  \"source_scorecard_blog_link\": \"https://techcommunity.microsoft.com/blog/azure-ai-services-blog/discover-the-new-multi-lingual-high-quality-phi-3-5-slms/4225280\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/microsoft/Phi-3.5-vision-instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.563203+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.563203+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/microsoft/models/phi-4/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1445,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.754,\n    \"normalized_score\": 0.754,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.082804+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.082804+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 947,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.755,\n    \"normalized_score\": 0.755,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.999411+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.999411+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 282,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.561,\n    \"normalized_score\": 0.561,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.644574+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.644574+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n 
   \"model_benchmark_id\": 775,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.826,\n    \"normalized_score\": 0.826,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.628035+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.628035+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1437,\n    \"benchmark_id\": \"humaneval+\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.828,\n    \"normalized_score\": 0.828,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.064824+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.064824+00:00\",\n    \"benchmark_name\": \"HumanEval+\"\n  },\n  {\n    \"model_benchmark_id\": 611,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.63,\n    \"normalized_score\": 0.63,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.261770+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.261770+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    
\"model_benchmark_id\": 746,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.476,\n    \"normalized_score\": 0.476,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.569213+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.569213+00:00\",\n    \"benchmark_name\": \"LiveBench\"\n  },\n  {\n    \"model_benchmark_id\": 389,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.804,\n    \"normalized_score\": 0.804,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.837602+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.837602+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1279,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.806,\n    \"normalized_score\": 0.806,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.681417+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.681417+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 
72,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.848,\n    \"normalized_score\": 0.848,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.236043+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.236043+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 176,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.704,\n    \"normalized_score\": 0.704,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.441164+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.441164+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1466,\n    \"benchmark_id\": \"phibench\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.562,\n    \"normalized_score\": 0.562,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.124860+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.124860+00:00\",\n    \"benchmark_name\": \"PhiBench\"\n  },\n  {\n    \"model_benchmark_id\": 233,\n    
\"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"phi-4\",\n    \"score\": 0.03,\n    \"normalized_score\": 0.03,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2412.08905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"simple-evals\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.546523+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.546523+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/microsoft/models/phi-4/model.json",
    "content": "{\n  \"model_id\": \"phi-4\",\n  \"name\": \"Phi 4\",\n  \"organization_id\": \"microsoft\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"phi-4 is a state-of-the-art open model built to excel at advanced reasoning, coding, and knowledge tasks. It leverages a blend of synthetic data, filtered web data, academic texts, and supervised fine-tuning for precision, alignment, and safety.\",\n  \"release_date\": \"2024-12-12\",\n  \"announcement_date\": \"2024-12-12\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2024-06-01\",\n  \"param_count\": 14700000000,\n  \"training_tokens\": 9800000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/microsoft/phi-4\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/pdf/2412.08905\",\n  \"source_scorecard_blog_link\": \"https://techcommunity.microsoft.com/blog/aiplatformblog/introducing-phi-4-microsoft%E2%80%99s-newest-small-language-model-specializing-in-comple/4357090\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/microsoft/phi-4\",\n  \"created_at\": \"2025-07-19T19:49:05.549276+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.549276+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/microsoft/models/phi-4-mini/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 11,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.837,\n    \"normalized_score\": 0.837,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.105059+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.105059+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1446,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.328,\n    \"normalized_score\": 0.328,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.084727+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.084727+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1076,\n    \"benchmark_id\": \"big-bench-hard\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.704,\n    \"normalized_score\": 0.704,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.242363+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:13.242363+00:00\",\n    \"benchmark_name\": \"BIG-Bench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1023,\n    \"benchmark_id\": \"boolq\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.812,\n    \"normalized_score\": 0.812,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"2-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.129244+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.129244+00:00\",\n    \"benchmark_name\": \"BoolQ\"\n  },\n  {\n    \"model_benchmark_id\": 283,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.252,\n    \"normalized_score\": 0.252,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.646470+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.646470+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 985,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.886,\n    \"normalized_score\": 0.886,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"8-shot, CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.066927+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.066927+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 41,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.691,\n    \"normalized_score\": 0.691,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.169983+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.169983+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 390,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.64,\n    \"normalized_score\": 0.64,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.839081+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.839081+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1280,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.639,\n    \"normalized_score\": 0.639,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.683394+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.683394+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 73,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.673,\n    \"normalized_score\": 0.673,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.237489+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.237489+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 177,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.528,\n    \"normalized_score\": 0.528,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.443019+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.443019+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1473,\n    \"benchmark_id\": \"multilingual-mmlu\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.493,\n    \"normalized_score\": 0.493,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": 
null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.141886+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.141886+00:00\",\n    \"benchmark_name\": \"Multilingual MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1469,\n    \"benchmark_id\": \"openbookqa\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.792,\n    \"normalized_score\": 0.792,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.132301+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.132301+00:00\",\n    \"benchmark_name\": \"OpenBookQA\"\n  },\n  {\n    \"model_benchmark_id\": 1032,\n    \"benchmark_id\": \"piqa\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.776,\n    \"normalized_score\": 0.776,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.150113+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.150113+00:00\",\n    \"benchmark_name\": \"PIQA\"\n  },\n  {\n    \"model_benchmark_id\": 1041,\n    \"benchmark_id\": \"social-iqa\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.725,\n    \"normalized_score\": 0.725,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.172567+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.172567+00:00\",\n    \"benchmark_name\": \"Social IQa\"\n  },\n  {\n    \"model_benchmark_id\": 132,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.664,\n    \"normalized_score\": 0.664,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MC2, 10-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.343180+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.343180+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 149,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"phi-4-mini\",\n    \"score\": 0.67,\n    \"normalized_score\": 0.67,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.382335+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.382335+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/microsoft/models/phi-4-mini/model.json",
    "content": "{\n  \"model_id\": \"phi-4-mini\",\n  \"name\": \"Phi 4 Mini\",\n  \"organization_id\": \"microsoft\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Phi 4 Mini Instruct is a lightweight (3.8B parameters) open model built upon synthetic data and filtered web data, focusing on high-quality reasoning. It supports a 128K token context length and is enhanced for instruction adherence and safety via supervised fine-tuning and direct preference optimization.\",\n  \"release_date\": \"2025-02-01\",\n  \"announcement_date\": \"2025-02-01\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2024-06-01\",\n  \"param_count\": 3840000000,\n  \"training_tokens\": 5000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/pdf/2503.01743\",\n  \"source_scorecard_blog_link\": \"https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/microsoft/Phi-4-mini-instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.552796+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.552796+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/microsoft/models/phi-4-mini-reasoning/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1436,\n    \"benchmark_id\": \"aime\",\n    \"model_id\": \"phi-4-mini-reasoning\",\n    \"score\": 0.575,\n    \"normalized_score\": 0.575,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-reasoning\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.061299+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.061299+00:00\",\n    \"benchmark_name\": \"AIME\"\n  },\n  {\n    \"model_benchmark_id\": 281,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"phi-4-mini-reasoning\",\n    \"score\": 0.52,\n    \"normalized_score\": 0.52,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-reasoning\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.642870+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.642870+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 494,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"phi-4-mini-reasoning\",\n    \"score\": 0.946,\n    \"normalized_score\": 0.946,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-mini-reasoning\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.032863+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.032863+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  }\n]"
  },
  {
    "path": "data/organizations/microsoft/models/phi-4-mini-reasoning/model.json",
    "content": "{\n  \"model_id\": \"phi-4-mini-reasoning\",\n  \"name\": \"Phi 4 Mini Reasoning\",\n  \"organization_id\": \"microsoft\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Phi-4-mini-reasoning is designed for multi-step, logic-intensive mathematical problem-solving tasks under memory/compute constrained environments and latency bound scenarios. Some of the use cases include formal proof generation, symbolic computation, advanced word problems, and a wide range of mathematical reasoning scenarios. These models excel at maintaining context across steps, applying structured logic, and delivering accurate, reliable solutions in domains that require deep analytical thinking.\",\n  \"release_date\": \"2025-04-30\",\n  \"announcement_date\": \"2025-04-30\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2025-02-01\",\n  \"param_count\": 3800000000,\n  \"training_tokens\": 150000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://learn.microsoft.com/en-us/windows/ai/apis/phi-silica?tabs=csharp0,csharp1,csharp2,csharp3\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/pdf/2504.21233\",\n  \"source_scorecard_blog_link\": \"https://azure.microsoft.com/en-us/blog/one-year-of-phi-small-language-models-making-big-leaps-in-ai/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/microsoft/Phi-4-mini-reasoning\",\n  \"created_at\": \"2025-07-19T19:49:05.545846+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.545846+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/microsoft/models/phi-4-multimodal-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1251,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.823,\n    \"normalized_score\": 0.823,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.628230+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.628230+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 1545,\n    \"benchmark_id\": \"blink\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.613,\n    \"normalized_score\": 0.613,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.329567+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.329567+00:00\",\n    \"benchmark_name\": \"BLINK\"\n  },\n  {\n    \"model_benchmark_id\": 859,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.814,\n    \"normalized_score\": 0.814,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:12.797898+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.797898+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 881,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.932,\n    \"normalized_score\": 0.932,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.836095+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.836095+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1241,\n    \"benchmark_id\": \"infovqa\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.727,\n    \"normalized_score\": 0.727,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.609397+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.609397+00:00\",\n    \"benchmark_name\": \"InfoVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1521,\n    \"benchmark_id\": \"intergps\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.486,\n    \"normalized_score\": 0.486,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"testmini\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.263464+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.263464+00:00\",\n    \"benchmark_name\": \"InterGPS\"\n  },\n  {\n    \"model_benchmark_id\": 521,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.624,\n    \"normalized_score\": 0.624,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"testmini\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.082453+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.082453+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1510,\n    \"benchmark_id\": \"mmbench\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.867,\n    \"normalized_score\": 0.867,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"dev-en\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.240071+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.240071+00:00\",\n    \"benchmark_name\": \"MMBench\"\n  },\n  {\n    \"model_benchmark_id\": 564,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.551,\n    \"normalized_score\": 0.551,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.161302+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.161302+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1528,\n    \"benchmark_id\": \"mmmu-pro\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.385,\n    \"normalized_score\": 0.385,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"std/vision\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.285447+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.285447+00:00\",\n    \"benchmark_name\": \"MMMU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1538,\n    \"benchmark_id\": \"ocrbench\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.844,\n    \"normalized_score\": 0.844,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.309778+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.309778+00:00\",\n    \"benchmark_name\": \"OCRBench\"\n  },\n  {\n    \"model_benchmark_id\": 1523,\n    \"benchmark_id\": \"pope\",\n    \"model_id\": 
\"phi-4-multimodal-instruct\",\n    \"score\": 0.856,\n    \"normalized_score\": 0.856,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.268923+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.268923+00:00\",\n    \"benchmark_name\": \"POPE\"\n  },\n  {\n    \"model_benchmark_id\": 1537,\n    \"benchmark_id\": \"scienceqa-visual\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.975,\n    \"normalized_score\": 0.975,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"img-test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.303456+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.303456+00:00\",\n    \"benchmark_name\": \"ScienceQA Visual\"\n  },\n  {\n    \"model_benchmark_id\": 907,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.756,\n    \"normalized_score\": 0.756,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard Evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.890738+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.890738+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1383,\n    \"benchmark_id\": \"video-mme\",\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"score\": 0.55,\n    \"normalized_score\": 0.55,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"16 frames\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.911859+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.911859+00:00\",\n    \"benchmark_name\": \"Video-MME\"\n  }\n]"
  },
  {
    "path": "data/organizations/microsoft/models/phi-4-multimodal-instruct/model.json",
    "content": "{\n  \"model_id\": \"phi-4-multimodal-instruct\",\n  \"name\": \"Phi-4-multimodal-instruct\",\n  \"organization_id\": \"microsoft\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Phi-4-multimodal-instruct is a lightweight (5.57B parameters) open multimodal foundation model that leverages research and datasets from Phi-3.5 and 4.0. It processes text, image, and audio inputs to generate text outputs, supporting a 128K token context length. Enhanced via SFT, DPO, and RLHF for instruction following and safety.\",\n  \"release_date\": \"2025-02-01\",\n  \"announcement_date\": \"2025-02-01\",\n  \"license_id\": \"mit\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-06-01\",\n  \"param_count\": 5600000000,\n  \"training_tokens\": 5000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://ai.azure.com/explore/models?selectedCollection=phi&tid=72f988bf-86f1-41af-91ab-2d7cd011db47\",\n  \"source_paper\": \"https://arxiv.org/abs/2503.01743\",\n  \"source_scorecard_blog_link\": \"https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/microsoft/Phi-4-multimodal-instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.571307+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.571307+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/microsoft/models/phi-4-reasoning/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 450,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"phi-4-reasoning\",\n    \"score\": 0.753,\n    \"normalized_score\": 0.753,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.955706+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.955706+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 688,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"phi-4-reasoning\",\n    \"score\": 0.629,\n    \"normalized_score\": 0.629,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.444086+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.444086+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1450,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"phi-4-reasoning\",\n    \"score\": 0.733,\n    \"normalized_score\": 0.733,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:14.091856+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.091856+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1527,\n    \"benchmark_id\": \"flenqa\",\n    \"model_id\": \"phi-4-reasoning\",\n    \"score\": 0.977,\n    \"normalized_score\": 0.977,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3K-token subset\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.281300+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.281300+00:00\",\n    \"benchmark_name\": \"FlenQA\"\n  },\n  {\n    \"model_benchmark_id\": 287,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"phi-4-reasoning\",\n    \"score\": 0.658,\n    \"normalized_score\": 0.658,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.654843+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.654843+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1439,\n    \"benchmark_id\": \"humaneval+\",\n    \"model_id\": \"phi-4-reasoning\",\n    \"score\": 0.929,\n    \"normalized_score\": 0.929,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.068831+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.068831+00:00\",\n    \"benchmark_name\": \"HumanEval+\"\n  },\n  {\n    \"model_benchmark_id\": 613,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"phi-4-reasoning\",\n    \"score\": 0.834,\n    \"normalized_score\": 0.834,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Strict\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.265033+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.265033+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1114,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"phi-4-reasoning\",\n    \"score\": 0.538,\n    \"normalized_score\": 0.538,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"8/1/24\\u20132/1/25\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.324523+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.324523+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 183,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"phi-4-reasoning\",\n    \"score\": 0.743,\n    \"normalized_score\": 0.743,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard 
evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.453150+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.453150+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1525,\n    \"benchmark_id\": \"omnimath\",\n    \"model_id\": \"phi-4-reasoning\",\n    \"score\": 0.766,\n    \"normalized_score\": 0.766,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.276205+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.276205+00:00\",\n    \"benchmark_name\": \"OmniMath\"\n  },\n  {\n    \"model_benchmark_id\": 1468,\n    \"benchmark_id\": \"phibench\",\n    \"model_id\": \"phi-4-reasoning\",\n    \"score\": 0.706,\n    \"normalized_score\": 0.706,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"2.21\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.127989+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.127989+00:00\",\n    \"benchmark_name\": \"PhiBench\"\n  }\n]"
  },
  {
    "path": "data/organizations/microsoft/models/phi-4-reasoning/model.json",
    "content": "{\n  \"model_id\": \"phi-4-reasoning\",\n  \"name\": \"Phi 4 Reasoning\",\n  \"organization_id\": \"microsoft\",\n  \"fine_tuned_from_model_id\": \"phi-4\",\n  \"description\": \"Phi-4-reasoning is a state-of-the-art open-weight reasoning model finetuned from Phi-4 using supervised fine-tuning on a dataset of chain-of-thought traces and reinforcement learning. It focuses on math, science, and coding skills.\",\n  \"release_date\": \"2025-04-30\",\n  \"announcement_date\": \"2025-04-30\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2025-03-01\",\n  \"param_count\": 14000000000,\n  \"training_tokens\": 16000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://learn.microsoft.com/en-us/windows/ai/apis/phi-silica?tabs=csharp0,csharp1,csharp2,csharp3\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/abs/2504.21318\",\n  \"source_scorecard_blog_link\": \"https://azure.microsoft.com/en-us/blog/one-year-of-phi-small-language-models-making-big-leaps-in-ai/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning\",\n  \"created_at\": \"2025-07-19T19:49:05.879382+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.879382+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/microsoft/models/phi-4-reasoning-plus/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 449,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"phi-4-reasoning-plus\",\n    \"score\": 0.813,\n    \"normalized_score\": 0.813,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.953709+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.953709+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 687,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"phi-4-reasoning-plus\",\n    \"score\": 0.78,\n    \"normalized_score\": 0.78,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.440995+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.440995+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1449,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"phi-4-reasoning-plus\",\n    \"score\": 0.79,\n    \"normalized_score\": 0.79,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:14.090173+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.090173+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1526,\n    \"benchmark_id\": \"flenqa\",\n    \"model_id\": \"phi-4-reasoning-plus\",\n    \"score\": 0.979,\n    \"normalized_score\": 0.979,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3K-token subset\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.279654+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.279654+00:00\",\n    \"benchmark_name\": \"FlenQA\"\n  },\n  {\n    \"model_benchmark_id\": 286,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"phi-4-reasoning-plus\",\n    \"score\": 0.689,\n    \"normalized_score\": 0.689,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.652983+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.652983+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1438,\n    \"benchmark_id\": \"humaneval+\",\n    \"model_id\": \"phi-4-reasoning-plus\",\n    \"score\": 0.923,\n    \"normalized_score\": 0.923,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": 
null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.066904+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.066904+00:00\",\n    \"benchmark_name\": \"HumanEval+\"\n  },\n  {\n    \"model_benchmark_id\": 612,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"phi-4-reasoning-plus\",\n    \"score\": 0.849,\n    \"normalized_score\": 0.849,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Strict\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.263243+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.263243+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1113,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"phi-4-reasoning-plus\",\n    \"score\": 0.531,\n    \"normalized_score\": 0.531,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"8/1/24\\u20132/1/25\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.322076+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.322076+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 182,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"phi-4-reasoning-plus\",\n    \"score\": 0.76,\n    \"normalized_score\": 0.76,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.451685+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.451685+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1524,\n    \"benchmark_id\": \"omnimath\",\n    \"model_id\": \"phi-4-reasoning-plus\",\n    \"score\": 0.819,\n    \"normalized_score\": 0.819,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.274539+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.274539+00:00\",\n    \"benchmark_name\": \"OmniMath\"\n  },\n  {\n    \"model_benchmark_id\": 1467,\n    \"benchmark_id\": \"phibench\",\n    \"model_id\": \"phi-4-reasoning-plus\",\n    \"score\": 0.742,\n    \"normalized_score\": 0.742,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"2.21\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.126449+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.126449+00:00\",\n    \"benchmark_name\": \"PhiBench\"\n  }\n]"
  },
  {
    "path": "data/organizations/microsoft/models/phi-4-reasoning-plus/model.json",
    "content": "{\n  \"model_id\": \"phi-4-reasoning-plus\",\n  \"name\": \"Phi 4 Reasoning Plus\",\n  \"organization_id\": \"microsoft\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model finetuned from Phi-4 using supervised fine-tuning and reinforcement learning. It focuses on math, science, and coding skills. This 'plus' version has higher accuracy due to additional RL training but may have higher latency.\",\n  \"release_date\": \"2025-04-30\",\n  \"announcement_date\": \"2025-04-30\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2025-03-01\",\n  \"param_count\": 14000000000,\n  \"training_tokens\": 16000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://learn.microsoft.com/en-us/windows/ai/apis/phi-silica?tabs=csharp0,csharp1,csharp2,csharp3\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/abs/2504.21318\",\n  \"source_scorecard_blog_link\": \"https://azure.microsoft.com/en-us/blog/one-year-of-phi-small-language-models-making-big-leaps-in-ai/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/microsoft/Phi-4-reasoning-plus\",\n  \"created_at\": \"2025-07-19T19:49:05.567534+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.567534+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/microsoft/organization.json",
    "content": "{\n  \"organization_id\": \"microsoft\",\n  \"name\": \"Microsoft\",\n  \"website\": \"https://microsoft.com\",\n  \"description\": \"Technology company\",\n  \"country\": \"US\",\n  \"created_at\": \"2025-07-19T19:49:05.543205+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.543205+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/mistral/models/codestral-22b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1823,\n    \"benchmark_id\": \"cruxeval-o\",\n    \"model_id\": \"codestral-22b\",\n    \"score\": 0.513,\n    \"normalized_score\": 0.513,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/codestral/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.151317+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.151317+00:00\",\n    \"benchmark_name\": \"CruxEval-O\"\n  },\n  {\n    \"model_benchmark_id\": 809,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"codestral-22b\",\n    \"score\": 0.811,\n    \"normalized_score\": 0.811,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/codestral/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.685855+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.685855+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1827,\n    \"benchmark_id\": \"humaneval-average\",\n    \"model_id\": \"codestral-22b\",\n    \"score\": 0.615,\n    \"normalized_score\": 0.615,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/codestral/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.174206+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.174206+00:00\",\n    
\"benchmark_name\": \"HumanEval-Average\"\n  },\n  {\n    \"model_benchmark_id\": 1826,\n    \"benchmark_id\": \"humanevalfim-average\",\n    \"model_id\": \"codestral-22b\",\n    \"score\": 0.916,\n    \"normalized_score\": 0.916,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/codestral/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.169908+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.169908+00:00\",\n    \"benchmark_name\": \"HumanEvalFIM-Average\"\n  },\n  {\n    \"model_benchmark_id\": 1196,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"codestral-22b\",\n    \"score\": 0.782,\n    \"normalized_score\": 0.782,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/codestral/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.517772+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.517772+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1824,\n    \"benchmark_id\": \"repobench\",\n    \"model_id\": \"codestral-22b\",\n    \"score\": 0.34,\n    \"normalized_score\": 0.34,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/codestral/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.155008+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:15.155008+00:00\",\n    \"benchmark_name\": \"RepoBench\"\n  },\n  {\n    \"model_benchmark_id\": 1825,\n    \"benchmark_id\": \"spider\",\n    \"model_id\": \"codestral-22b\",\n    \"score\": 0.635,\n    \"normalized_score\": 0.635,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/codestral/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.159626+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.159626+00:00\",\n    \"benchmark_name\": \"Spider\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/codestral-22b/model.json",
    "content": "{\n  \"model_id\": \"codestral-22b\",\n  \"name\": \"Codestral-22B\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A 22B parameter code generation model trained on 80+ programming languages including Python, Java, C, C++, JavaScript, and Bash. Supports both instruction-following and fill-in-the-middle (FIM) capabilities for code completion and generation tasks.\",\n  \"release_date\": \"2024-05-29\",\n  \"announcement_date\": \"2024-05-29\",\n  \"license_id\": \"mnpl_0_1\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 22200000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": false,\n  \"source_api_ref\": \"https://docs.mistral.ai/api/\",\n  \"source_playground\": \"https://chat.mistral.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/codestral/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Codestral-22B-v0.1\",\n  \"created_at\": \"2025-07-19T19:49:05.805621+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.805621+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/mistral/models/devstral-medium-2507/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1352,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"devstral-medium-2507\",\n    \"score\": 0.616,\n    \"normalized_score\": 0.616,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/devstral-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"N/A\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.845635+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.845635+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/devstral-medium-2507/model.json",
    "content": "{\n  \"model_id\": \"devstral-medium-2507\",\n  \"name\": \"Devstral Medium\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Devstral Medium builds upon the strengths of Devstral Small and takes performance to the next level with a score of 61.6% on SWE-Bench Verified. Devstral Medium is available through the Mistral public API, and offers exceptional performance at a competitive price point, making it an ideal choice for businesses and developers looking for a high-quality, cost-effective model.\",\n  \"release_date\": \"2025-07-10\",\n  \"announcement_date\": \"2025-07-10\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://console.mistral.ai\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/devstral-2507\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.783461+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.783461+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/devstral-small-2507/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1353,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"devstral-small-2507\",\n    \"score\": 0.536,\n    \"normalized_score\": 0.536,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Devstral-Small-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenHands scaffold\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.847228+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.847228+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/devstral-small-2507/model.json",
    "content": "{\n  \"model_id\": \"devstral-small-2507\",\n  \"name\": \"Devstral Small 1.1\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Devstral Small 1.1 (also called devstral-small-2507) is based on the Mistral-Small-3.1 foundation model and contains approximately 24 billion parameters. It supports a 128k token context window, which allows it to handle multi-file code inputs and long prompts typical in software engineering workflows. The model is fine-tuned specifically for structured outputs, including XML and function-calling formats. This makes it compatible with agent frameworks such as OpenHands and suitable for tasks like program navigation, multi-step edits, and code search. It is licensed under Apache 2.0 and available for both research and commercial use.\",\n  \"release_date\": \"2025-07-11\",\n  \"announcement_date\": \"2025-07-11\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 24000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://console.mistral.ai\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://huggingface.co/mistralai/Devstral-Small-2507\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Devstral-Small-2507/blob/main/model.safetensors.index.json\",\n  \"created_at\": \"2025-07-19T19:49:05.797947+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.797947+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/magistral-medium/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 665,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"magistral-medium\",\n    \"score\": 0.471,\n    \"normalized_score\": 0.471,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2506.10910\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.379075+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.379075+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 480,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"magistral-medium\",\n    \"score\": 0.736,\n    \"normalized_score\": 0.736,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2506.10910\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.011044+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.011044+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 704,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"magistral-medium\",\n    \"score\": 0.649,\n    \"normalized_score\": 0.649,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2506.10910\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.473748+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.473748+00:00\",\n    
\"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 343,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"magistral-medium\",\n    \"score\": 0.708,\n    \"normalized_score\": 0.708,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2506.10910\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.745089+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.745089+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 724,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"magistral-medium\",\n    \"score\": 0.09,\n    \"normalized_score\": 0.09,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2506.10910\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"text subset\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.525031+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.525031+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 1145,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"magistral-medium\",\n    \"score\": 0.503,\n    \"normalized_score\": 0.503,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/pdf/2506.10910\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"v6\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.408465+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:13.410002+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/magistral-medium/model.json",
    "content": "{\n  \"model_id\": \"magistral-medium\",\n  \"name\": \"Magistral Medium\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Trained solely with reinforcement learning on top of Mistral Medium 3, Magistral Medium is a reasoning model that achieves strong performance on complex math and code tasks without relying on distillation from existing reasoning models. The training uses an RLVR framework with modifications to GRPO, enabling improved reasoning ability and multilingual consistency.\",\n  \"release_date\": \"2025-06-10\",\n  \"announcement_date\": \"2025-06-10\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2025-06-01\",\n  \"param_count\": 24000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.mistral.ai/api/\",\n  \"source_playground\": \"https://chat.mistral.ai/\",\n  \"source_paper\": \"https://arxiv.org/pdf/2506.10910\",\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/magistral\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.780565+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.780565+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/magistral-small-2506/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 479,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"magistral-small-2506\",\n    \"score\": 0.7068,\n    \"normalized_score\": 0.7068,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Magistral-Small-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.009597+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.009597+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 703,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"magistral-small-2506\",\n    \"score\": 0.6276,\n    \"normalized_score\": 0.6276,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Magistral-Small-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.471565+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.471565+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 342,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"magistral-small-2506\",\n    \"score\": 0.6818,\n    \"normalized_score\": 0.6818,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Magistral-Small-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.743610+00:00\",\n   
 \"updated_at\": \"2025-07-19T19:56:11.743610+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1144,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"magistral-small-2506\",\n    \"score\": 0.513,\n    \"normalized_score\": 0.513,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/codestral/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"v5\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.406640+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.406640+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/magistral-small-2506/model.json",
    "content": "{\n  \"model_id\": \"magistral-small-2506\",\n  \"name\": \"Magistral Small 2506\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Building upon Mistral Small 3.1 (2503), with added reasoning capabilities, undergoing SFT from Magistral Medium traces and RL on top, it's a small, efficient reasoning model with 24B parameters. Magistral Small can be deployed locally, fitting within a single RTX 4090 or a 32GB RAM MacBook once quantized.\",\n  \"release_date\": \"2025-06-10\",\n  \"announcement_date\": \"2025-06-10\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2025-06-01\",\n  \"param_count\": 24000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.mistral.ai/api/\",\n  \"source_playground\": \"https://chat.mistral.ai/\",\n  \"source_paper\": \"https://arxiv.org/pdf/2506.10910\",\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/magistral\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Magistral-Small-2506\",\n  \"created_at\": \"2025-07-19T19:49:05.777162+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.777162+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/ministral-8b-instruct-2410/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1410,\n    \"benchmark_id\": \"agieval\",\n    \"model_id\": \"ministral-8b-instruct-2410\",\n    \"score\": 0.483,\n    \"normalized_score\": 0.483,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.978647+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.978647+00:00\",\n    \"benchmark_name\": \"AGIEval\"\n  },\n  {\n    \"model_benchmark_id\": 30,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"ministral-8b-instruct-2410\",\n    \"score\": 0.719,\n    \"normalized_score\": 0.719,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.142536+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.142536+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1464,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"ministral-8b-instruct-2410\",\n    \"score\": 0.709,\n    \"normalized_score\": 0.709,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:14.118772+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.118772+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1820,\n    \"benchmark_id\": \"french-mmlu\",\n    \"model_id\": \"ministral-8b-instruct-2410\",\n    \"score\": 0.575,\n    \"normalized_score\": 0.575,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.137792+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.137792+00:00\",\n    \"benchmark_name\": \"French MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 806,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"ministral-8b-instruct-2410\",\n    \"score\": 0.348,\n    \"normalized_score\": 0.348,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.681246+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.681246+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 422,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"ministral-8b-instruct-2410\",\n    \"score\": 0.545,\n    \"normalized_score\": 0.545,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.895272+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.895272+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1821,\n    \"benchmark_id\": \"mbpp-pass@1\",\n    \"model_id\": \"ministral-8b-instruct-2410\",\n    \"score\": 0.7,\n    \"normalized_score\": 0.7,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.141858+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.141858+00:00\",\n    \"benchmark_name\": \"MBPP pass@1\"\n  },\n  {\n    \"model_benchmark_id\": 112,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"ministral-8b-instruct-2410\",\n    \"score\": 0.65,\n    \"normalized_score\": 0.65,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.309619+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.309619+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1612,\n    \"benchmark_id\": \"mt-bench\",\n    \"model_id\": \"ministral-8b-instruct-2410\",\n    \"score\": 0.83,\n    \"normalized_score\": 0.83,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.535003+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.535003+00:00\",\n    \"benchmark_name\": \"MT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 253,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"ministral-8b-instruct-2410\",\n    \"score\": 0.655,\n    \"normalized_score\": 0.655,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.582765+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.582765+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  },\n  {\n    \"model_benchmark_id\": 155,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"ministral-8b-instruct-2410\",\n    \"score\": 0.753,\n    \"normalized_score\": 0.753,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.394106+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.394106+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/ministral-8b-instruct-2410/model.json",
    "content": "{\n  \"model_id\": \"ministral-8b-instruct-2410\",\n  \"name\": \"Ministral 8B Instruct\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"The Ministral-8B-Instruct-2410 is an instruct fine-tuned model for local intelligence, on-device computing, and at-the-edge use cases, significantly outperforming existing models of similar size.\",\n  \"release_date\": \"2024-10-16\",\n  \"announcement_date\": \"2024-10-16\",\n  \"license_id\": \"mistral_research_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 8019808256,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/ministraux/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\",\n  \"created_at\": \"2025-07-19T19:49:05.786083+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.786083+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/mistral-large-2-2407/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1014,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"mistral-large-2-2407\",\n    \"score\": 0.93,\n    \"normalized_score\": 0.93,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Large-Instruct-2407\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.113392+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.113392+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 810,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"mistral-large-2-2407\",\n    \"score\": 0.92,\n    \"normalized_score\": 0.92,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Large-Instruct-2407\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.687406+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.687406+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 116,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"mistral-large-2-2407\",\n    \"score\": 0.84,\n    \"normalized_score\": 0.84,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/mistral-large-2407/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.316024+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:11.316024+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1828,\n    \"benchmark_id\": \"mmlu-french\",\n    \"model_id\": \"mistral-large-2-2407\",\n    \"score\": 0.828,\n    \"normalized_score\": 0.828,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Large-Instruct-2407\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.178056+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.178056+00:00\",\n    \"benchmark_name\": \"MMLU French\"\n  },\n  {\n    \"model_benchmark_id\": 1615,\n    \"benchmark_id\": \"mt-bench\",\n    \"model_id\": \"mistral-large-2-2407\",\n    \"score\": 0.863,\n    \"normalized_score\": 0.863,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Large-Instruct-2407\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.541051+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.541051+00:00\",\n    \"benchmark_name\": \"MT-Bench\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/mistral-large-2-2407/model.json",
    "content": "{\n  \"model_id\": \"mistral-large-2-2407\",\n  \"name\": \"Mistral Large 2\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A 123B parameter model with strong capabilities in code generation, mathematics, and reasoning. Features enhanced multilingual support across dozens of languages, 128k context window, and advanced function calling capabilities. Excels in instruction-following and maintains concise outputs.\",\n  \"release_date\": \"2024-07-24\",\n  \"announcement_date\": \"2024-07-24\",\n  \"license_id\": \"mistral_research_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 123000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.mistral.ai/\",\n  \"source_playground\": \"https://chat.mistral.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/mistral-large-2407/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Mistral-Large-Instruct-2407\",\n  \"created_at\": \"2025-07-19T19:49:05.813974+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.813974+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/mistral-nemo-instruct-2407/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1819,\n    \"benchmark_id\": \"commonsenseqa\",\n    \"model_id\": \"mistral-nemo-instruct-2407\",\n    \"score\": 0.704,\n    \"normalized_score\": 0.704,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.133096+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.133096+00:00\",\n    \"benchmark_name\": \"CommonSenseQA\"\n  },\n  {\n    \"model_benchmark_id\": 54,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"mistral-nemo-instruct-2407\",\n    \"score\": 0.835,\n    \"normalized_score\": 0.835,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.196732+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.196732+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 111,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"mistral-nemo-instruct-2407\",\n    \"score\": 0.68,\n    \"normalized_score\": 0.68,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.308247+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.308247+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1050,\n    \"benchmark_id\": \"natural-questions\",\n    \"model_id\": \"mistral-nemo-instruct-2407\",\n    \"score\": 0.312,\n    \"normalized_score\": 0.312,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.191770+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.191770+00:00\",\n    \"benchmark_name\": \"Natural Questions\"\n  },\n  {\n    \"model_benchmark_id\": 1472,\n    \"benchmark_id\": \"openbookqa\",\n    \"model_id\": \"mistral-nemo-instruct-2407\",\n    \"score\": 0.606,\n    \"normalized_score\": 0.606,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.138075+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.138075+00:00\",\n    \"benchmark_name\": \"OpenBookQA\"\n  },\n  {\n    \"model_benchmark_id\": 252,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"mistral-nemo-instruct-2407\",\n    \"score\": 0.738,\n    \"normalized_score\": 0.738,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407\",\n    \"verified_by_llmstats\": false,\n 
   \"analysis_method\": \"5-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.581108+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.581108+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  },\n  {\n    \"model_benchmark_id\": 146,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"mistral-nemo-instruct-2407\",\n    \"score\": 0.503,\n    \"normalized_score\": 0.503,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.369082+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.369082+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 154,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"mistral-nemo-instruct-2407\",\n    \"score\": 0.768,\n    \"normalized_score\": 0.768,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.392106+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.392106+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/mistral-nemo-instruct-2407/model.json",
    "content": "{\n  \"model_id\": \"mistral-nemo-instruct-2407\",\n  \"name\": \"Mistral NeMo Instruct\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A state-of-the-art 12B multilingual model with a 128k context window, designed for global applications and strong in multiple languages.\",\n  \"release_date\": \"2024-07-18\",\n  \"announcement_date\": \"2024-07-18\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 12000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.mistral.ai/getting-started/models/models_overview/\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/mistral-nemo/\",\n  \"source_repo_link\": \"https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407\",\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407\",\n  \"created_at\": \"2025-07-19T19:49:05.773595+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.773595+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/mistral-small-2409/model.json",
    "content": "{\n  \"model_id\": \"mistral-small-2409\",\n  \"name\": \"Mistral Small\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"An enterprise-grade 22B parameter model optimized for tasks like translation, summarization, and sentiment analysis. Offers significant improvements in human alignment, reasoning capabilities, and code generation compared to previous versions.\",\n  \"release_date\": \"2024-09-17\",\n  \"announcement_date\": \"2024-09-17\",\n  \"license_id\": \"mistral_research_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 22000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.mistral.ai/api/\",\n  \"source_playground\": \"https://console.mistral.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/september-24-release/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Mistral-Small-Instruct-2409\",\n  \"created_at\": \"2025-07-19T19:49:05.809465+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.809465+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/mistral-small-24b-base-2501/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1411,\n    \"benchmark_id\": \"agieval\",\n    \"model_id\": \"mistral-small-24b-base-2501\",\n    \"score\": 0.658,\n    \"normalized_score\": 0.658,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.980585+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.980585+00:00\",\n    \"benchmark_name\": \"AGIEval\"\n  },\n  {\n    \"model_benchmark_id\": 31,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"mistral-small-24b-base-2501\",\n    \"score\": 0.9129,\n    \"normalized_score\": 0.9129,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.143960+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.143960+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 345,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"mistral-small-24b-base-2501\",\n    \"score\": 0.3437,\n    \"normalized_score\": 0.3437,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot, CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:11.748111+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.748111+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1013,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"mistral-small-24b-base-2501\",\n    \"score\": 0.8073,\n    \"normalized_score\": 0.8073,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot, maj@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.111924+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.111924+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 424,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"mistral-small-24b-base-2501\",\n    \"score\": 0.4598,\n    \"normalized_score\": 0.4598,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot, MaJ\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.898806+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.898806+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1195,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"mistral-small-24b-base-2501\",\n    \"score\": 0.6964,\n    \"normalized_score\": 0.6964,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.516399+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.516399+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 113,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"mistral-small-24b-base-2501\",\n    \"score\": 0.8073,\n    \"normalized_score\": 0.8073,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.311218+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.311218+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 217,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"mistral-small-24b-base-2501\",\n    \"score\": 0.5437,\n    \"normalized_score\": 0.5437,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.511957+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.511957+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 254,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"mistral-small-24b-base-2501\",\n    \"score\": 0.8032,\n    \"normalized_score\": 0.8032,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.585944+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.585944+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/mistral-small-24b-base-2501/model.json",
    "content": "{\n  \"model_id\": \"mistral-small-24b-base-2501\",\n  \"name\": \"Mistral Small 3 24B Base\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Mistral Small 3 is competitive with larger models such as Llama 3.3 70B or Qwen 32B, and is an excellent open replacement for opaque proprietary models like GPT4o-mini. Mistral Small 3 is on par with Llama 3.3 70B instruct, while being more than 3x faster on the same hardware.\",\n  \"release_date\": \"2025-01-30\",\n  \"announcement_date\": \"2025-01-30\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2023-10-01\",\n  \"param_count\": 23600000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://console.mistral.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/mistral-small-3\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501\",\n  \"created_at\": \"2025-07-19T19:49:05.791166+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.791166+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/mistral-small-24b-instruct-2501/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1465,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"mistral-small-24b-instruct-2501\",\n    \"score\": 0.876,\n    \"normalized_score\": 0.876,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.120697+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.120697+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 344,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"mistral-small-24b-instruct-2501\",\n    \"score\": 0.453,\n    \"normalized_score\": 0.453,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5 shot COT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.746578+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.746578+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 807,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"mistral-small-24b-instruct-2501\",\n    \"score\": 0.848,\n    \"normalized_score\": 0.848,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5 shot COT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:12.682647+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.682647+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 630,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"mistral-small-24b-instruct-2501\",\n    \"score\": 0.829,\n    \"normalized_score\": 0.829,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.295754+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.295754+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 423,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"mistral-small-24b-instruct-2501\",\n    \"score\": 0.706,\n    \"normalized_score\": 0.706,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"instruct\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.896887+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.896887+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 216,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"mistral-small-24b-instruct-2501\",\n    \"score\": 0.663,\n    \"normalized_score\": 0.663,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5 shot COT\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.510254+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.510254+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1613,\n    \"benchmark_id\": \"mt-bench\",\n    \"model_id\": \"mistral-small-24b-instruct-2501\",\n    \"score\": 0.835,\n    \"normalized_score\": 0.835,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.537073+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.537073+00:00\",\n    \"benchmark_name\": \"MT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 1818,\n    \"benchmark_id\": \"wild-bench\",\n    \"model_id\": \"mistral-small-24b-instruct-2501\",\n    \"score\": 0.522,\n    \"normalized_score\": 0.522,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.128734+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.128734+00:00\",\n    \"benchmark_name\": \"Wild Bench\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/mistral-small-24b-instruct-2501/model.json",
    "content": "{\n  \"model_id\": \"mistral-small-24b-instruct-2501\",\n  \"name\": \"Mistral Small 3 24B Instruct\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Mistral Small 3 is a 24B-parameter LLM licensed under Apache-2.0. It focuses on low-latency, high-efficiency instruction following, maintaining performance comparable to larger models. It provides quick, accurate responses for conversational agents, function calling, and domain-specific fine-tuning. Suitable for local inference when quantized, it rivals models 2\\u20133\\u00d7 its size while using significantly fewer compute resources.\",\n  \"release_date\": \"2025-01-30\",\n  \"announcement_date\": \"2025-01-30\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2023-10-01\",\n  \"param_count\": 24000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.mistral.ai/api/\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/mistral-small-3/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501\",\n  \"created_at\": \"2025-07-19T19:49:05.788628+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.788628+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/mistral-small-3.1-24b-base-2503/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 346,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"mistral-small-3.1-24b-base-2503\",\n    \"score\": 0.375,\n    \"normalized_score\": 0.375,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.749533+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.749533+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 114,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"mistral-small-3.1-24b-base-2503\",\n    \"score\": 0.8101,\n    \"normalized_score\": 0.8101,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.312907+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.312907+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 218,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"mistral-small-3.1-24b-base-2503\",\n    \"score\": 0.5603,\n    \"normalized_score\": 0.5603,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:11.513719+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.513719+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 587,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"mistral-small-3.1-24b-base-2503\",\n    \"score\": 0.5927,\n    \"normalized_score\": 0.5927,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"CoT accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.207080+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.207080+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 255,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"mistral-small-3.1-24b-base-2503\",\n    \"score\": 0.805,\n    \"normalized_score\": 0.805,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.587622+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.587622+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/mistral/models/mistral-small-3.1-24b-base-2503/model.json",
    "content": "{\n  \"model_id\": \"mistral-small-3.1-24b-base-2503\",\n  \"name\": \"Mistral Small 3.1 24B Base\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Pretrained base model version of Mistral Small 3.1. Features improved text performance, multimodal understanding, multilingual capabilities, and an expanded 128k token context window compared to Mistral Small 3. Designed for fine-tuning.\",\n  \"release_date\": \"2025-03-17\",\n  \"announcement_date\": \"2025-03-17\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 24000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://console.mistral.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/mistral-small-3-1\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503\",\n  \"created_at\": \"2025-07-19T19:49:05.793911+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.793911+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/mistral-small-3.1-24b-instruct-2503/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 340,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"mistral-small-3.1-24b-instruct-2503\",\n    \"score\": 0.4596,\n    \"normalized_score\": 0.4596,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond, 5-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.740584+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.741944+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 805,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"mistral-small-3.1-24b-instruct-2503\",\n    \"score\": 0.8841,\n    \"normalized_score\": 0.8841,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.677771+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.677771+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 421,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"mistral-small-3.1-24b-instruct-2503\",\n    \"score\": 0.693,\n    \"normalized_score\": 0.693,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n 
   \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.893255+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.893255+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1194,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"mistral-small-3.1-24b-instruct-2503\",\n    \"score\": 0.7471,\n    \"normalized_score\": 0.7471,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.514872+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.514872+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 110,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"mistral-small-3.1-24b-instruct-2503\",\n    \"score\": 0.8062,\n    \"normalized_score\": 0.8062,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.306426+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.306426+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 215,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"mistral-small-3.1-24b-instruct-2503\",\n    \"score\": 0.6676,\n    \"normalized_score\": 0.6676,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503\",\n    \"verified_by_llmstats\": false,\n  
  \"analysis_method\": \"5-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.508555+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.508555+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 585,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"mistral-small-3.1-24b-instruct-2503\",\n    \"score\": 0.5927,\n    \"normalized_score\": 0.5927,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"CoT accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.203401+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.203401+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 237,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"mistral-small-3.1-24b-instruct-2503\",\n    \"score\": 0.1043,\n    \"normalized_score\": 0.1043,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"TotalAcc, Correct\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.552923+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.552923+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 251,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"mistral-small-3.1-24b-instruct-2503\",\n    \"score\": 0.805,\n    \"normalized_score\": 
0.805,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.579482+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.579482+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/mistral-small-3.1-24b-instruct-2503/model.json",
    "content": "{\n  \"model_id\": \"mistral-small-3.1-24b-instruct-2503\",\n  \"name\": \"Mistral Small 3.1 24B Instruct\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.\",\n  \"release_date\": \"2025-03-17\",\n  \"announcement_date\": \"2025-03-17\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 24000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://console.mistral.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/mistral-small-3-1\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503\",\n  \"created_at\": \"2025-07-19T19:49:05.770816+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.770816+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/mistral-small-3.2-24b-instruct-2506/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 16767,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.9291,\n    \"normalized_score\": 0.9291,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.105841+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.105841+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 16768,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.431,\n    \"normalized_score\": 0.431,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"v2\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.107885+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.107885+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 16769,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.874,\n    \"normalized_score\": 0.874,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.109760+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.109760+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 16770,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.9486,\n    \"normalized_score\": 0.9486,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.111977+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.111977+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 16771,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.4422,\n    \"normalized_score\": 0.4422,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Main, 5-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.113518+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.113518+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 16772,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.4613,\n    \"normalized_score\": 0.4613,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"Diamond, 5-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.115179+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.115179+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 16773,\n    \"benchmark_id\": \"humaneval-plus\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.929,\n    \"normalized_score\": 0.929,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@5\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.116763+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.116763+00:00\",\n    \"benchmark_name\": \"HumanEval Plus\"\n  },\n  {\n    \"model_benchmark_id\": 16774,\n    \"benchmark_id\": \"if\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.8478,\n    \"normalized_score\": 0.8478,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.118250+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.118250+00:00\",\n    \"benchmark_name\": \"IF\"\n  },\n  {\n    \"model_benchmark_id\": 16775,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.6942,\n    \"normalized_score\": 0.6942,\n    
\"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.119723+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.119723+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 16776,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.6709,\n    \"normalized_score\": 0.6709,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.121246+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.121246+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 16777,\n    \"benchmark_id\": \"mbpp-plus\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.7833,\n    \"normalized_score\": 0.7833,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@5\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.122828+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.122828+00:00\",\n    \"benchmark_name\": \"MBPP Plus\"\n  },\n  {\n    
\"model_benchmark_id\": 16778,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.805,\n    \"normalized_score\": 0.805,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.124220+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.124220+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 16779,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.6906,\n    \"normalized_score\": 0.6906,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.125972+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.125972+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 16780,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.625,\n    \"normalized_score\": 0.625,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"-\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-08-03T22:06:15.127425+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.127425+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 16781,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.121,\n    \"normalized_score\": 0.121,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"TotalAcc\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.129114+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.129114+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 16782,\n    \"benchmark_id\": \"wild-bench\",\n    \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n    \"score\": 0.6533,\n    \"normalized_score\": 0.6533,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"v2\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:15.130665+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:15.130665+00:00\",\n    \"benchmark_name\": \"Wild Bench\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/mistral/models/mistral-small-3.2-24b-instruct-2506/model.json",
    "content": "{\n  \"model_id\": \"mistral-small-3.2-24b-instruct-2506\",\n  \"name\": \"Mistral Small 3.2 24B Instruct\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": \"mistral-small-3.1-24b-base-2503\",\n  \"description\": \"Mistral-Small-3.2-24B-Instruct-2506 is a minor update of Mistral-Small-3.1-24B-Instruct-2503.\",\n  \"release_date\": \"2025-06-20\",\n  \"announcement_date\": \"2025-06-20\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2023-10-01\",\n  \"param_count\": 23600000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://console.mistral.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506\",\n  \"created_at\": \"2025-08-03T22:06:11.933573+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:11.933573+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/pixtral-12b-2409/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 874,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"pixtral-12b-2409\",\n    \"score\": 0.818,\n    \"normalized_score\": 0.818,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-12b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Chain of Thought (CoT)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.822444+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.822444+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 899,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"pixtral-12b-2409\",\n    \"score\": 0.907,\n    \"normalized_score\": 0.907,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-12b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"ANLS\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.871485+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.871485+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 808,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"pixtral-12b-2409\",\n    \"score\": 0.72,\n    \"normalized_score\": 0.72,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-12b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.684555+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.684555+00:00\",\n    
\"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 631,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"pixtral-12b-2409\",\n    \"score\": 0.613,\n    \"normalized_score\": 0.613,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-12b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Text Instruction Following Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.297384+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.297384+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 425,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"pixtral-12b-2409\",\n    \"score\": 0.481,\n    \"normalized_score\": 0.481,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-12b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.900275+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.900275+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 537,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"pixtral-12b-2409\",\n    \"score\": 0.58,\n    \"normalized_score\": 0.58,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-12b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Chain of Thought (CoT)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.111272+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.111272+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1822,\n    \"benchmark_id\": \"mm-if-eval\",\n    \"model_id\": \"pixtral-12b-2409\",\n    \"score\": 0.527,\n    \"normalized_score\": 0.527,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-12b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Multimodal Instruction Following Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.145578+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.145578+00:00\",\n    \"benchmark_name\": \"MM IF-Eval\"\n  },\n  {\n    \"model_benchmark_id\": 115,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"pixtral-12b-2409\",\n    \"score\": 0.692,\n    \"normalized_score\": 0.692,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-12b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.314507+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.314507+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1733,\n    \"benchmark_id\": \"mm-mt-bench\",\n    \"model_id\": \"pixtral-12b-2409\",\n    \"score\": 0.605,\n    \"normalized_score\": 0.605,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-12b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Multimodal MT-Bench Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:14.887276+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.887276+00:00\",\n    \"benchmark_name\": \"MM-MT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 588,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"pixtral-12b-2409\",\n    \"score\": 0.525,\n    \"normalized_score\": 0.525,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-12b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Chain of Thought (CoT)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.209409+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.209409+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1614,\n    \"benchmark_id\": \"mt-bench\",\n    \"model_id\": \"pixtral-12b-2409\",\n    \"score\": 0.768,\n    \"normalized_score\": 0.768,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-12b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Text MT-Bench Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.539185+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.539185+00:00\",\n    \"benchmark_name\": \"MT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 1575,\n    \"benchmark_id\": \"vqav2\",\n    \"model_id\": \"pixtral-12b-2409\",\n    \"score\": 0.786,\n    \"normalized_score\": 0.786,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-12b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"VQA Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": 
null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.416120+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.416120+00:00\",\n    \"benchmark_name\": \"VQAv2\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/pixtral-12b-2409/model.json",
    "content": "{\n  \"model_id\": \"pixtral-12b-2409\",\n  \"name\": \"Pixtral-12B\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A 12B parameter multimodal model with a 400M parameter vision encoder, capable of understanding both natural images and documents. Excels at multimodal tasks while maintaining strong text-only performance. Supports variable image sizes and multiple images in context.\",\n  \"release_date\": \"2024-09-17\",\n  \"announcement_date\": \"2024-09-17\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 12400000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.mistral.ai/platform/endpoints/\",\n  \"source_playground\": \"https://chat.mistral.ai\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/pixtral-12b/\",\n  \"source_repo_link\": \"https://huggingface.co/mistralai/Pixtral-12B-2409\",\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Pixtral-12B-2409\",\n  \"created_at\": \"2025-07-19T19:49:05.802013+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.802013+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/models/pixtral-large/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1261,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"pixtral-large\",\n    \"score\": 0.938,\n    \"normalized_score\": 0.938,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-large/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"BBox\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.645378+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.645378+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 873,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"pixtral-large\",\n    \"score\": 0.881,\n    \"normalized_score\": 0.881,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-large/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.820802+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.820802+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 898,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"pixtral-large\",\n    \"score\": 0.933,\n    \"normalized_score\": 0.933,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-large/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"ANLS\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.869454+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.869454+00:00\",\n    \"benchmark_name\": 
\"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 536,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"pixtral-large\",\n    \"score\": 0.694,\n    \"normalized_score\": 0.694,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-large/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.109764+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.109764+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1732,\n    \"benchmark_id\": \"mm-mt-bench\",\n    \"model_id\": \"pixtral-large\",\n    \"score\": 0.74,\n    \"normalized_score\": 0.74,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-large/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o Judge\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.885715+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.885715+00:00\",\n    \"benchmark_name\": \"MM-MT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 586,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"pixtral-large\",\n    \"score\": 0.64,\n    \"normalized_score\": 0.64,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-large/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.205240+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.205240+00:00\",\n    
\"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1574,\n    \"benchmark_id\": \"vqav2\",\n    \"model_id\": \"pixtral-large\",\n    \"score\": 0.809,\n    \"normalized_score\": 0.809,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://mistral.ai/news/pixtral-large/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"VQA Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.414450+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.414450+00:00\",\n    \"benchmark_name\": \"VQAv2\"\n  }\n]"
  },
  {
    "path": "data/organizations/mistral/models/pixtral-large/model.json",
    "content": "{\n  \"model_id\": \"pixtral-large\",\n  \"name\": \"Pixtral Large\",\n  \"organization_id\": \"mistral\",\n  \"fine_tuned_from_model_id\": \"mistral-large-2-2407\",\n  \"description\": \"A 124B parameter multimodal model built on top of Mistral Large 2, featuring frontier-level image understanding capabilities. Excels at understanding documents, charts, and natural images while maintaining strong text-only performance. Features a 123B multimodal decoder and 1B parameter vision encoder with a 128K context window supporting up to 30 high-resolution images.\",\n  \"release_date\": \"2024-11-18\",\n  \"announcement_date\": \"2024-11-18\",\n  \"license_id\": \"mistral_research_license_(mrl)_for_research;_mistral_commercial_license_for_commercial_use\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 124000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://mistral.ai/\",\n  \"source_playground\": \"https://chat.mistral.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://mistral.ai/news/pixtral-large/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411\",\n  \"created_at\": \"2025-07-19T19:49:05.913427+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.913427+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/mistral/organization.json",
    "content": "{\n  \"organization_id\": \"mistral\",\n  \"name\": \"Mistral AI\",\n  \"website\": \"https://mistral.ai\",\n  \"description\": \"French AI company\",\n  \"country\": \"FR\",\n  \"created_at\": \"2025-07-19T19:49:05.769198+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.769198+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/moonshotai/models/kimi-k1.5/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 444,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"kimi-k1.5\",\n    \"score\": 0.775,\n    \"normalized_score\": 0.775,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-k1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.945090+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.945090+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 435,\n    \"benchmark_id\": \"c-eval\",\n    \"model_id\": \"kimi-k1.5\",\n    \"score\": 0.883,\n    \"normalized_score\": 0.883,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-k1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Exact Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.922484+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.922484+00:00\",\n    \"benchmark_name\": \"C-Eval\"\n  },\n  {\n    \"model_benchmark_id\": 599,\n    \"benchmark_id\": \"cluewsc\",\n    \"model_id\": \"kimi-k1.5\",\n    \"score\": 0.914,\n    \"normalized_score\": 0.914,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-k1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Exact Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.236097+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.236097+00:00\",\n    
\"benchmark_name\": \"CLUEWSC\"\n  },\n  {\n    \"model_benchmark_id\": 602,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"kimi-k1.5\",\n    \"score\": 0.872,\n    \"normalized_score\": 0.872,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-k1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Exact Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.244895+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.244895+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 514,\n    \"benchmark_id\": \"livecodebench-v5-24.12-25.2\",\n    \"model_id\": \"kimi-k1.5\",\n    \"score\": 0.625,\n    \"normalized_score\": 0.625,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-k1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.068737+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.068737+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v5 24.12-25.2\"\n  },\n  {\n    \"model_benchmark_id\": 492,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"kimi-k1.5\",\n    \"score\": 0.962,\n    \"normalized_score\": 0.962,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-k1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Exact Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.029931+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:12.029931+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    \"model_benchmark_id\": 515,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"kimi-k1.5\",\n    \"score\": 0.749,\n    \"normalized_score\": 0.749,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-k1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.071814+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.071814+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 58,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"kimi-k1.5\",\n    \"score\": 0.874,\n    \"normalized_score\": 0.874,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-k1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Exact Match\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.207582+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.207582+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 549,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"kimi-k1.5\",\n    \"score\": 0.7,\n    \"normalized_score\": 0.7,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-k1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.132422+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:12.132422+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  }\n]"
  },
  {
    "path": "data/organizations/moonshotai/models/kimi-k1.5/model.json",
    "content": "{\n  \"model_id\": \"kimi-k1.5\",\n  \"name\": \"Kimi-k1.5\",\n  \"organization_id\": \"moonshotai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Kimi k1.5 is a next-generation multimodal large language model developed by Moonshot AI. It incorporates advanced reinforcement learning (RL) and scalable multimodal reasoning, delivering state-of-the-art performance in math, code, vision, and long-context reasoning tasks.\",\n  \"release_date\": \"2025-01-20\",\n  \"announcement_date\": \"2025-01-20\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.moonshot.cn/docs/api-reference\",\n  \"source_playground\": \"https://kimi.ai/\",\n  \"source_paper\": \"https://arxiv.org/abs/2501.12599\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/MoonshotAI/Kimi-k1.5\",\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.426406+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.426406+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/moonshotai/models/kimi-k2-0905/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 9001,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"kimi-k2-0905\",\n    \"score\": 0.758,\n    \"normalized_score\": 0.758,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshot.cn/blog/kimi-k2-0905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2024-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2024-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9002,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"kimi-k2-0905\",\n    \"score\": 0.902,\n    \"normalized_score\": 0.902,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshot.cn/blog/kimi-k2-0905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2024-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2024-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 9003,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"kimi-k2-0905\",\n    \"score\": 0.891,\n    \"normalized_score\": 0.891,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshot.cn/blog/kimi-k2-0905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2024-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2024-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    
\"model_benchmark_id\": 9004,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"kimi-k2-0905\",\n    \"score\": 0.945,\n    \"normalized_score\": 0.945,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshot.cn/blog/kimi-k2-0905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2024-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2024-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 9005,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"kimi-k2-0905\",\n    \"score\": 0.825,\n    \"normalized_score\": 0.825,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshot.cn/blog/kimi-k2-0905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2024-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2024-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 9006,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"kimi-k2-0905\",\n    \"score\": 0.72,\n    \"normalized_score\": 0.72,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshot.cn/blog/kimi-k2-0905\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2024-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2024-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/moonshotai/models/kimi-k2-0905/model.json",
    "content": "{\n  \"model_id\": \"kimi-k2-0905\",\n  \"name\": \"Kimi K2 0905\",\n  \"organization_id\": \"moonshotai\",\n  \"fine_tuned_from_model_id\": \"kimi-k2-instruct\",\n  \"description\": \"Kimi K2 0905 is the September update of Kimi K2 0711. It is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It supports long-context inference up to 256k tokens, extended from the previous 128k. This update improves agentic coding with higher accuracy and better generalization across scaffolds, and enhances frontend coding with more aesthetic and functional outputs for web, 3D, and related tasks. The model is trained with a novel stack incorporating the MuonClip optimizer for stable large-scale MoE training.\",\n  \"release_date\": \"2025-09-05\",\n  \"announcement_date\": \"2025-09-05\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 1000000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.moonshot.cn/\",\n  \"source_playground\": \"https://kimi.moonshot.cn/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://moonshot.cn/blog/kimi-k2-0905\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-09-14T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-14T00:00:00.000000+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/moonshotai/models/kimi-k2-base/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 434,\n    \"benchmark_id\": \"c-eval\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.925,\n    \"normalized_score\": 0.925,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.920573+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.920573+00:00\",\n    \"benchmark_name\": \"C-Eval\"\n  },\n  {\n    \"model_benchmark_id\": 440,\n    \"benchmark_id\": \"csimpleqa\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.776,\n    \"normalized_score\": 0.776,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Correct\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.934566+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.934566+00:00\",\n    \"benchmark_name\": \"CSimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 369,\n    \"benchmark_id\": \"evalplus\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.803,\n    \"normalized_score\": 0.803,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.796250+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.796250+00:00\",\n    \"benchmark_name\": 
\"EvalPlus\"\n  },\n  {\n    \"model_benchmark_id\": 256,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.481,\n    \"normalized_score\": 0.481,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond Avg@8\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.591508+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.591508+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 158,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.921,\n    \"normalized_score\": 0.921,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.403308+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.403308+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 367,\n    \"benchmark_id\": \"livecodebench-v6\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.263,\n    \"normalized_score\": 0.263,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.789592+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.789592+00:00\",\n    
\"benchmark_name\": \"LiveCodeBench v6\"\n  },\n  {\n    \"model_benchmark_id\": 373,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.702,\n    \"normalized_score\": 0.702,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.808795+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.808795+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 57,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.878,\n    \"normalized_score\": 0.878,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.205746+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.205746+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 161,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.692,\n    \"normalized_score\": 0.692,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.410852+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.410852+00:00\",\n    
\"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 221,\n    \"benchmark_id\": \"mmlu-redux-2.0\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.902,\n    \"normalized_score\": 0.902,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.520883+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.520883+00:00\",\n    \"benchmark_name\": \"MMLU-redux-2.0\"\n  },\n  {\n    \"model_benchmark_id\": 222,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.353,\n    \"normalized_score\": 0.353,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Correct\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.524097+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.524097+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 364,\n    \"benchmark_id\": \"supergpqa\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.447,\n    \"normalized_score\": 0.447,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.781413+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.781413+00:00\",\n    \"benchmark_name\": \"SuperGPQA\"\n  },\n  {\n    \"model_benchmark_id\": 243,\n    \"benchmark_id\": \"triviaqa\",\n    \"model_id\": \"kimi-k2-base\",\n    \"score\": 0.851,\n    \"normalized_score\": 0.851,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.566226+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.566226+00:00\",\n    \"benchmark_name\": \"TriviaQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/moonshotai/models/kimi-k2-base/model.json",
    "content": "{\n  \"model_id\": \"kimi-k2-base\",\n  \"name\": \"Kimi K2 Base\",\n  \"organization_id\": \"moonshotai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Kimi K2 base model is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained on 15.5 trillion tokens with the MuonClip optimizer, this is the foundation model before instruction tuning. It demonstrates strong performance on knowledge, reasoning, and coding benchmarks while being optimized for agentic capabilities.\",\n  \"release_date\": \"2025-07-11\",\n  \"announcement_date\": \"2025-07-11\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 1000000000000,\n  \"training_tokens\": 15500000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.moonshot.ai\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n  \"source_repo_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n  \"source_weights_link\": \"https://huggingface.co/moonshotai/Kimi-K2-Base\",\n  \"created_at\": \"2025-07-19T19:49:05.422399+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.422399+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/moonshotai/models/kimi-k2-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 676,\n    \"benchmark_id\": \"acebench\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.765,\n    \"normalized_score\": 0.765,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.408910+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.408910+00:00\",\n    \"benchmark_name\": \"AceBench\"\n  },\n  {\n    \"model_benchmark_id\": 657,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.6,\n    \"normalized_score\": 0.6,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.362819+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.362819+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 445,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.696,\n    \"normalized_score\": 0.696,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.946639+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.946639+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 677,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.495,\n    \"normalized_score\": 0.495,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.412395+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.412395+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 715,\n    \"benchmark_id\": \"autologi\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.895,\n    \"normalized_score\": 0.895,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.506457+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.506457+00:00\",\n    \"benchmark_name\": \"AutoLogi\"\n  },\n  {\n    \"model_benchmark_id\": 757,\n    \"benchmark_id\": \"cbnsl\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.956,\n    \"normalized_score\": 0.956,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.594017+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.594017+00:00\",\n    \"benchmark_name\": \"CBNSL\"\n  },\n  {\n    \"model_benchmark_id\": 709,\n    \"benchmark_id\": \"cnmo-2024\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.743,\n    \"normalized_score\": 0.743,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@16\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.489469+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.489469+00:00\",\n    \"benchmark_name\": \"CNMO 2024\"\n  },\n  {\n    \"model_benchmark_id\": 441,\n    \"benchmark_id\": \"csimpleqa\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.784,\n    \"normalized_score\": 0.784,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Correct\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.936097+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.936097+00:00\",\n    \"benchmark_name\": \"CSimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 257,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.751,\n    \"normalized_score\": 0.751,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond Avg@8\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:11.593256+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.593256+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 159,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.973,\n    \"normalized_score\": 0.973,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.405113+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.405113+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 707,\n    \"benchmark_id\": \"hmmt-2025\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.388,\n    \"normalized_score\": 0.388,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@32\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.482540+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.482540+00:00\",\n    \"benchmark_name\": \"HMMT 2025\"\n  },\n  {\n    \"model_benchmark_id\": 758,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.933,\n    \"normalized_score\": 0.933,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.598519+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.598519+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 819,\n    \"benchmark_id\": \"humaneval-er\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.811,\n    \"normalized_score\": 0.811,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.707650+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.707650+00:00\",\n    \"benchmark_name\": \"HumanEval-ER\"\n  },\n  {\n    \"model_benchmark_id\": 716,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.047,\n    \"normalized_score\": 0.047,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy (Text Only)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.510122+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.510122+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 603,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.898,\n    \"normalized_score\": 0.898,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Prompt Strict\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.247003+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.247003+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 745,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.764,\n    \"normalized_score\": 0.764,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.567525+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.567525+00:00\",\n    \"benchmark_name\": \"LiveBench\"\n  },\n  {\n    \"model_benchmark_id\": 368,\n    \"benchmark_id\": \"livecodebench-v6\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.537,\n    \"normalized_score\": 0.537,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.791826+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.791826+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v6\"\n  },\n  {\n    \"model_benchmark_id\": 493,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.974,\n    \"normalized_score\": 0.974,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.031465+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.031465+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    \"model_benchmark_id\": 59,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.895,\n    \"normalized_score\": 0.895,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.209924+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.209924+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 162,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.811,\n    \"normalized_score\": 0.811,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.412849+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.412849+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 727,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.927,\n    \"normalized_score\": 0.927,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.531649+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.531649+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 739,\n    \"benchmark_id\": \"multichallenge\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.541,\n    \"normalized_score\": 0.541,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.554319+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.554319+00:00\",\n    \"benchmark_name\": \"MultiChallenge\"\n  },\n  {\n    \"model_benchmark_id\": 639,\n    \"benchmark_id\": \"multipl-e\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.857,\n    \"normalized_score\": 0.857,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.314432+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.314432+00:00\",\n    \"benchmark_name\": \"MultiPL-E\"\n  },\n  {\n    \"model_benchmark_id\": 820,\n    \"benchmark_id\": \"musr\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.764,\n    \"normalized_score\": 0.764,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.711252+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.711252+00:00\",\n    \"benchmark_name\": \"MuSR\"\n  },\n  {\n    \"model_benchmark_id\": 638,\n    \"benchmark_id\": \"ojbench\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.271,\n    \"normalized_score\": 0.271,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.310963+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.310963+00:00\",\n    \"benchmark_name\": \"OJBench\"\n  },\n  {\n    \"model_benchmark_id\": 713,\n    \"benchmark_id\": \"polymath-en\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.651,\n    \"normalized_score\": 0.651,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@4\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.499339+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.499339+00:00\",\n    \"benchmark_name\": \"PolyMath-en\"\n  },\n  {\n    \"model_benchmark_id\": 223,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.31,\n    \"normalized_score\": 0.31,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"Correct\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.526736+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.526736+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 365,\n    \"benchmark_id\": \"supergpqa\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.572,\n    \"normalized_score\": 0.572,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.782850+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.782850+00:00\",\n    \"benchmark_name\": \"SuperGPQA\"\n  },\n  {\n    \"model_benchmark_id\": 651,\n    \"benchmark_id\": \"swe-bench-multilingual\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.473,\n    \"normalized_score\": 0.473,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Single Attempt\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.343981+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.343981+00:00\",\n    \"benchmark_name\": \"SWE-bench Multilingual\"\n  },\n  {\n    \"model_benchmark_id\": 649,\n    \"benchmark_id\": \"swe-bench-verified-(agentic-coding)\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.658,\n    \"normalized_score\": 0.658,\n    \"is_self_reported\": 
true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Single Attempt\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.333761+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.333761+00:00\",\n    \"benchmark_name\": \"SWE-bench Verified (Agentic Coding)\"\n  },\n  {\n    \"model_benchmark_id\": 648,\n    \"benchmark_id\": \"swe-bench-verified-(agentless)\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.518,\n    \"normalized_score\": 0.518,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Single Patch without Test\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.330548+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.330548+00:00\",\n    \"benchmark_name\": \"SWE-bench Verified (Agentless)\"\n  },\n  {\n    \"model_benchmark_id\": 650,\n    \"benchmark_id\": \"swe-bench-verified-(multiple-attempts)\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.716,\n    \"normalized_score\": 0.716,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Multiple Attempts with parallel test-time compute\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.339305+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.339305+00:00\",\n    \"benchmark_name\": \"SWE-bench Verified (Multiple 
Attempts)\"\n  },\n  {\n    \"model_benchmark_id\": 674,\n    \"benchmark_id\": \"tau2-airline\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.565,\n    \"normalized_score\": 0.565,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@4\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.401229+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.401229+00:00\",\n    \"benchmark_name\": \"Tau2 airline\"\n  },\n  {\n    \"model_benchmark_id\": 673,\n    \"benchmark_id\": \"tau2-retail\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.706,\n    \"normalized_score\": 0.706,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@4\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.395604+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.395604+00:00\",\n    \"benchmark_name\": \"Tau2 retail\"\n  },\n  {\n    \"model_benchmark_id\": 675,\n    \"benchmark_id\": \"tau2-telecom\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.658,\n    \"normalized_score\": 0.658,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@4\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.405145+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.405145+00:00\",\n    \"benchmark_name\": \"Tau2 telecom\"\n  },\n  {\n    \"model_benchmark_id\": 652,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.3,\n    \"normalized_score\": 0.3,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Inhouse Framework\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.348003+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.348003+00:00\",\n    \"benchmark_name\": \"Terminal-bench\"\n  },\n  {\n    \"model_benchmark_id\": 656,\n    \"benchmark_id\": \"terminus\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.25,\n    \"normalized_score\": 0.25,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.358921+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.358921+00:00\",\n    \"benchmark_name\": \"Terminus\"\n  },\n  {\n    \"model_benchmark_id\": 714,\n    \"benchmark_id\": \"zebralogic\",\n    \"model_id\": \"kimi-k2-instruct\",\n    \"score\": 0.89,\n    \"normalized_score\": 0.89,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.502879+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.502879+00:00\",\n    \"benchmark_name\": \"ZebraLogic\"\n  }\n]"
  },
  {
    "path": "data/organizations/moonshotai/models/kimi-k2-instruct/model.json",
    "content": "{\n  \"model_id\": \"kimi-k2-instruct\",\n  \"name\": \"Kimi K2 Instruct\",\n  \"organization_id\": \"moonshotai\",\n  \"fine_tuned_from_model_id\": \"kimi-k2-base\",\n  \"description\": \"Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the MuonClip optimizer, it achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities. The instruct variant is post-trained for drop-in, general-purpose chat and agentic experiences without long thinking.\",\n  \"release_date\": \"2025-07-11\",\n  \"announcement_date\": \"2025-07-11\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 1000000000000,\n  \"training_tokens\": 15500000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.moonshot.ai\",\n  \"source_playground\": \"https://kimi.com\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n  \"source_repo_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n  \"source_weights_link\": \"https://huggingface.co/moonshotai/Kimi-K2-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.875884+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.875884+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/moonshotai/models/kimi-k2-instruct-0905/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 10001,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.658,\n    \"normalized_score\": 0.658,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic Coding - Single Attempt\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"65.8% single attempt, 71.6% multiple\",\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Swe Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 10002,\n    \"benchmark_id\": \"swe-bench-multilingual\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.473,\n    \"normalized_score\": 0.473,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Agentic Coding - Single Attempt\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Swe Bench Multilingual\"\n  },\n  {\n    \"model_benchmark_id\": 10003,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.25,\n    \"normalized_score\": 0.25,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Terminus\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n 
   \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Terminal Bench\"\n  },\n  {\n    \"model_benchmark_id\": 10004,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.537,\n    \"normalized_score\": 0.537,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"v6 (Aug 24-May 25) Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Livecodebench\"\n  },\n  {\n    \"model_benchmark_id\": 10005,\n    \"benchmark_id\": \"ojbench\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.271,\n    \"normalized_score\": 0.271,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Ojbench\"\n  },\n  {\n    \"model_benchmark_id\": 10006,\n    \"benchmark_id\": \"multipl-e\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.857,\n    \"normalized_score\": 0.857,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n  
  \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Multiple\"\n  },\n  {\n    \"model_benchmark_id\": 10007,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.6,\n    \"normalized_score\": 0.6,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Aider Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 10008,\n    \"benchmark_id\": \"tau2-retail\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.706,\n    \"normalized_score\": 0.706,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@4\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 Retail\"\n  },\n  {\n    \"model_benchmark_id\": 10009,\n    \"benchmark_id\": \"tau2-airline\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.565,\n    \"normalized_score\": 0.565,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": 
\"Avg@4\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 Airline\"\n  },\n  {\n    \"model_benchmark_id\": 10010,\n    \"benchmark_id\": \"tau2-telecom\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.658,\n    \"normalized_score\": 0.658,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@4\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 Telecom\"\n  },\n  {\n    \"model_benchmark_id\": 10011,\n    \"benchmark_id\": \"acebench\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.765,\n    \"normalized_score\": 0.765,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Acebench\"\n  },\n  {\n    \"model_benchmark_id\": 10012,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.696,\n    \"normalized_score\": 0.696,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Aime 2024\"\n  },\n  {\n    \"model_benchmark_id\": 10013,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.495,\n    \"normalized_score\": 0.495,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Aime 2025\"\n  },\n  {\n    \"model_benchmark_id\": 10014,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.974,\n    \"normalized_score\": 0.974,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Math 500\"\n  },\n  {\n    \"model_benchmark_id\": 10015,\n    \"benchmark_id\": \"hmmt-2025\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.388,\n    \"normalized_score\": 0.388,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@32\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Hmmt 2025\"\n  },\n  {\n    \"model_benchmark_id\": 10016,\n    \"benchmark_id\": \"cnmo-2024\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.743,\n    \"normalized_score\": 0.743,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@16\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Cnmo 2024\"\n  },\n  {\n    \"model_benchmark_id\": 10017,\n    \"benchmark_id\": \"polymath-en\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.651,\n    \"normalized_score\": 0.651,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@4\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Polymath En\"\n  },\n  {\n    \"model_benchmark_id\": 10018,\n    \"benchmark_id\": \"zebralogic\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.89,\n    \"normalized_score\": 0.89,\n    
\"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Zebralogic\"\n  },\n  {\n    \"model_benchmark_id\": 10019,\n    \"benchmark_id\": \"autologi\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.895,\n    \"normalized_score\": 0.895,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Autologi\"\n  },\n  {\n    \"model_benchmark_id\": 10020,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.751,\n    \"normalized_score\": 0.751,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond - Avg@8\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Gpqa\"\n  },\n  {\n    \"model_benchmark_id\": 10021,\n    \"benchmark_id\": \"supergpqa\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 
0.572,\n    \"normalized_score\": 0.572,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Supergpqa\"\n  },\n  {\n    \"model_benchmark_id\": 10022,\n    \"benchmark_id\": \"hle\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.047,\n    \"normalized_score\": 0.047,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Text Only\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Hle\"\n  },\n  {\n    \"model_benchmark_id\": 10023,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.895,\n    \"normalized_score\": 0.895,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Mmlu\"\n  },\n  {\n    \"model_benchmark_id\": 10024,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": 
\"kimi-k2-instruct-0905\",\n    \"score\": 0.927,\n    \"normalized_score\": 0.927,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Mmlu Redux\"\n  },\n  {\n    \"model_benchmark_id\": 10025,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.811,\n    \"normalized_score\": 0.811,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Mmlu Pro\"\n  },\n  {\n    \"model_benchmark_id\": 10026,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.898,\n    \"normalized_score\": 0.898,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Prompt Strict\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Ifeval\"\n  },\n  {\n    \"model_benchmark_id\": 10027,\n    
\"benchmark_id\": \"multichallenge\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.541,\n    \"normalized_score\": 0.541,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Multichallenge\"\n  },\n  {\n    \"model_benchmark_id\": 10028,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.31,\n    \"normalized_score\": 0.31,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Correct\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Simpleqa\"\n  },\n  {\n    \"model_benchmark_id\": 10029,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"kimi-k2-instruct-0905\",\n    \"score\": 0.764,\n    \"normalized_score\": 0.764,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"2024/11/25 Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": 
\"Livebench\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/moonshotai/models/kimi-k2-instruct-0905/model.json",
    "content": "{\n  \"model_id\": \"kimi-k2-instruct-0905\",\n  \"name\": \"Kimi K2-Instruct-0905\",\n  \"organization_id\": \"moonshotai\",\n  \"model_family_id\": null,\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Kimi K2-Instruct-0905 is the latest, most capable version of Kimi K2, achieving state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models. This Mixture-of-Experts model features 32 billion activated parameters and 1 trillion total parameters, meticulously optimized for agentic tasks. Key features include enhanced agentic coding intelligence, extended context length to 256K tokens, and a hybrid architecture trained with MuonClip optimizer on 15.5T tokens. The model achieves 65.8% on SWE-bench Verified (single attempt), 47.3% on SWE-bench Multilingual, and excels at tool use with 70.6% on Tau2-retail. It is a reflex-grade model without long thinking, designed to act and execute complex tasks seamlessly.\",\n  \"release_date\": \"2025-09-05\",\n  \"announcement_date\": \"2025-09-05\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 1000000000000,\n  \"training_tokens\": 15500000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.moonshot.ai\",\n  \"source_playground\": \"https://kimi.moonshot.cn/\",\n  \"source_paper\": \"https://github.com/MoonshotAI/Kimi-K2/blob/main/tech_report.pdf\",\n  \"source_scorecard_blog_link\": \"https://moonshotai.github.io/Kimi-K2/\",\n  \"source_repo_link\": \"https://github.com/MoonshotAI/Kimi-K2\",\n  \"source_weights_link\": \"https://huggingface.co/MoonshotAI\",\n  \"created_at\": \"2025-09-05T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/moonshotai/organization.json",
    "content": "{\n  \"organization_id\": \"moonshotai\",\n  \"name\": \"Moonshot AI\",\n  \"website\": \"https://moonshot.cn\",\n  \"description\": \"Chinese AI company developing the Kimi series of large language models, including state-of-the-art mixture-of-experts models with long-context capabilities\",\n  \"country\": \"CN\",\n  \"created_at\": \"2025-07-19T19:49:05.419295+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/nvidia/models/llama-3.1-nemotron-70b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 24,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"llama-3.1-nemotron-70b-instruct\",\n    \"score\": 0.692,\n    \"normalized_score\": 0.692,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.133318+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.133318+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1005,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"llama-3.1-nemotron-70b-instruct\",\n    \"score\": 0.9143,\n    \"normalized_score\": 0.9143,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.099846+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.099846+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 1811,\n    \"benchmark_id\": \"gsm8k-chat\",\n    \"model_id\": \"llama-3.1-nemotron-70b-instruct\",\n    \"score\": 0.8188,\n    \"normalized_score\": 0.8188,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Chat 
evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.104394+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.104394+00:00\",\n    \"benchmark_name\": \"GSM8K Chat\"\n  },\n  {\n    \"model_benchmark_id\": 50,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"llama-3.1-nemotron-70b-instruct\",\n    \"score\": 0.8558,\n    \"normalized_score\": 0.8558,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.188734+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.188734+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 1812,\n    \"benchmark_id\": \"instruct-humaneval\",\n    \"model_id\": \"llama-3.1-nemotron-70b-instruct\",\n    \"score\": 0.7384,\n    \"normalized_score\": 0.7384,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Code evaluation (n=20)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.108307+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.108307+00:00\",\n    \"benchmark_name\": \"Instruct HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 102,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": 
\"llama-3.1-nemotron-70b-instruct\",\n    \"score\": 0.802,\n    \"normalized_score\": 0.802,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.292516+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.292516+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1810,\n    \"benchmark_id\": \"mmlu-chat\",\n    \"model_id\": \"llama-3.1-nemotron-70b-instruct\",\n    \"score\": 0.8058,\n    \"normalized_score\": 0.8058,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Chat evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.100072+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.100072+00:00\",\n    \"benchmark_name\": \"MMLU Chat\"\n  },\n  {\n    \"model_benchmark_id\": 1611,\n    \"benchmark_id\": \"mt-bench\",\n    \"model_id\": \"llama-3.1-nemotron-70b-instruct\",\n    \"score\": 0.0899,\n    \"normalized_score\": 0.0899,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Chat evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.532800+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.532800+00:00\",\n    \"benchmark_name\": \"MT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 143,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"llama-3.1-nemotron-70b-instruct\",\n    \"score\": 0.5863,\n    \"normalized_score\": 0.5863,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.363751+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.363751+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 153,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"llama-3.1-nemotron-70b-instruct\",\n    \"score\": 0.8453,\n    \"normalized_score\": 0.8453,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.390043+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.390043+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  },\n  {\n    \"model_benchmark_id\": 1809,\n    \"benchmark_id\": \"xlsum-english\",\n    \"model_id\": \"llama-3.1-nemotron-70b-instruct\",\n    \"score\": 0.3161,\n    \"normalized_score\": 0.3161,\n    \"is_self_reported\": 
true,\n    \"self_reported_source_link\": \"https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.094560+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.094560+00:00\",\n    \"benchmark_name\": \"XLSum English\"\n  }\n]"
  },
  {
    "path": "data/organizations/nvidia/models/llama-3.1-nemotron-70b-instruct/model.json",
    "content": "{\n  \"model_id\": \"llama-3.1-nemotron-70b-instruct\",\n  \"name\": \"Llama 3.1 Nemotron 70B Instruct\",\n  \"organization_id\": \"nvidia\",\n  \"fine_tuned_from_model_id\": \"llama-3.1-70b-instruct\",\n  \"description\": \"A large language model customized by NVIDIA to improve the helpfulness of LLM generated responses. It is a fine-tuned version of Llama 3.1 70B Instruct. The model was trained using RLHF (REINFORCE) with HelpSteer2-Preference prompts.\",\n  \"release_date\": \"2024-10-01\",\n  \"announcement_date\": \"2024-10-01\",\n  \"license_id\": \"llama_3_1_community_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2023-12-01\",\n  \"param_count\": 70000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/abs/2410.01257\",\n  \"source_scorecard_blog_link\": \"https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.908923+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.908923+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/nvidia/models/llama-3.1-nemotron-nano-8b-v1/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 698,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"llama-3.1-nemotron-nano-8b-v1\",\n    \"score\": 0.471,\n    \"normalized_score\": 0.471,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-nano-8b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.461794+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.461794+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1586,\n    \"benchmark_id\": \"bfcl-v2\",\n    \"model_id\": \"llama-3.1-nemotron-nano-8b-v1\",\n    \"score\": 0.636,\n    \"normalized_score\": 0.636,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-nano-8b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.454860+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.454860+00:00\",\n    \"benchmark_name\": \"BFCL v2\"\n  },\n  {\n    \"model_benchmark_id\": 327,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"llama-3.1-nemotron-nano-8b-v1\",\n    \"score\": 0.541,\n    \"normalized_score\": 0.541,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-nano-8b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond, Pass@1, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.719213+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.719213+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 627,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"llama-3.1-nemotron-nano-8b-v1\",\n    \"score\": 0.793,\n    \"normalized_score\": 0.793,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-nano-8b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Strict Accuracy, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.289960+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.289960+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 510,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"llama-3.1-nemotron-nano-8b-v1\",\n    \"score\": 0.954,\n    \"normalized_score\": 0.954,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-nano-8b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.059893+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.059893+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    \"model_benchmark_id\": 1193,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"llama-3.1-nemotron-nano-8b-v1\",\n    \"score\": 0.846,\n    \"normalized_score\": 0.846,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://build.nvidia.com/nvidia/llama-3_1-nemotron-nano-8b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, Pass@1, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.512976+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.512976+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1610,\n    \"benchmark_id\": \"mt-bench\",\n    \"model_id\": \"llama-3.1-nemotron-nano-8b-v1\",\n    \"score\": 0.81,\n    \"normalized_score\": 0.81,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-nano-8b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.530016+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.530016+00:00\",\n    \"benchmark_name\": \"MT-Bench\"\n  }\n]"
  },
  {
    "path": "data/organizations/nvidia/models/llama-3.1-nemotron-nano-8b-v1/model.json",
    "content": "{\n  \"model_id\": \"llama-3.1-nemotron-nano-8b-v1\",\n  \"name\": \"Llama 3.1 Nemotron Nano 8B V1\",\n  \"organization_id\": \"nvidia\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Llama-3.1-Nemotron-Nano-8B-v1 is a large language model (LLM) which is a derivative of Meta Llama-3.1-8B-Instruct (AKA the reference model). It is a reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling.\",\n  \"release_date\": \"2025-03-18\",\n  \"announcement_date\": \"2025-03-18\",\n  \"license_id\": \"llama_3_1_community_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2023-12-31\",\n  \"param_count\": 8000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-nano-8b-v1\",\n  \"source_paper\": \"https://arxiv.org/abs/2502.00203\",\n  \"source_scorecard_blog_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-nano-8b-v1/modelcard\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1\",\n  \"created_at\": \"2025-07-19T19:49:05.733231+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.733231+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/nvidia/models/llama-3.1-nemotron-ultra-253b-v1/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 699,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"llama-3.1-nemotron-ultra-253b-v1\",\n    \"score\": 0.725,\n    \"normalized_score\": 0.725,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.463355+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.463355+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1587,\n    \"benchmark_id\": \"bfcl-v2\",\n    \"model_id\": \"llama-3.1-nemotron-ultra-253b-v1\",\n    \"score\": 0.741,\n    \"normalized_score\": 0.741,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.456840+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.456840+00:00\",\n    \"benchmark_name\": \"BFCL v2\"\n  },\n  {\n    \"model_benchmark_id\": 328,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"llama-3.1-nemotron-ultra-253b-v1\",\n    \"score\": 0.7601,\n    \"normalized_score\": 0.7601,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.721348+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.721348+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 628,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"llama-3.1-nemotron-ultra-253b-v1\",\n    \"score\": 0.8945,\n    \"normalized_score\": 0.8945,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Strict Accuracy, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.292359+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.292359+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1143,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"llama-3.1-nemotron-ultra-253b-v1\",\n    \"score\": 0.6631,\n    \"normalized_score\": 0.6631,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.404565+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.404565+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 511,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"llama-3.1-nemotron-ultra-253b-v1\",\n    \"score\": 0.97,\n    \"normalized_score\": 0.97,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.061892+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.061892+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  }\n]"
  },
  {
    "path": "data/organizations/nvidia/models/llama-3.1-nemotron-ultra-253b-v1/model.json",
    "content": "{\n  \"model_id\": \"llama-3.1-nemotron-ultra-253b-v1\",\n  \"name\": \"Llama 3.1 Nemotron Ultra 253B v1\",\n  \"organization_id\": \"nvidia\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A 253B parameter derivative of Meta Llama 3.1 405B Instruct, developed by NVIDIA using Neural Architecture Search (NAS) and vertical compression. It underwent multi-phase post-training (SFT for Math, Code, Reasoning, Chat, Tool Calling; RL with GRPO) to enhance reasoning and instruction-following. Optimized for accuracy/efficiency tradeoff on NVIDIA GPUs. Supports 128k context.\",\n  \"release_date\": \"2025-04-07\",\n  \"announcement_date\": \"2025-04-07\",\n  \"license_id\": \"llama_3_1_community_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2023-12-01\",\n  \"param_count\": 253000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1\",\n  \"source_paper\": \"https://arxiv.org/abs/2502.00203\",\n  \"source_scorecard_blog_link\": \"https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1/modelcard\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1\",\n  \"created_at\": \"2025-07-19T19:49:05.735588+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.735588+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/nvidia/models/llama-3.3-nemotron-super-49b-v1/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 697,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"llama-3.3-nemotron-super-49b-v1\",\n    \"score\": 0.584,\n    \"normalized_score\": 0.584,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Reasoning On\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.459628+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.459628+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1461,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"llama-3.3-nemotron-super-49b-v1\",\n    \"score\": 0.883,\n    \"normalized_score\": 0.883,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score, Reasoning Off\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.113375+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.113375+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 1585,\n    \"benchmark_id\": \"bfcl-v2\",\n    \"model_id\": \"llama-3.3-nemotron-super-49b-v1\",\n    \"score\": 0.737,\n    \"normalized_score\": 0.737,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score, Reasoning On\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": 
null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.452681+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.452681+00:00\",\n    \"benchmark_name\": \"BFCL v2\"\n  },\n  {\n    \"model_benchmark_id\": 326,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"llama-3.3-nemotron-super-49b-v1\",\n    \"score\": 0.6667,\n    \"normalized_score\": 0.6667,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Reasoning On\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.717785+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.717785+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 509,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"llama-3.3-nemotron-super-49b-v1\",\n    \"score\": 0.966,\n    \"normalized_score\": 0.966,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Reasoning On\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.058280+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.058280+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    \"model_benchmark_id\": 1192,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"llama-3.3-nemotron-super-49b-v1\",\n    \"score\": 0.913,\n    \"normalized_score\": 0.913,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1, Reasoning On\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.511549+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.511549+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1609,\n    \"benchmark_id\": \"mt-bench\",\n    \"model_id\": \"llama-3.3-nemotron-super-49b-v1\",\n    \"score\": 0.917,\n    \"normalized_score\": 0.917,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score, Reasoning On\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.527840+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.527840+00:00\",\n    \"benchmark_name\": \"MT-Bench\"\n  }\n]"
  },
  {
    "path": "data/organizations/nvidia/models/llama-3.3-nemotron-super-49b-v1/model.json",
    "content": "{\n  \"model_id\": \"llama-3.3-nemotron-super-49b-v1\",\n  \"name\": \"Llama-3.3 Nemotron Super 49B v1\",\n  \"organization_id\": \"nvidia\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Llama-3.3-Nemotron-Super-49B-v1 is a large language model (LLM) derived from Meta Llama-3.3-70B-Instruct. It's post-trained for reasoning, chat, RAG, and tool calling, offering a balance between accuracy and efficiency (optimized for single H100). It underwent multi-phase post-training including SFT and RL (RLOO, RPO).\",\n  \"release_date\": \"2025-03-18\",\n  \"announcement_date\": \"2025-03-18\",\n  \"license_id\": \"llama_3_1_community_license\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2023-12-31\",\n  \"param_count\": 49900000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1\",\n  \"source_paper\": \"https://arxiv.org/abs/2502.00203\",\n  \"source_scorecard_blog_link\": \"https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1/modelcard\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1\",\n  \"created_at\": \"2025-07-19T19:49:05.730826+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.730826+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/nvidia/models/nemotron-nano-9b-v2/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 12345,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"nvidia-nemotron-nano-9b-v2\",\n    \"score\": 0.721,\n    \"normalized_score\": 0.721,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score, Reasoning On\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-04T16:07:30.482+00:00\",\n    \"updated_at\": \"2025-10-04T16:07:30.482+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 12345,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"nvidia-nemotron-nano-9b-v2\",\n    \"score\": 0.978,\n    \"normalized_score\": 0.978,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score, Reasoning On\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-04T16:07:30.482+00:00\",\n    \"updated_at\": \"2025-10-04T16:07:30.482+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    \"model_benchmark_id\": 12345,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"nvidia-nemotron-nano-9b-v2\",\n    \"score\": 0.640,\n    \"normalized_score\": 0.640,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score, Reasoning On\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-10-04T16:07:30.482+00:00\",\n    \"updated_at\": \"2025-10-04T16:07:30.482+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 12345,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"nvidia-nemotron-nano-9b-v2\",\n    \"score\": 0.711,\n    \"normalized_score\": 0.711,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score, Reasoning On\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-04T16:07:30.482+00:00\",\n    \"updated_at\": \"2025-10-04T16:07:30.482+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 12345,\n    \"benchmark_id\": \"bfcl-v3-multiturn\",\n    \"model_id\": \"nvidia-nemotron-nano-9b-v2\",\n    \"score\": 0.669,\n    \"normalized_score\": 0.669,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/modelcard\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score, Reasoning On\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-04T16:07:30.482+00:00\",\n    \"updated_at\": \"2025-10-04T16:07:30.482+00:00\",\n    \"benchmark_name\": \"BFCL v3\"\n  },\n  {\n    \"model_benchmark_id\": 12345,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"nvidia-nemotron-nano-9b-v2\",\n    \"score\": 0.903,\n    \"normalized_score\": 0.903,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/modelcard\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score, Reasoning On\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-04T16:07:30.482+00:00\",\n    \"updated_at\": \"2025-10-04T16:07:30.482+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  }\n]"
  },
  {
    "path": "data/organizations/nvidia/models/nemotron-nano-9b-v2/model.json",
    "content": "{\n  \"model_id\": \"nvidia-nemotron-nano-9b-v2\",\n  \"name\": \"Nemotron Nano 9B v2\",\n  \"organization_id\": \"nvidia\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"NVIDIA-Nemotron-Nano-9B-v2 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions to queries and tasks.\",\n  \"release_date\": \"2025-08-18\",\n  \"announcement_date\": \"2025-08-18\",\n  \"license_id\": \"nvidia_open_model_license_agreement\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2024-09\",\n  \"param_count\": 8900000000,\n  \"training_tokens\": 21100000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2\",\n  \"source_paper\": \"https://arxiv.org/abs/2508.14444\",\n  \"source_scorecard_blog_link\": \"https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/modelcard\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": \"https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2\",\n  \"created_at\": \"2025-10-02T21:51:16.835+00:00\",\n  \"updated_at\": \"2025-10-02T21:51:16.835+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/nvidia/organization.json",
    "content": "{\n  \"organization_id\": \"nvidia\",\n  \"name\": \"NVIDIA\",\n  \"website\": \"https://nvidia.com\",\n  \"description\": \"GPU and AI company\",\n  \"country\": \"US\",\n  \"created_at\": \"2025-07-19T19:49:05.728519+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.728519+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/openai/models/gpt-3.5-turbo-0125/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 963,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"gpt-3.5-turbo-0125\",\n    \"score\": 0.702,\n    \"normalized_score\": 0.702,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.025267+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.025267+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 359,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-3.5-turbo-0125\",\n    \"score\": 0.308,\n    \"normalized_score\": 0.308,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.770449+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.770449+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 815,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gpt-3.5-turbo-0125\",\n    \"score\": 0.68,\n    \"normalized_score\": 0.68,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.697970+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.697970+00:00\",\n    
\"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 429,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gpt-3.5-turbo-0125\",\n    \"score\": 0.431,\n    \"normalized_score\": 0.431,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.906977+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.906977+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 547,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"gpt-3.5-turbo-0125\",\n    \"score\": 0.0,\n    \"normalized_score\": 0.0,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.127494+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.127494+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1299,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"gpt-3.5-turbo-0125\",\n    \"score\": 0.563,\n    \"normalized_score\": 0.563,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.717321+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:13.717321+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 126,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-3.5-turbo-0125\",\n    \"score\": 0.698,\n    \"normalized_score\": 0.698,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.331664+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.331664+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 597,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gpt-3.5-turbo-0125\",\n    \"score\": 0.0,\n    \"normalized_score\": 0.0,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://example.com/benchmark-image\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.230222+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.230222+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/gpt-3.5-turbo-0125/model.json",
    "content": "{\n  \"model_id\": \"gpt-3.5-turbo-0125\",\n  \"name\": \"GPT-3.5 Turbo\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"The latest GPT-3.5 Turbo model with higher accuracy at responding in requested formats and a fix for a bug which caused a text encoding issue for non-English language function calls.\",\n  \"release_date\": \"2023-03-21\",\n  \"announcement_date\": \"2023-03-21\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2021-09-30\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/gpt-3-5-turbo\",\n  \"source_playground\": \"https://platform.openai.com/playground\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://openai.com/blog/new-models-and-developer-products-announced-at-devday\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.858492+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.858492+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/gpt-4-0613/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1917,\n    \"benchmark_id\": \"ai2-reasoning-challenge-(arc)\",\n    \"model_id\": \"gpt-4-0613\",\n    \"score\": 0.963,\n    \"normalized_score\": 0.963,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/research/gpt-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"25-shot, Grade-school multiple choice science questions (Challenge-set)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.421959+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.421959+00:00\",\n    \"benchmark_name\": \"AI2 Reasoning Challenge (ARC)\"\n  },\n  {\n    \"model_benchmark_id\": 965,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"gpt-4-0613\",\n    \"score\": 0.809,\n    \"normalized_score\": 0.809,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/research/gpt-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"3-shot, Reading comprehension & arithmetic (f1 score)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.028099+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.028099+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 362,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-4-0613\",\n    \"score\": 0.357,\n    \"normalized_score\": 0.357,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot, Commonsense reasoning\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.775863+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.775863+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 55,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"gpt-4-0613\",\n    \"score\": 0.953,\n    \"normalized_score\": 0.953,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/research/gpt-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"10-shot, Commonsense reasoning around everyday events\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.199031+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.199031+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 817,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gpt-4-0613\",\n    \"score\": 0.67,\n    \"normalized_score\": 0.67,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/research/gpt-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot, Python coding tasks\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.702020+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.702020+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1915,\n    \"benchmark_id\": \"lsat\",\n    \"model_id\": \"gpt-4-0613\",\n    \"score\": 0.88,\n    \"normalized_score\": 0.88,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/research/gpt-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Percentile score\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.413295+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.413295+00:00\",\n    \"benchmark_name\": \"LSAT\"\n  },\n  {\n    \"model_benchmark_id\": 432,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gpt-4-0613\",\n    \"score\": 0.42,\n    \"normalized_score\": 0.42,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Mathematics problem-solving\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.913379+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.913379+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1302,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"gpt-4-0613\",\n    \"score\": 0.745,\n    \"normalized_score\": 0.745,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Mathematics problem-solving\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.721873+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.721873+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 129,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-4-0613\",\n    \"score\": 0.864,\n    \"normalized_score\": 0.864,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/research/gpt-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot, Multiple-choice questions in 57 
subjects (professional & academic)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.336601+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.336601+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1916,\n    \"benchmark_id\": \"sat-math\",\n    \"model_id\": \"gpt-4-0613\",\n    \"score\": 0.89,\n    \"normalized_score\": 0.89,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/research/gpt-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Percentile estimated from reported score of 700 out of 800\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.417889+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.417889+00:00\",\n    \"benchmark_name\": \"SAT Math\"\n  },\n  {\n    \"model_benchmark_id\": 1914,\n    \"benchmark_id\": \"uniform-bar-exam\",\n    \"model_id\": \"gpt-4-0613\",\n    \"score\": 0.9,\n    \"normalized_score\": 0.9,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/research/gpt-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Percentile score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.408427+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.408427+00:00\",\n    \"benchmark_name\": \"Uniform Bar Exam\"\n  },\n  {\n    \"model_benchmark_id\": 156,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"gpt-4-0613\",\n    \"score\": 0.875,\n    \"normalized_score\": 0.875,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://openai.com/research/gpt-4\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot, Commonsense reasoning around pronoun resolution\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.396099+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.396099+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/gpt-4-0613/model.json",
    "content": "{\n  \"model_id\": \"gpt-4-0613\",\n  \"name\": \"GPT-4\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GPT-4 is a large multimodal model capable of processing both image and text inputs and generating human-like text outputs. It demonstrates human-level performance on various professional and academic benchmarks.\",\n  \"release_date\": \"2023-06-13\",\n  \"announcement_date\": \"2023-06-13\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2022-12-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/api-reference/chat\",\n  \"source_playground\": \"https://platform.openai.com/playground\",\n  \"source_paper\": \"https://arxiv.org/abs/2303.08774\",\n  \"source_scorecard_blog_link\": \"https://openai.com/research/gpt-4\",\n  \"source_repo_link\": \"https://github.com/openai/gpt-4\",\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.869531+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.869531+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/gpt-4-turbo-2024-04-09/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 966,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"gpt-4-turbo-2024-04-09\",\n    \"score\": 0.86,\n    \"normalized_score\": 0.86,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Reading comprehension & arithmetic (f1 score)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.030041+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.030041+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 363,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-4-turbo-2024-04-09\",\n    \"score\": 0.48,\n    \"normalized_score\": 0.48,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"General-Purpose Question Answering\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.777899+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.777899+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 818,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gpt-4-turbo-2024-04-09\",\n    \"score\": 0.871,\n    \"normalized_score\": 0.871,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Python coding tasks\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.703615+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.703615+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 433,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gpt-4-turbo-2024-04-09\",\n    \"score\": 0.726,\n    \"normalized_score\": 0.726,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Mathematics problem-solving\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.916360+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.916360+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1303,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"gpt-4-turbo-2024-04-09\",\n    \"score\": 0.885,\n    \"normalized_score\": 0.885,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Grade School Math Word Problems\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.723556+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.723556+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 130,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-4-turbo-2024-04-09\",\n    \"score\": 0.865,\n    \"normalized_score\": 0.865,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Multiple-choice questions in 57 subjects (professional & academic)\",\n    \"verification_provider_id\": null,\n  
  \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.337995+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.337995+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/gpt-4-turbo-2024-04-09/model.json",
    "content": "{\n  \"model_id\": \"gpt-4-turbo-2024-04-09\",\n  \"name\": \"GPT-4 Turbo\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"The latest GPT-4 model with improved performance, updated knowledge, and enhanced capabilities. It offers faster response times and more affordable pricing compared to previous versions.\",\n  \"release_date\": \"2024-04-09\",\n  \"announcement_date\": \"2024-04-09\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2023-12-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4\",\n  \"source_playground\": \"https://platform.openai.com/playground\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://openai.com/index/new-models-and-developer-products-announced-at-devday/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.872559+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.872559+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/gpt-4.1-2025-04-14/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 671,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.516,\n    \"normalized_score\": 0.516,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.389292+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.389292+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 1335,\n    \"benchmark_id\": \"aider-polyglot-edit\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.529,\n    \"normalized_score\": 0.529,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.808732+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.808732+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot Edit\"\n  },\n  {\n    \"model_benchmark_id\": 486,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.481,\n    \"normalized_score\": 0.481,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.019979+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.019979+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 1889,\n    \"benchmark_id\": \"charxiv-d\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.879,\n    \"normalized_score\": 0.879,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.330689+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.330689+00:00\",\n    \"benchmark_name\": \"CharXiv-D\"\n  },\n  {\n    \"model_benchmark_id\": 1837,\n    \"benchmark_id\": \"charxiv-r\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.567,\n    \"normalized_score\": 0.567,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.201588+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.201588+00:00\",\n    \"benchmark_name\": \"CharXiv-R\"\n  },\n  {\n    \"model_benchmark_id\": 1860,\n    \"benchmark_id\": \"collie\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.658,\n    \"normalized_score\": 0.658,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard 
benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.261360+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.261360+00:00\",\n    \"benchmark_name\": \"COLLIE\"\n  },\n  {\n    \"model_benchmark_id\": 1895,\n    \"benchmark_id\": \"complexfuncbench\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.655,\n    \"normalized_score\": 0.655,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.348011+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.348011+00:00\",\n    \"benchmark_name\": \"ComplexFuncBench\"\n  },\n  {\n    \"model_benchmark_id\": 353,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.663,\n    \"normalized_score\": 0.663,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.761405+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.761405+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1874,\n    \"benchmark_id\": \"graphwalks-bfs-<128k\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.617,\n    \"normalized_score\": 0.617,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.294683+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.294683+00:00\",\n    \"benchmark_name\": \"Graphwalks BFS <128k\"\n  },\n  {\n    \"model_benchmark_id\": 1877,\n    \"benchmark_id\": \"graphwalks-bfs->128k\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.19,\n    \"normalized_score\": 0.19,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.302353+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.302353+00:00\",\n    \"benchmark_name\": \"Graphwalks BFS >128k\"\n  },\n  {\n    \"model_benchmark_id\": 1881,\n    \"benchmark_id\": \"graphwalks-parents-<128k\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.58,\n    \"normalized_score\": 0.58,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.312231+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.312231+00:00\",\n    \"benchmark_name\": \"Graphwalks parents <128k\"\n  },\n  {\n    \"model_benchmark_id\": 1886,\n    
\"benchmark_id\": \"graphwalks-parents->128k\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.25,\n    \"normalized_score\": 0.25,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.324002+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.324002+00:00\",\n    \"benchmark_name\": \"Graphwalks parents >128k\"\n  },\n  {\n    \"model_benchmark_id\": 635,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.874,\n    \"normalized_score\": 0.874,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.304284+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.304284+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1848,\n    \"benchmark_id\": \"internal-api-instruction-following-(hard)\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.491,\n    \"normalized_score\": 0.491,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:15.230360+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.230360+00:00\",\n    \"benchmark_name\": \"Internal API instruction following (hard)\"\n  },\n  {\n    \"model_benchmark_id\": 543,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.722,\n    \"normalized_score\": 0.722,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.121168+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.121168+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 121,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.902,\n    \"normalized_score\": 0.902,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.323612+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.323612+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1483,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.873,\n    \"normalized_score\": 0.873,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.161058+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.161058+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 593,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.748,\n    \"normalized_score\": 0.748,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.222754+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.222754+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 743,\n    \"benchmark_id\": \"multichallenge\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.383,\n    \"normalized_score\": 0.383,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark (GPT-4o grader)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.561934+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.561934+00:00\",\n    \"benchmark_name\": \"MultiChallenge\"\n  },\n  {\n    \"model_benchmark_id\": 1854,\n    \"benchmark_id\": \"multichallenge-(o3-mini-grader)\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.462,\n    \"normalized_score\": 0.462,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark (o3-mini grader, see footnote [3])\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.244951+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.244951+00:00\",\n    \"benchmark_name\": \"MultiChallenge (o3-mini grader)\"\n  },\n  {\n    \"model_benchmark_id\": 1653,\n    \"benchmark_id\": \"multi-if\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.708,\n    \"normalized_score\": 0.708,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.648170+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.648170+00:00\",\n    \"benchmark_name\": \"Multi-IF\"\n  },\n  {\n    \"model_benchmark_id\": 1866,\n    \"benchmark_id\": \"openai-mrcr:-2-needle-128k\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.572,\n    \"normalized_score\": 0.572,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.275855+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.275855+00:00\",\n    \"benchmark_name\": \"OpenAI-MRCR: 2 needle 128k\"\n  
},\n  {\n    \"model_benchmark_id\": 1871,\n    \"benchmark_id\": \"openai-mrcr:-2-needle-1m\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.463,\n    \"normalized_score\": 0.463,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.286394+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.286394+00:00\",\n    \"benchmark_name\": \"OpenAI-MRCR: 2 needle 1M\"\n  },\n  {\n    \"model_benchmark_id\": 1358,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.546,\n    \"normalized_score\": 0.546,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal methodology, see source footnote [2]\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.858938+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.858938+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1780,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.494,\n    \"normalized_score\": 0.494,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg 5 runs, no custom tools/prompting (footnote [4])\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.015514+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.015514+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1766,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.68,\n    \"normalized_score\": 0.68,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg 5 runs, no custom tools/prompting (footnote [4], GPT-4o user model)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.986496+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.986496+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  },\n  {\n    \"model_benchmark_id\": 1907,\n    \"benchmark_id\": \"video-mme-(long,-no-subtitles)\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.72,\n    \"normalized_score\": 0.72,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.377204+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.377204+00:00\",\n    \"benchmark_name\": \"Video-MME (long, no subtitles)\"\n  },\n  {\n    \"model_benchmark_id\": 10011,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.464,\n    \"normalized_score\": 0.464,\n    
\"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4.1 with no tools - Competition mathematics (AIME 2025).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 10012,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.054,\n    \"normalized_score\": 0.054,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4.1 with no tools - Expert-level questions across subjects.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 10013,\n    \"benchmark_id\": \"hmmt-2025\",\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"score\": 0.289,\n    \"normalized_score\": 0.289,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4.1 with no tools - Harvard-MIT Mathematics Tournament.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    
\"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"HMMT 2025\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/openai/models/gpt-4.1-2025-04-14/model.json",
    "content": "{\n  \"model_id\": \"gpt-4.1-2025-04-14\",\n  \"name\": \"GPT-4.1\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GPT-4.1 is OpenAI's latest and most advanced flagship model, significantly improving upon GPT-4 Turbo in performance across benchmarks, speed, and cost-effectiveness.\",\n  \"release_date\": \"2025-04-14\",\n  \"announcement_date\": \"2025-04-14\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-06-01\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/gpt-4.1\",\n  \"source_playground\": \"https://platform.openai.com/playground?mode=chat&model=gpt-4.1\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://openai.com/index/gpt-4-1/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.841143+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.841143+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/gpt-4.1-mini-2025-04-14/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 667,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.347,\n    \"normalized_score\": 0.347,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.382631+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.382631+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 1331,\n    \"benchmark_id\": \"aider-polyglot-edit\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.316,\n    \"normalized_score\": 0.316,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.801113+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.801113+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot Edit\"\n  },\n  {\n    \"model_benchmark_id\": 482,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.496,\n    \"normalized_score\": 0.496,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.013761+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.013761+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 1887,\n    \"benchmark_id\": \"charxiv-d\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.884,\n    \"normalized_score\": 0.884,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.327509+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.327509+00:00\",\n    \"benchmark_name\": \"CharXiv-D\"\n  },\n  {\n    \"model_benchmark_id\": 1834,\n    \"benchmark_id\": \"charxiv-r\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.568,\n    \"normalized_score\": 0.568,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.195563+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.195563+00:00\",\n    \"benchmark_name\": \"CharXiv-R\"\n  },\n  {\n    \"model_benchmark_id\": 1857,\n    \"benchmark_id\": \"collie\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.546,\n    \"normalized_score\": 0.546,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": 
\"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.255006+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.255006+00:00\",\n    \"benchmark_name\": \"COLLIE\"\n  },\n  {\n    \"model_benchmark_id\": 1892,\n    \"benchmark_id\": \"complexfuncbench\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.493,\n    \"normalized_score\": 0.493,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.339307+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.339307+00:00\",\n    \"benchmark_name\": \"ComplexFuncBench\"\n  },\n  {\n    \"model_benchmark_id\": 348,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.65,\n    \"normalized_score\": 0.65,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.752534+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.752534+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1872,\n    \"benchmark_id\": \"graphwalks-bfs-<128k\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.617,\n    \"normalized_score\": 0.617,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.289789+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.289789+00:00\",\n    \"benchmark_name\": \"Graphwalks BFS <128k\"\n  },\n  {\n    \"model_benchmark_id\": 1875,\n    \"benchmark_id\": \"graphwalks-bfs->128k\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.15,\n    \"normalized_score\": 0.15,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.298708+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.298708+00:00\",\n    \"benchmark_name\": \"Graphwalks BFS >128k\"\n  },\n  {\n    \"model_benchmark_id\": 1878,\n    \"benchmark_id\": \"graphwalks-parents-<128k\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.605,\n    \"normalized_score\": 0.605,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.306151+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.306151+00:00\",\n    \"benchmark_name\": \"Graphwalks parents <128k\"\n  },\n  {\n    
\"model_benchmark_id\": 1884,\n    \"benchmark_id\": \"graphwalks-parents->128k\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.11,\n    \"normalized_score\": 0.11,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.319823+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.319823+00:00\",\n    \"benchmark_name\": \"Graphwalks parents >128k\"\n  },\n  {\n    \"model_benchmark_id\": 632,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.841,\n    \"normalized_score\": 0.841,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.299050+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.299050+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1845,\n    \"benchmark_id\": \"internal-api-instruction-following-(hard)\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.451,\n    \"normalized_score\": 0.451,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.225405+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.225405+00:00\",\n    \"benchmark_name\": \"Internal API instruction following (hard)\"\n  },\n  {\n    \"model_benchmark_id\": 539,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.731,\n    \"normalized_score\": 0.731,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.114367+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.114367+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 117,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.875,\n    \"normalized_score\": 0.875,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.317652+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.317652+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1481,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.785,\n    \"normalized_score\": 0.785,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.157799+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.157799+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 590,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.727,\n    \"normalized_score\": 0.727,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.217019+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.217019+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 740,\n    \"benchmark_id\": \"multichallenge\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.358,\n    \"normalized_score\": 0.358,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark (GPT-4o grader)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.555824+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.555824+00:00\",\n    \"benchmark_name\": \"MultiChallenge\"\n  },\n  {\n    \"model_benchmark_id\": 1851,\n    \"benchmark_id\": \"multichallenge-(o3-mini-grader)\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.422,\n    \"normalized_score\": 
0.422,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark (o3-mini grader, see footnote [3])\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.239021+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.239021+00:00\",\n    \"benchmark_name\": \"MultiChallenge (o3-mini grader)\"\n  },\n  {\n    \"model_benchmark_id\": 1650,\n    \"benchmark_id\": \"multi-if\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.67,\n    \"normalized_score\": 0.67,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.643303+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.643303+00:00\",\n    \"benchmark_name\": \"Multi-IF\"\n  },\n  {\n    \"model_benchmark_id\": 1863,\n    \"benchmark_id\": \"openai-mrcr:-2-needle-128k\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.472,\n    \"normalized_score\": 0.472,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.270008+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.270008+00:00\",\n    
\"benchmark_name\": \"OpenAI-MRCR: 2 needle 128k\"\n  },\n  {\n    \"model_benchmark_id\": 1869,\n    \"benchmark_id\": \"openai-mrcr:-2-needle-1m\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.333,\n    \"normalized_score\": 0.333,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.282718+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.282718+00:00\",\n    \"benchmark_name\": \"OpenAI-MRCR: 2 needle 1M\"\n  },\n  {\n    \"model_benchmark_id\": 1355,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.236,\n    \"normalized_score\": 0.236,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal methodology, see source footnote [2]\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.852737+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.852737+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1776,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.36,\n    \"normalized_score\": 0.36,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg 5 runs, no custom tools/prompting (footnote 
[4])\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.007636+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.007636+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1762,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.558,\n    \"normalized_score\": 0.558,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg 5 runs, no custom tools/prompting (footnote [4], GPT-4o user model)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.978528+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.978528+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  },\n  {\n    \"model_benchmark_id\": 10014,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.402,\n    \"normalized_score\": 0.402,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4.1 mini with no tools - Competition mathematics (AIME 2025).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 10015,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": 
\"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.037,\n    \"normalized_score\": 0.037,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4.1 mini with no tools - Expert-level questions across subjects.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 10016,\n    \"benchmark_id\": \"hmmt-2025\",\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"score\": 0.35,\n    \"normalized_score\": 0.35,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4.1 mini with no tools - Harvard-MIT Mathematics Tournament.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"HMMT 2025\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/openai/models/gpt-4.1-mini-2025-04-14/model.json",
    "content": "{\n  \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n  \"name\": \"GPT-4.1 mini\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GPT-4.1 mini provides a balance between intelligence, speed, and cost. It's a significant leap in small model performance, even beating GPT-4o in many benchmarks while reducing latency and cost.\",\n  \"release_date\": \"2025-04-14\",\n  \"announcement_date\": \"2025-04-14\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-05-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/gpt-4.1-mini\",\n  \"source_playground\": \"https://platform.openai.com/playground?mode=chat&model=gpt-4.1-mini\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://openai.com/index/gpt-4-1/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.821382+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.821382+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/gpt-4.1-nano-2025-04-14/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 669,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.098,\n    \"normalized_score\": 0.098,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.385924+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.385924+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 1333,\n    \"benchmark_id\": \"aider-polyglot-edit\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.062,\n    \"normalized_score\": 0.062,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.804864+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.804864+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot Edit\"\n  },\n  {\n    \"model_benchmark_id\": 484,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.294,\n    \"normalized_score\": 0.294,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.016856+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.016856+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 1888,\n    \"benchmark_id\": \"charxiv-d\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.739,\n    \"normalized_score\": 0.739,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.329021+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.329021+00:00\",\n    \"benchmark_name\": \"CharXiv-D\"\n  },\n  {\n    \"model_benchmark_id\": 1836,\n    \"benchmark_id\": \"charxiv-r\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.405,\n    \"normalized_score\": 0.405,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.199274+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.199274+00:00\",\n    \"benchmark_name\": \"CharXiv-R\"\n  },\n  {\n    \"model_benchmark_id\": 1858,\n    \"benchmark_id\": \"collie\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.425,\n    \"normalized_score\": 0.425,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.257208+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.257208+00:00\",\n    \"benchmark_name\": \"COLLIE\"\n  },\n  {\n    \"model_benchmark_id\": 1893,\n    \"benchmark_id\": \"complexfuncbench\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.057,\n    \"normalized_score\": 0.057,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.341699+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.341699+00:00\",\n    \"benchmark_name\": \"ComplexFuncBench\"\n  },\n  {\n    \"model_benchmark_id\": 350,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.503,\n    \"normalized_score\": 0.503,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.756178+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.756178+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1873,\n    \"benchmark_id\": \"graphwalks-bfs-<128k\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.25,\n    \"normalized_score\": 0.25,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.291775+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.291775+00:00\",\n    \"benchmark_name\": \"Graphwalks BFS <128k\"\n  },\n  {\n    \"model_benchmark_id\": 1876,\n    \"benchmark_id\": \"graphwalks-bfs->128k\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.029,\n    \"normalized_score\": 0.029,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.300453+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.300453+00:00\",\n    \"benchmark_name\": \"Graphwalks BFS >128k\"\n  },\n  {\n    \"model_benchmark_id\": 1879,\n    \"benchmark_id\": \"graphwalks-parents-<128k\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.094,\n    \"normalized_score\": 0.094,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.308330+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.308330+00:00\",\n    \"benchmark_name\": \"Graphwalks parents <128k\"\n  },\n  {\n    \"model_benchmark_id\": 1885,\n    \"benchmark_id\": \"graphwalks-parents->128k\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.056,\n    \"normalized_score\": 0.056,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.322097+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.322097+00:00\",\n    \"benchmark_name\": \"Graphwalks parents >128k\"\n  },\n  {\n    \"model_benchmark_id\": 633,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.745,\n    \"normalized_score\": 0.745,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.300562+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.300562+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1846,\n    \"benchmark_id\": \"internal-api-instruction-following-(hard)\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.316,\n    \"normalized_score\": 0.316,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.227248+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.227248+00:00\",\n    \"benchmark_name\": \"Internal API instruction following (hard)\"\n  },\n  {\n    \"model_benchmark_id\": 541,\n    \"benchmark_id\": \"mathvista\",\n 
   \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.562,\n    \"normalized_score\": 0.562,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.117553+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.117553+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 118,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.801,\n    \"normalized_score\": 0.801,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.319012+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.319012+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1482,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.669,\n    \"normalized_score\": 0.669,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.159419+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.159419+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    
\"model_benchmark_id\": 592,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.554,\n    \"normalized_score\": 0.554,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.220951+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.220951+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 741,\n    \"benchmark_id\": \"multichallenge\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.15,\n    \"normalized_score\": 0.15,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark (GPT-4o grader)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.557571+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.557571+00:00\",\n    \"benchmark_name\": \"MultiChallenge\"\n  },\n  {\n    \"model_benchmark_id\": 1852,\n    \"benchmark_id\": \"multichallenge-(o3-mini-grader)\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.311,\n    \"normalized_score\": 0.311,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark (o3-mini grader, see footnote [3])\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:15.241054+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.241054+00:00\",\n    \"benchmark_name\": \"MultiChallenge (o3-mini grader)\"\n  },\n  {\n    \"model_benchmark_id\": 1651,\n    \"benchmark_id\": \"multi-if\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.572,\n    \"normalized_score\": 0.572,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.645047+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.645047+00:00\",\n    \"benchmark_name\": \"Multi-IF\"\n  },\n  {\n    \"model_benchmark_id\": 1864,\n    \"benchmark_id\": \"openai-mrcr:-2-needle-128k\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.366,\n    \"normalized_score\": 0.366,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.272341+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.272341+00:00\",\n    \"benchmark_name\": \"OpenAI-MRCR: 2 needle 128k\"\n  },\n  {\n    \"model_benchmark_id\": 1870,\n    \"benchmark_id\": \"openai-mrcr:-2-needle-1m\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.12,\n    \"normalized_score\": 0.12,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Internal benchmark\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.284545+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.284545+00:00\",\n    \"benchmark_name\": \"OpenAI-MRCR: 2 needle 1M\"\n  },\n  {\n    \"model_benchmark_id\": 1778,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.14,\n    \"normalized_score\": 0.14,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg 5 runs, no custom tools/prompting (footnote [4])\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.011934+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.011934+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1764,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"score\": 0.226,\n    \"normalized_score\": 0.226,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg 5 runs, no custom tools/prompting (footnote [4], GPT-4o user model)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.982239+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.982239+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/gpt-4.1-nano-2025-04-14/model.json",
    "content": "{\n  \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n  \"name\": \"GPT-4.1 nano\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GPT-4.1 nano is OpenAI's fastest and cheapest model available in the GPT-4.1 family. It delivers exceptional performance at a small size with its 1 million token context window. Ideal for tasks like classification or autocompletion.\",\n  \"release_date\": \"2025-04-14\",\n  \"announcement_date\": \"2025-04-14\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-05-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/gpt-4.1-nano\",\n  \"source_playground\": \"https://platform.openai.com/playground?mode=chat&model=gpt-4.1-nano\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://openai.com/index/gpt-4-1/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.827978+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.827978+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/gpt-4.5/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1337,\n    \"benchmark_id\": \"aider-polyglot-edit\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.449,\n    \"normalized_score\": 0.449,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.811839+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.811839+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot Edit\"\n  },\n  {\n    \"model_benchmark_id\": 489,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.367,\n    \"normalized_score\": 0.367,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.024273+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.024273+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 1891,\n    \"benchmark_id\": \"charxiv-d\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.9,\n    \"normalized_score\": 0.9,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.335527+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.335527+00:00\",\n    
\"benchmark_name\": \"CharXiv-D\"\n  },\n  {\n    \"model_benchmark_id\": 1839,\n    \"benchmark_id\": \"charxiv-r\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.554,\n    \"normalized_score\": 0.554,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.204875+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.204875+00:00\",\n    \"benchmark_name\": \"CharXiv-R\"\n  },\n  {\n    \"model_benchmark_id\": 1862,\n    \"benchmark_id\": \"collie\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.723,\n    \"normalized_score\": 0.723,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.265565+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.265565+00:00\",\n    \"benchmark_name\": \"COLLIE\"\n  },\n  {\n    \"model_benchmark_id\": 1897,\n    \"benchmark_id\": \"complexfuncbench\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.63,\n    \"normalized_score\": 0.63,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.351430+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.351430+00:00\",\n    
\"benchmark_name\": \"ComplexFuncBench\"\n  },\n  {\n    \"model_benchmark_id\": 357,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.695,\n    \"normalized_score\": 0.695,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy (Diamond)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.767414+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.767414+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1906,\n    \"benchmark_id\": \"graphwalks-bfs-<128k\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.723,\n    \"normalized_score\": 0.723,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.372855+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.372855+00:00\",\n    \"benchmark_name\": \"Graphwalks BFS <128k\"\n  },\n  {\n    \"model_benchmark_id\": 1883,\n    \"benchmark_id\": \"graphwalks-parents-<128k\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.726,\n    \"normalized_score\": 0.726,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.315697+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:15.315697+00:00\",\n    \"benchmark_name\": \"Graphwalks parents <128k\"\n  },\n  {\n    \"model_benchmark_id\": 1015,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.97,\n    \"normalized_score\": 0.97,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-4-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Answer accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.114869+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.114869+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 813,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.88,\n    \"normalized_score\": 0.88,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-4-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.694244+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.694244+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 637,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.882,\n    \"normalized_score\": 0.882,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.307682+00:00\",\n 
   \"updated_at\": \"2025-07-19T19:56:12.307682+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1850,\n    \"benchmark_id\": \"internal-api-instruction-following-(hard)\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.54,\n    \"normalized_score\": 0.54,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.234022+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.234022+00:00\",\n    \"benchmark_name\": \"Internal API instruction following (hard)\"\n  },\n  {\n    \"model_benchmark_id\": 545,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.723,\n    \"normalized_score\": 0.723,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.124115+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.124115+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 124,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.908,\n    \"normalized_score\": 0.908,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Multiple-choice accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:11.328688+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.328688+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1485,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.851,\n    \"normalized_score\": 0.851,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.164320+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.164320+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 595,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.752,\n    \"normalized_score\": 0.752,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.226731+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.226731+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 744,\n    \"benchmark_id\": \"multichallenge\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.438,\n    \"normalized_score\": 0.438,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.563438+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.563438+00:00\",\n    \"benchmark_name\": \"MultiChallenge\"\n  },\n  {\n    \"model_benchmark_id\": 1856,\n    \"benchmark_id\": \"multichallenge-(o3-mini-grader)\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.501,\n    \"normalized_score\": 0.501,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.249385+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.249385+00:00\",\n    \"benchmark_name\": \"MultiChallenge (o3-mini grader)\"\n  },\n  {\n    \"model_benchmark_id\": 1655,\n    \"benchmark_id\": \"multi-if\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.708,\n    \"normalized_score\": 0.708,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.652033+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.652033+00:00\",\n    \"benchmark_name\": \"Multi-IF\"\n  },\n  {\n    \"model_benchmark_id\": 1868,\n    \"benchmark_id\": \"openai-mrcr:-2-needle-128k\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.385,\n    \"normalized_score\": 0.385,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": 
null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.279311+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.279311+00:00\",\n    \"benchmark_name\": \"OpenAI-MRCR: 2 needle 128k\"\n  },\n  {\n    \"model_benchmark_id\": 240,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.625,\n    \"normalized_score\": 0.625,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-4-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.559622+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.559622+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 1360,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.38,\n    \"normalized_score\": 0.38,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Success rate\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.863719+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.863719+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1900,\n    \"benchmark_id\": \"swe-lancer\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.373,\n    \"normalized_score\": 0.373,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Success rate ($186K equivalent)\",\n    \"verification_provider_id\": null,\n   
 \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.358579+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.358579+00:00\",\n    \"benchmark_name\": \"SWE-Lancer\"\n  },\n  {\n    \"model_benchmark_id\": 1903,\n    \"benchmark_id\": \"swe-lancer-(ic-diamond-subset)\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.174,\n    \"normalized_score\": 0.174,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Success rate ($41K equivalent)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.365353+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.365353+00:00\",\n    \"benchmark_name\": \"SWE-Lancer (IC-Diamond subset)\"\n  },\n  {\n    \"model_benchmark_id\": 1782,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.5,\n    \"normalized_score\": 0.5,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.020093+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.020093+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1768,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"gpt-4.5\",\n    \"score\": 0.684,\n    \"normalized_score\": 0.684,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.989887+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.989887+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/gpt-4.5/model.json",
    "content": "{\n  \"model_id\": \"gpt-4.5\",\n  \"name\": \"GPT-4.5\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GPT-4.5 is OpenAI's most advanced model, offering improved reasoning, coding, and creative capabilities with faster performance and longer context handling than GPT-4. It features enhanced instruction following, reduced hallucinations, and better factual accuracy.\",\n  \"release_date\": \"2025-02-27\",\n  \"announcement_date\": \"2025-02-27\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/gpt-4-5#gpt-4-5\",\n  \"source_playground\": \"https://platform.openai.com/playground\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://openai.com/index/introducing-gpt-4-5/\",\n  \"source_repo_link\": \"https://github.com/openai\",\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.852855+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.852855+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/gpt-4o-2024-05-13/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 962,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"gpt-4o-2024-05-13\",\n    \"score\": 0.834,\n    \"normalized_score\": 0.834,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"F1 Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.023727+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.023727+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 352,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-4o-2024-05-13\",\n    \"score\": 0.536,\n    \"normalized_score\": 0.536,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.759539+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.759539+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 811,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gpt-4o-2024-05-13\",\n    \"score\": 0.902,\n    \"normalized_score\": 0.902,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.689969+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.689969+00:00\",\n    \"benchmark_name\": 
\"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 427,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gpt-4o-2024-05-13\",\n    \"score\": 0.766,\n    \"normalized_score\": 0.766,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.903446+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.903446+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 542,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"gpt-4o-2024-05-13\",\n    \"score\": 0.638,\n    \"normalized_score\": 0.638,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.119289+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.119289+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1297,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"gpt-4o-2024-05-13\",\n    \"score\": 0.905,\n    \"normalized_score\": 0.905,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.714155+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.714155+00:00\",\n    \"benchmark_name\": 
\"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 120,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-4o-2024-05-13\",\n    \"score\": 0.887,\n    \"normalized_score\": 0.887,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.322163+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.322163+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 219,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gpt-4o-2024-05-13\",\n    \"score\": 0.726,\n    \"normalized_score\": 0.726,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.515262+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.515262+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/gpt-4o-2024-05-13/model.json",
    "content": "{\n  \"model_id\": \"gpt-4o-2024-05-13\",\n  \"name\": \"GPT-4o\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GPT-4o ('o' for 'omni') is a multimodal AI model that accepts text, audio, image, and video inputs, and generates text, audio, and image outputs. It matches GPT-4 Turbo performance on text and code, with improvements in non-English languages, vision, and audio understanding.\",\n  \"release_date\": \"2024-05-13\",\n  \"announcement_date\": \"2024-05-13\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/api-reference\",\n  \"source_playground\": \"https://chat.openai.com/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://openai.com/index/hello-gpt-4o/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.838358+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.838358+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/gpt-4o-2024-08-06/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1908,\n    \"benchmark_id\": \"activitynet\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.619,\n    \"normalized_score\": 0.619,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test set evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.381219+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.381219+00:00\",\n    \"benchmark_name\": \"ActivityNet\"\n  },\n  {\n    \"model_benchmark_id\": 1262,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.942,\n    \"normalized_score\": 0.942,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test set evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.646808+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.646808+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 672,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.307,\n    \"normalized_score\": 0.307,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.391433+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.391433+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 1336,\n    \"benchmark_id\": \"aider-polyglot-edit\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.182,\n    \"normalized_score\": 0.182,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.810263+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.810263+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot Edit\"\n  },\n  {\n    \"model_benchmark_id\": 488,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.131,\n    \"normalized_score\": 0.131,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.022775+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.022775+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 875,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.857,\n    \"normalized_score\": 0.857,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test set evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.824155+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.824155+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 1890,\n    \"benchmark_id\": \"charxiv-d\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.853,\n    \"normalized_score\": 0.853,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.333294+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.333294+00:00\",\n    \"benchmark_name\": \"CharXiv-D\"\n  },\n  {\n    \"model_benchmark_id\": 1838,\n    \"benchmark_id\": \"charxiv-r\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.588,\n    \"normalized_score\": 0.588,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o without thinking mode - Scientific figure reasoning and interpretation.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.203285+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.203285+00:00\",\n    \"benchmark_name\": \"CharXiv-R\"\n  },\n  {\n    \"model_benchmark_id\": 1861,\n    \"benchmark_id\": \"collie\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.61,\n    \"normalized_score\": 0.61,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o without thinking mode - Instruction-following in freeform writing.\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.262884+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.262884+00:00\",\n    \"benchmark_name\": \"COLLIE\"\n  },\n  {\n    \"model_benchmark_id\": 1867,\n    \"benchmark_id\": \"tau2-airline\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.455,\n    \"normalized_score\": 0.455,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o without thinking mode - Function calling benchmark (airline domain).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 airline\"\n  },\n  {\n    \"model_benchmark_id\": 1868,\n    \"benchmark_id\": \"tau2-retail\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.634,\n    \"normalized_score\": 0.634,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o without thinking mode - Function calling benchmark (retail domain).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 retail\"\n  },\n  {\n    \"model_benchmark_id\": 1869,\n    \"benchmark_id\": \"tau2-telecom\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.235,\n    \"normalized_score\": 0.235,\n    \"is_self_reported\": 
true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o without thinking mode - Function calling benchmark (telecom domain).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 telecom\"\n  },\n  {\n    \"model_benchmark_id\": 1870,\n    \"benchmark_id\": \"mmmu-pro\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.599,\n    \"normalized_score\": 0.599,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o without thinking mode - Graduate-level visual problem-solving with advanced multimodal reasoning.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMMU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1871,\n    \"benchmark_id\": \"videommmu\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.612,\n    \"normalized_score\": 0.612,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o without thinking mode - Video-based multimodal reasoning (max frame 256).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": 
\"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"VideoMMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1872,\n    \"benchmark_id\": \"erqa\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.352,\n    \"normalized_score\": 0.352,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o without thinking mode - Multimodal spatial reasoning.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"ERQA\"\n  },\n  {\n    \"model_benchmark_id\": 1896,\n    \"benchmark_id\": \"complexfuncbench\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.665,\n    \"normalized_score\": 0.665,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.349679+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.349679+00:00\",\n    \"benchmark_name\": \"ComplexFuncBench\"\n  },\n  {\n    \"model_benchmark_id\": 900,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.928,\n    \"normalized_score\": 0.928,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test set evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.873722+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.873722+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 926,\n    \"benchmark_id\": \"egoschema\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.722,\n    \"normalized_score\": 0.722,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/hello-gpt-4o/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test set evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.935728+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.935728+00:00\",\n    \"benchmark_name\": \"EgoSchema\"\n  },\n  {\n    \"model_benchmark_id\": 355,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.701,\n    \"normalized_score\": 0.701,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o - Diamond no thinking no tools\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.764329+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.764329+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1905,\n    \"benchmark_id\": \"graphwalks-bfs-<128k\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.417,\n    \"normalized_score\": 0.417,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.370259+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.370259+00:00\",\n    \"benchmark_name\": \"Graphwalks BFS <128k\"\n  },\n  {\n    \"model_benchmark_id\": 1882,\n    \"benchmark_id\": \"graphwalks-parents-<128k\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.354,\n    \"normalized_score\": 0.354,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.314044+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.314044+00:00\",\n    \"benchmark_name\": \"Graphwalks parents <128k\"\n  },\n  {\n    \"model_benchmark_id\": 636,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.81,\n    \"normalized_score\": 0.81,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.306083+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.306083+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1849,\n    \"benchmark_id\": \"internal-api-instruction-following-(hard)\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.292,\n    \"normalized_score\": 0.292,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.232334+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.232334+00:00\",\n    \"benchmark_name\": \"Internal API instruction following (hard)\"\n  },\n  {\n    \"model_benchmark_id\": 544,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.614,\n    \"normalized_score\": 0.614,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.122558+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.122558+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 122,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.857,\n    \"normalized_score\": 0.857,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.325082+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.325082+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 220,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.747,\n    \"normalized_score\": 0.747,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot CoT\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.517058+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.517058+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1484,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.814,\n    \"normalized_score\": 0.814,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.162717+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.162717+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 594,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.722,\n    \"normalized_score\": 0.722,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o without thinking mode - College-level visual problem-solving with multimodal reasoning.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.224513+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.224513+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1855,\n    \"benchmark_id\": \"multichallenge-(o3-mini-grader)\",\n    \"model_id\": 
\"gpt-4o-2024-08-06\",\n    \"score\": 0.399,\n    \"normalized_score\": 0.399,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.246431+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.246431+00:00\",\n    \"benchmark_name\": \"MultiChallenge (o3-mini grader)\"\n  },\n  {\n    \"model_benchmark_id\": 1654,\n    \"benchmark_id\": \"multi-if\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.609,\n    \"normalized_score\": 0.609,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.650416+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.650416+00:00\",\n    \"benchmark_name\": \"Multi-IF\"\n  },\n  {\n    \"model_benchmark_id\": 1867,\n    \"benchmark_id\": \"openai-mrcr:-2-needle-128k\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.319,\n    \"normalized_score\": 0.319,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.277538+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.277538+00:00\",\n    \"benchmark_name\": \"OpenAI-MRCR: 2 needle 128k\"\n  },\n  {\n    
\"model_benchmark_id\": 239,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.382,\n    \"normalized_score\": 0.382,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-4-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.557852+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.557852+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 1359,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.332,\n    \"normalized_score\": 0.332,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.861280+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.861280+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1899,\n    \"benchmark_id\": \"swe-lancer\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.326,\n    \"normalized_score\": 0.326,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"percentage score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.356738+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:15.356738+00:00\",\n    \"benchmark_name\": \"SWE-Lancer\"\n  },\n  {\n    \"model_benchmark_id\": 1902,\n    \"benchmark_id\": \"swe-lancer-(ic-diamond-subset)\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.124,\n    \"normalized_score\": 0.124,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"percentage score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.363614+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.363614+00:00\",\n    \"benchmark_name\": \"SWE-Lancer (IC-Diamond subset)\"\n  },\n  {\n    \"model_benchmark_id\": 1781,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.428,\n    \"normalized_score\": 0.428,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.017725+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.017725+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1767,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.603,\n    \"normalized_score\": 0.603,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.988086+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.988086+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  },\n  {\n    \"model_benchmark_id\": 2003,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.053,\n    \"normalized_score\": 0.053,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o without thinking mode (no tools) - Full set of expert-level questions across subjects.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 2005,\n    \"benchmark_id\": \"scale-multichallenge\",\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"score\": 0.403,\n    \"normalized_score\": 0.403,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4o without thinking mode - Multi-turn instruction following benchmark.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Scale MultiChallenge\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/openai/models/gpt-4o-2024-08-06/model.json",
    "content": "{\n  \"model_id\": \"gpt-4o-2024-08-06\",\n  \"name\": \"GPT-4o\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GPT-4o ('o' for 'omni') is a multimodal AI model that accepts text, audio, image, and video inputs, and generates text, audio, and image outputs. It matches GPT-4 Turbo performance on text and code, with improvements in non-English languages, vision, and audio understanding.\",\n  \"release_date\": \"2024-08-06\",\n  \"announcement_date\": \"2024-08-06\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/api-reference\",\n  \"source_playground\": \"https://chat.openai.com/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://openai.com/index/hello-gpt-4o/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.847621+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.847621+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/gpt-4o-mini-2024-07-18/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 964,\n    \"benchmark_id\": \"drop\",\n    \"model_id\": \"gpt-4o-mini-2024-07-18\",\n    \"score\": 0.797,\n    \"normalized_score\": 0.797,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o-mini-announcement\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"F1 Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.026741+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.026741+00:00\",\n    \"benchmark_name\": \"DROP\"\n  },\n  {\n    \"model_benchmark_id\": 361,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-4o-mini-2024-07-18\",\n    \"score\": 0.402,\n    \"normalized_score\": 0.402,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o-mini-announcement\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.774361+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.774361+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 816,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gpt-4o-mini-2024-07-18\",\n    \"score\": 0.872,\n    \"normalized_score\": 0.872,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o-mini-announcement\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.700095+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.700095+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 431,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gpt-4o-mini-2024-07-18\",\n    \"score\": 0.702,\n    \"normalized_score\": 0.702,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o-mini-announcement\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.911917+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.911917+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 548,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"gpt-4o-mini-2024-07-18\",\n    \"score\": 0.567,\n    \"normalized_score\": 0.567,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o-mini-announcement\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.128984+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.128984+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1301,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"gpt-4o-mini-2024-07-18\",\n    \"score\": 0.87,\n    \"normalized_score\": 0.87,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o-mini-announcement\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:13.720445+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.720445+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 128,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-4o-mini-2024-07-18\",\n    \"score\": 0.82,\n    \"normalized_score\": 0.82,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o-mini-announcement\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.335061+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.335061+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 598,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gpt-4o-mini-2024-07-18\",\n    \"score\": 0.594,\n    \"normalized_score\": 0.594,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/blog/gpt-4o-mini-announcement\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.232157+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.232157+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1363,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"gpt-4o-mini-2024-07-18\",\n    \"score\": 0.087,\n    \"normalized_score\": 0.087,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass Rate\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.870038+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.870038+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/gpt-4o-mini-2024-07-18/model.json",
    "content": "{\n  \"model_id\": \"gpt-4o-mini-2024-07-18\",\n  \"name\": \"GPT-4o mini\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GPT-4o mini is OpenAI's latest cost-efficient small model, designed to make AI intelligence more accessible and affordable. It excels in textual intelligence and multimodal reasoning, outperforming previous models like GPT-3.5 Turbo. With a context window of 128K tokens and support for text and vision, it offers low-cost, real-time applications such as customer support chatbots. Priced at 15 cents per million input tokens and 60 cents per million output tokens, it is significantly cheaper than its predecessors. Safety is prioritized with built-in measures and improved resistance to security threats.\",\n  \"release_date\": \"2024-07-18\",\n  \"announcement_date\": \"2024-07-18\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2023-10-01\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/api-reference\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.866393+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.866393+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/gpt-5-2025-08-07/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 9002,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.749,\n    \"normalized_score\": 0.749,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode enabled (up to 128K tokens) with enhanced reasoning capabilities and iterative problem-solving approach.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 9004,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.88,\n    \"normalized_score\": 0.88,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode enabled (up to 128K tokens) with step-by-step reasoning and multi-language code understanding.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 10027,\n    \"benchmark_id\": \"swe-lancer-(ic-diamond-subset)\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 1.0,\n    \"normalized_score\": 1.0,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 - IC 
SWE Diamond Freelance Coding Tasks (earnings-based evaluation).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-Lancer (IC-Diamond subset)\"\n  },\n  {\n    \"model_benchmark_id\": 9020,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.946,\n    \"normalized_score\": 0.946,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 standard with thinking mode enabled (no tools) - competition mathematics.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 9009,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.842,\n    \"normalized_score\": 0.842,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 with thinking mode - College-level visual problem-solving with multimodal reasoning.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 9006,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n 
   \"score\": 0.925,\n    \"normalized_score\": 0.925,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Standard benchmark across multiple academic subjects with comprehensive knowledge evaluation.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 9007,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.934,\n    \"normalized_score\": 0.934,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Code generation benchmark with function completion tasks in Python.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 9008,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.847,\n    \"normalized_score\": 0.847,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode enabled with step-by-step mathematical problem solving and verification.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 9013,\n    \"benchmark_id\": \"healthbench-hard\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.016,\n    \"normalized_score\": 0.016,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode enabled for medical hallucination detection. Measured inaccuracies on challenging healthcare conversations.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"HealthBench Hard\"\n  },\n  {\n    \"model_benchmark_id\": 9024,\n    \"benchmark_id\": \"frontiermath\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.263,\n    \"normalized_score\": 0.263,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 standard with thinking mode enabled (with python tool only) - FrontierMath Tier 1-3 expert-level mathematics.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"FrontierMath\"\n  },\n  {\n    \"model_benchmark_id\": 9028,\n    \"benchmark_id\": \"hmmt-2025\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.933,\n    \"normalized_score\": 0.933,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 standard with thinking mode enabled (no tools) - Harvard-MIT Mathematics Tournament.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"HMMT 2025\"\n  },\n  {\n    \"model_benchmark_id\": 9032,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.857,\n    \"normalized_score\": 0.857,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 - Diamond thinking no tools\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9037,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.248,\n    \"normalized_score\": 0.248,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 standard with thinking mode (no tools) - Full set of expert-level questions across subjects.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    
\"model_benchmark_id\": 9041,\n    \"benchmark_id\": \"scale-multichallenge\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.696,\n    \"normalized_score\": 0.696,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 with thinking mode enabled - Multi-turn instruction following benchmark.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Scale MultiChallenge\"\n  },\n  {\n    \"model_benchmark_id\": 9043,\n    \"benchmark_id\": \"browsecomp\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.549,\n    \"normalized_score\": 0.549,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 with thinking mode enabled - Agentic search & browsing benchmark.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp\"\n  },\n  {\n    \"model_benchmark_id\": 9045,\n    \"benchmark_id\": \"collie\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.99,\n    \"normalized_score\": 0.99,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 with thinking mode enabled - Instruction-following in freeform writing.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"COLLIE\"\n  },\n  {\n    \"model_benchmark_id\": 10034,\n    \"benchmark_id\": \"multichallenge-(o3-mini-grader)\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.696,\n    \"normalized_score\": 0.696,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 with o3-mini grader - Multi-turn instruction following benchmark with improved grading accuracy.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"MultiChallenge (o3-mini grader)\"\n  },\n  {\n    \"model_benchmark_id\": 10035,\n    \"benchmark_id\": \"internal-api-instruction-following-(hard)\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.64,\n    \"normalized_score\": 0.64,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 - Internal API instruction following evaluation (hard difficulty).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Internal API instruction following (hard)\"\n  },\n  {\n    \"model_benchmark_id\": 9047,\n    \"benchmark_id\": \"tau2-airline\",\n    \"model_id\": 
\"gpt-5-2025-08-07\",\n    \"score\": 0.626,\n    \"normalized_score\": 0.626,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 - Function calling benchmark (airline domain).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 airline\"\n  },\n  {\n    \"model_benchmark_id\": 9049,\n    \"benchmark_id\": \"tau2-retail\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.811,\n    \"normalized_score\": 0.811,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 with thinking mode - Function calling benchmark (retail domain).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 retail\"\n  },\n  {\n    \"model_benchmark_id\": 9051,\n    \"benchmark_id\": \"tau2-telecom\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.967,\n    \"normalized_score\": 0.967,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 with thinking mode - Function calling benchmark (telecom domain).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 telecom\"\n  },\n  {\n    \"model_benchmark_id\": 9053,\n    \"benchmark_id\": \"mmmu-pro\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.784,\n    \"normalized_score\": 0.784,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 with thinking mode - Graduate-level visual problem-solving with advanced multimodal reasoning.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMMU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 9055,\n    \"benchmark_id\": \"videommmu\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.846,\n    \"normalized_score\": 0.846,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 with thinking mode - Video-based multimodal reasoning (max frame 256).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"VideoMMMU\"\n  },\n  {\n    \"model_benchmark_id\": 9057,\n    \"benchmark_id\": \"charxiv-r\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.811,\n    \"normalized_score\": 0.811,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 with thinking mode - Scientific figure reasoning and interpretation.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"CharXiv-R\"\n  },\n  {\n    \"model_benchmark_id\": 9059,\n    \"benchmark_id\": \"erqa\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.657,\n    \"normalized_score\": 0.657,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 with thinking mode - Multimodal spatial reasoning.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"ERQA\"\n  },\n  {\n    \"model_benchmark_id\": 10048,\n    \"benchmark_id\": \"openai-mrcr:-2-needle-128k\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.952,\n    \"normalized_score\": 0.952,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI-MRCR 2-needle retrieval at 128k tokens.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"OpenAI-MRCR: 2 needle 128k\"\n  },\n  {\n    \"model_benchmark_id\": 10049,\n    \"benchmark_id\": 
\"openai-mrcr:-2-needle-256k\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.868,\n    \"normalized_score\": 0.868,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI-MRCR 2-needle retrieval at 256k tokens.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"OpenAI-MRCR: 2 needle 256k\"\n  },\n  {\n    \"model_benchmark_id\": 10050,\n    \"benchmark_id\": \"graphwalks-bfs-<128k\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.783,\n    \"normalized_score\": 0.783,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Graphwalks BFS (<128k) long-context reasoning.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Graphwalks BFS <128k\"\n  },\n  {\n    \"model_benchmark_id\": 10051,\n    \"benchmark_id\": \"graphwalks-parents-<128k\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.733,\n    \"normalized_score\": 0.733,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Graphwalks parents (<128k) long-context reasoning.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Graphwalks parents <128k\"\n  },\n  {\n    \"model_benchmark_id\": 10052,\n    \"benchmark_id\": \"browsecomp-long-128k\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.9,\n    \"normalized_score\": 0.9,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"BrowseComp long-context 128k variant.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp Long Context 128k\"\n  },\n  {\n    \"model_benchmark_id\": 10053,\n    \"benchmark_id\": \"browsecomp-long-256k\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.888,\n    \"normalized_score\": 0.888,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"BrowseComp long-context 256k variant.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp Long Context 256k\"\n  },\n  {\n    \"model_benchmark_id\": 10054,\n    \"benchmark_id\": \"videomme-w-sub.\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.867,\n    \"normalized_score\": 0.867,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"VideoMME (long) with subtitles category.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"VideoMME w sub.\"\n  },\n  {\n    \"model_benchmark_id\": 10069,\n    \"benchmark_id\": \"longfact-concepts\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.007,\n    \"normalized_score\": 0.007,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode enabled for hallucination detection. Measured on open-source prompts for concept-based factual queries.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"LongFact-Concepts\"\n  },\n  {\n    \"model_benchmark_id\": 10070,\n    \"benchmark_id\": \"longfact-objects\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.008,\n    \"normalized_score\": 0.008,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode enabled for hallucination detection. 
Measured on open-source prompts for object-based factual queries.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"LongFact-Objects\"\n  },\n  {\n    \"model_benchmark_id\": 10071,\n    \"benchmark_id\": \"factscore\",\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"score\": 0.01,\n    \"normalized_score\": 0.01,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-5-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Thinking mode enabled for factual accuracy assessment. Measured hallucination rate on open-source prompts.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"FactScore\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/openai/models/gpt-5-2025-08-07/model.json",
    "content": "{\n  \"model_id\": \"gpt-5-2025-08-07\",\n  \"name\": \"GPT-5\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GPT-5 is our flagship model for coding, reasoning, and agentic tasks across domains. The best model for coding and agentic tasks with higher reasoning capabilities and medium speed.\",\n  \"release_date\": \"2025-08-07\",\n  \"announcement_date\": \"2025-08-07\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-09-30\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/gpt-5\",\n  \"source_playground\": \"https://platform.openai.com/playground?mode=chat&model=gpt-5\",\n  \"source_paper\": \"https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb52f/gpt5-system-card-aug7.pdf\",\n  \"source_scorecard_blog_link\": \"https://openai.com/index/gpt-5/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/openai/models/gpt-5-codex-2025-09-15/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 10100,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"gpt-5-codex-2025-09-15\",\n    \"score\": 0.745,\n    \"normalized_score\": 0.745,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-upgrades-to-codex/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 Codex specialized for code review and critical flaw detection with enhanced agentic coding capabilities.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-09-18T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-18T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/gpt-5-codex-2025-09-15/model.json",
    "content": "{\n  \"model_id\": \"gpt-5-codex-2025-09-15\",\n  \"name\": \"GPT-5 Codex\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GPT-5 Codex has been trained specifically for conducting code reviews and finding critical flaws. When reviewing, it navigates your codebase and analyzes code patterns to identify potential security vulnerabilities, performance issues, and bugs.\",\n  \"release_date\": \"2025-09-15\",\n  \"announcement_date\": \"2025-09-15\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2024-09-30\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": false,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/gpt-5-codex\",\n  \"source_playground\": \"https://platform.openai.com/playground?mode=chat&model=gpt-5-codex\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://openai.com/index/introducing-upgrades-to-codex/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-09-18T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-18T00:00:00.000000+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/gpt-5-mini-2025-08-07/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 9021,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gpt-5-mini-2025-08-07\",\n    \"score\": 0.911,\n    \"normalized_score\": 0.911,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 mini with thinking mode enabled (no tools) - competition mathematics.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 9025,\n    \"benchmark_id\": \"frontiermath\",\n    \"model_id\": \"gpt-5-mini-2025-08-07\",\n    \"score\": 0.221,\n    \"normalized_score\": 0.221,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 mini with thinking mode enabled (with python tool only) - FrontierMath Tier 1-3 expert-level mathematics.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"FrontierMath\"\n  },\n  {\n    \"model_benchmark_id\": 9033,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-5-mini-2025-08-07\",\n    \"score\": 0.823,\n    \"normalized_score\": 0.823,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 mini - Diamond thinking no tools\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9038,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gpt-5-mini-2025-08-07\",\n    \"score\": 0.167,\n    \"normalized_score\": 0.167,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 mini with thinking mode (no tools) - Full set of expert-level questions across subjects.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 9029,\n    \"benchmark_id\": \"hmmt-2025\",\n    \"model_id\": \"gpt-5-mini-2025-08-07\",\n    \"score\": 0.878,\n    \"normalized_score\": 0.878,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 mini with thinking mode enabled (no tools) - Harvard-MIT Mathematics Tournament.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"HMMT 2025\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/openai/models/gpt-5-mini-2025-08-07/model.json",
    "content": "{\n  \"model_id\": \"gpt-5-mini-2025-08-07\",\n  \"name\": \"GPT-5 mini\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A faster, more cost-efficient version of GPT-5 for well-defined tasks. Great for well-defined tasks and precise prompts with high reasoning capabilities at reduced cost.\",\n  \"release_date\": \"2025-08-07\",\n  \"announcement_date\": \"2025-08-07\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-05-30\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/gpt-5-mini\",\n  \"source_playground\": \"https://platform.openai.com/playground?mode=chat&model=gpt-5-mini\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://openai.com/index/gpt-5/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/openai/models/gpt-5-nano-2025-08-07/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 9022,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"gpt-5-nano-2025-08-07\",\n    \"score\": 0.852,\n    \"normalized_score\": 0.852,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 nano with thinking mode enabled (no tools) - competition mathematics.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 9026,\n    \"benchmark_id\": \"frontiermath\",\n    \"model_id\": \"gpt-5-nano-2025-08-07\",\n    \"score\": 0.096,\n    \"normalized_score\": 0.096,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 nano with thinking mode enabled (with python tool only) - FrontierMath Tier 1-3 expert-level mathematics.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"FrontierMath\"\n  },\n  {\n    \"model_benchmark_id\": 9034,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-5-nano-2025-08-07\",\n    \"score\": 0.712,\n    \"normalized_score\": 0.712,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 nano - Diamond thinking no tools\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9039,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gpt-5-nano-2025-08-07\",\n    \"score\": 0.087,\n    \"normalized_score\": 0.087,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 nano with thinking mode (no tools) - Full set of expert-level questions across subjects.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 9030,\n    \"benchmark_id\": \"hmmt-2025\",\n    \"model_id\": \"gpt-5-nano-2025-08-07\",\n    \"score\": 0.756,\n    \"normalized_score\": 0.756,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-5 nano with thinking mode enabled (no tools) - Harvard-MIT Mathematics Tournament.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"HMMT 2025\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/openai/models/gpt-5-nano-2025-08-07/model.json",
    "content": "{\n  \"model_id\": \"gpt-5-nano-2025-08-07\",\n  \"name\": \"GPT-5 nano\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GPT-5 nano is our fastest, cheapest version of GPT-5. It's great for summarization and classification tasks with average reasoning capabilities and very fast speed.\",\n  \"release_date\": \"2025-08-07\",\n  \"announcement_date\": \"2025-08-07\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-05-30\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/gpt-5-nano\",\n  \"source_playground\": \"https://platform.openai.com/playground?mode=chat&model=gpt-5-nano\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://openai.com/index/gpt-5/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/openai/models/gpt-oss-120b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 224,\n    \"benchmark_id\": \"codeforces\",\n    \"model_id\": \"gpt-oss-120b\",\n    \"score\": 0.874,\n    \"normalized_score\": 0.874,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Elo (with tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"Codeforces Competition code\"\n  },\n  {\n    \"model_benchmark_id\": 224,\n    \"benchmark_id\": \"codeforces\",\n    \"model_id\": \"gpt-oss-120b\",\n    \"score\": 0.821,\n    \"normalized_score\": 0.821,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Elo (without tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"Codeforces Competition code\"\n  },\n  {\n    \"model_benchmark_id\": 224,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gpt-oss-120b\",\n    \"score\": 0.19,\n    \"normalized_score\": 0.19,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy (with tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 224,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gpt-oss-120b\",\n    \"score\": 0.149,\n    \"normalized_score\": 0.149,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy (without tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 224,\n    \"benchmark_id\": \"healthbench\",\n    \"model_id\": \"gpt-oss-120b\",\n    \"score\": 0.576,\n    \"normalized_score\": 0.576,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"HealthBench - Realistic health conversations\"\n  },\n  {\n    \"model_benchmark_id\": 225,\n    \"benchmark_id\": \"healthbench-hard\",\n    \"model_id\": \"gpt-oss-120b\",\n    \"score\": 0.3,\n    \"normalized_score\": 0.3,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"HealthBench Hard - Challenging health conversations\"\n  },\n  {\n    \"model_benchmark_id\": 2226,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-oss-120b\",\n    \"score\": 0.801,\n    \"normalized_score\": 0.801,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Without tools\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 22226,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-oss-120b\",\n    \"score\": 0.9,\n    \"normalized_score\": 0.9,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Without tools\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"MMLU benchmark\"\n  },\n  {\n    \"model_benchmark_id\": 22226,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"gpt-oss-120b\",\n    \"score\": 0.678,\n    \"normalized_score\": 0.678,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"Function calling\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail benchmark\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/gpt-oss-120b/model.json",
    "content": "{\n  \"model_id\": \"gpt-oss-120b\",\n  \"name\": \"GPT OSS 120B\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GPT-OSS-120B is an open-weight, 116.8B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized to run on a single H100 GPU with native MXFP4 quantization. The model supports configurable reasoning depth, full chain-of-thought access, and native tool use, including function calling, browsing, and structured output generation. It achieves near-parity with OpenAI o4-mini on core reasoning benchmarks. Note: While referred to as '120b' for simplicity, it technically has 116.8B parameters.\",\n  \"release_date\": \"2025-08-05\",\n  \"announcement_date\": \"2025-08-05\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 116800000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://gpt-oss.com/\",\n  \"source_paper\": \"https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf\",\n  \"source_scorecard_blog_link\": \"https://openai.com/index/gpt-oss-model-card/\",\n  \"source_repo_link\": \"https://github.com/openai/gpt-oss\",\n  \"source_weights_link\": \"https://huggingface.co/openai/gpt-oss-120b\",\n  \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n  \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/openai/models/gpt-oss-20b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 224,\n    \"benchmark_id\": \"codeforces\",\n    \"model_id\": \"gpt-oss-20b\",\n    \"score\": 0.8387,\n    \"normalized_score\": 0.8387,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Elo (with tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"Codeforces Competition code\"\n  },\n  {\n    \"model_benchmark_id\": 224,\n    \"benchmark_id\": \"codeforces\",\n    \"model_id\": \"gpt-oss-20b\",\n    \"score\": 0.7433,\n    \"normalized_score\": 0.7433,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Elo (without tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"Codeforces Competition code\"\n  },\n  {\n    \"model_benchmark_id\": 224,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gpt-oss-20b\",\n    \"score\": 0.173,\n    \"normalized_score\": 0.173,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy (with tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 224,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"gpt-oss-20b\",\n    \"score\": 0.109,\n    \"normalized_score\": 0.109,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy (without tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 224,\n    \"benchmark_id\": \"healthbench\",\n    \"model_id\": \"gpt-oss-20b\",\n    \"score\": 0.425,\n    \"normalized_score\": 0.425,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"HealthBench - Realistic health conversations\"\n  },\n  {\n    \"model_benchmark_id\": 225,\n    \"benchmark_id\": \"healthbench-hard\",\n    \"model_id\": \"gpt-oss-20b\",\n    \"score\": 0.108,\n    \"normalized_score\": 0.108,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n   
 \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"HealthBench Hard - Challenging health conversations\"\n  },\n  {\n    \"model_benchmark_id\": 2226,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"gpt-oss-20b\",\n    \"score\": 0.715,\n    \"normalized_score\": 0.715,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond (without tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 22226,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"gpt-oss-20b\",\n    \"score\": 0.853,\n    \"normalized_score\": 0.853,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Without tools\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"MMLU benchmark\"\n  },\n  {\n    \"model_benchmark_id\": 22226,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"gpt-oss-20b\",\n    \"score\": 0.548,\n    \"normalized_score\": 0.548,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-oss/\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"Function calling\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail benchmark\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/gpt-oss-20b/model.json",
    "content": "{\n  \"model_id\": \"gpt-oss-20b\",\n  \"name\": \"GPT OSS 20B\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"The gpt-oss-20b model (technically 20.9B parameters) achieves near-parity with OpenAI o4-mini on core reasoning benchmarks, while running efficiently on a single 80 GB GPU. The gpt-oss-20b model delivers similar results to OpenAI o3‑mini on common benchmarks and can run on edge devices with just 16 GB of memory, making it ideal for on-device use cases, local inference, or rapid iteration without costly infrastructure. Both models also perform strongly on tool use, few-shot function calling, CoT reasoning (as seen in results on the Tau-Bench agentic evaluation suite) and HealthBench (even outperforming proprietary models like OpenAI o1 and GPT‑4o). Note: While referred to as '20b' for simplicity, it technically has 20.9B parameters.\",\n  \"release_date\": \"2025-08-05\",\n  \"announcement_date\": \"2025-08-05\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 20900000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://gpt-oss.com/\",\n  \"source_paper\": \"https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf\",\n  \"source_scorecard_blog_link\": \"https://openai.com/index/gpt-oss-model-card/\",\n  \"source_repo_link\": \"https://github.com/openai/gpt-oss\",\n  \"source_weights_link\": \"https://huggingface.co/openai/gpt-oss-20b\",\n  \"created_at\": \"2025-08-05T19:49:05.852855+00:00\",\n  \"updated_at\": \"2025-08-05T19:49:05.852855+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/openai/models/o1-2024-12-17/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 490,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.743,\n    \"normalized_score\": 0.743,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.025628+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.025628+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 1831,\n    \"benchmark_id\": \"frontiermath\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.055,\n    \"normalized_score\": 0.055,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/o1-and-new-tools-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.186673+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.186673+00:00\",\n    \"benchmark_name\": \"FrontierMath\"\n  },\n  {\n    \"model_benchmark_id\": 358,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.78,\n    \"normalized_score\": 0.78,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o3-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.768954+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.768954+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1911,\n    \"benchmark_id\": \"gpqa-biology\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.692,\n    \"normalized_score\": 0.692,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.394088+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.394088+00:00\",\n    \"benchmark_name\": \"GPQA Biology\"\n  },\n  {\n    \"model_benchmark_id\": 1912,\n    \"benchmark_id\": \"gpqa-chemistry\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.647,\n    \"normalized_score\": 0.647,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.399030+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.399030+00:00\",\n    \"benchmark_name\": \"GPQA Chemistry\"\n  },\n  {\n    \"model_benchmark_id\": 1913,\n    \"benchmark_id\": \"gpqa-physics\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.928,\n    \"normalized_score\": 0.928,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:15.403790+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.403790+00:00\",\n    \"benchmark_name\": \"GPQA Physics\"\n  },\n  {\n    \"model_benchmark_id\": 1016,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.971,\n    \"normalized_score\": 0.971,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.116437+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.116437+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 814,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.881,\n    \"normalized_score\": 0.881,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.696047+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.696047+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 755,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.67,\n    \"normalized_score\": 0.67,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o3-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"coding\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.587814+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.587814+00:00\",\n    \"benchmark_name\": \"LiveBench\"\n  },\n  {\n    \"model_benchmark_id\": 428,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.964,\n    \"normalized_score\": 0.964,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.905279+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.905279+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 546,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.718,\n    \"normalized_score\": 0.718,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.126058+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.126058+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1298,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.893,\n    \"normalized_score\": 0.893,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/o1-and-new-tools-for-developers/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.715686+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.715686+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 125,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.918,\n    \"normalized_score\": 0.918,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.330211+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.330211+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1486,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.877,\n    \"normalized_score\": 0.877,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.165932+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.165932+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 596,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.776,\n    \"normalized_score\": 0.776,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.228467+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.228467+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 241,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.47,\n    \"normalized_score\": 0.47,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-4-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.561209+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.561209+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 1361,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.41,\n    \"normalized_score\": 0.41,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"verified\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.865799+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.865799+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1783,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.5,\n    \"normalized_score\": 0.5,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"agents\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.021642+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.021642+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1769,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"o1-2024-12-17\",\n    \"score\": 0.708,\n    \"normalized_score\": 0.708,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"agents\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.992114+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.992114+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/o1-2024-12-17/model.json",
    "content": "{\n  \"model_id\": \"o1-2024-12-17\",\n  \"name\": \"o1\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A research preview model focused on mathematical and logical reasoning capabilities, demonstrating improved performance on tasks requiring step-by-step reasoning, mathematical problem-solving, and code generation. The model shows enhanced capabilities in formal reasoning while maintaining strong general capabilities.\",\n  \"release_date\": \"2024-12-17\",\n  \"announcement_date\": \"2024-12-17\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://cdn.openai.com/o1-system-card-20240917.pdf\",\n  \"source_scorecard_blog_link\": \"https://openai.com/index/learning-to-reason-with-llms\",\n  \"source_repo_link\": \"https://openai.com/index/o1-and-new-tools-for-developers/\",\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.855348+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.855348+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/o1-mini/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1910,\n    \"benchmark_id\": \"cybersecurity-ctfs\",\n    \"model_id\": \"o1-mini\",\n    \"score\": 0.287,\n    \"normalized_score\": 0.287,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@12 accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.390045+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.390045+00:00\",\n    \"benchmark_name\": \"Cybersecurity CTFs\"\n  },\n  {\n    \"model_benchmark_id\": 356,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"o1-mini\",\n    \"score\": 0.6,\n    \"normalized_score\": 0.6,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond, 0-shot Chain of Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.765864+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.765864+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 812,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"o1-mini\",\n    \"score\": 0.924,\n    \"normalized_score\": 0.924,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1 accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.692107+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.692107+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 513,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"o1-mini\",\n    \"score\": 0.9,\n    \"normalized_score\": 0.9,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain of Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.065288+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.065288+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    \"model_benchmark_id\": 123,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"o1-mini\",\n    \"score\": 0.852,\n    \"normalized_score\": 0.852,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot Chain of Thought\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.327239+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.327239+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1909,\n    \"benchmark_id\": \"superglue\",\n    \"model_id\": \"o1-mini\",\n    \"score\": 0.75,\n    \"normalized_score\": 0.75,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": 
\"Evaluation on validation set\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.385801+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.385801+00:00\",\n    \"benchmark_name\": \"SuperGLUE\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/o1-mini/model.json",
    "content": "{\n  \"model_id\": \"o1-mini\",\n  \"name\": \"o1-mini\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"o1-mini is a cost-efficient language model developed by OpenAI, designed for advanced reasoning tasks while minimizing computational resources.\",\n  \"release_date\": \"2024-09-12\",\n  \"announcement_date\": \"2024-09-12\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://openai.com/api/o1-mini\",\n  \"source_playground\": \"https://platform.openai.com/playground\",\n  \"source_paper\": \"https://cdn.openai.com/o1-system-card-20240917.pdf\",\n  \"source_scorecard_blog_link\": \"https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.850010+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.850010+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/o1-preview/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 491,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"o1-preview\",\n    \"score\": 0.42,\n    \"normalized_score\": 0.42,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.027037+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.027037+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 360,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"o1-preview\",\n    \"score\": 0.733,\n    \"normalized_score\": 0.733,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.772534+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.772534+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 756,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"o1-preview\",\n    \"score\": 0.523,\n    \"normalized_score\": 0.523,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Coding\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.589687+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.589687+00:00\",\n    \"benchmark_name\": \"LiveBench\"\n  },\n  {\n    \"model_benchmark_id\": 430,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"o1-preview\",\n    \"score\": 0.855,\n    \"normalized_score\": 0.855,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.910412+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.910412+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1300,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"o1-preview\",\n    \"score\": 0.908,\n    \"normalized_score\": 0.908,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.718867+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.718867+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 127,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"o1-preview\",\n    \"score\": 0.908,\n    \"normalized_score\": 0.908,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:11.333269+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.333269+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 242,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"o1-preview\",\n    \"score\": 0.424,\n    \"normalized_score\": 0.424,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Factuality\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.562695+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.562695+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 1362,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"o1-preview\",\n    \"score\": 0.413,\n    \"normalized_score\": 0.413,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/learning-to-reason-with-llms/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Verified\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.867753+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.867753+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/o1-preview/model.json",
    "content": "{\n  \"model_id\": \"o1-preview\",\n  \"name\": \"o1-preview\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A research preview model focused on mathematical and logical reasoning capabilities, demonstrating improved performance on tasks requiring step-by-step reasoning, mathematical problem-solving, and code generation. The model shows enhanced capabilities in formal reasoning while maintaining strong general capabilities.\",\n  \"release_date\": \"2024-09-12\",\n  \"announcement_date\": \"2024-09-12\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://cdn.openai.com/o1-system-card-20240917.pdf\",\n  \"source_scorecard_blog_link\": \"https://openai.com/index/learning-to-reason-with-llms\",\n  \"source_repo_link\": \"https://github.com/openai\",\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.862671+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.862671+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/o1-pro/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 487,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"o1-pro\",\n    \"score\": 0.86,\n    \"normalized_score\": 0.86,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-chatgpt-pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1 accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.021363+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.021363+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 354,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"o1-pro\",\n    \"score\": 0.79,\n    \"normalized_score\": 0.79,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-chatgpt-pro/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Diamond, Pass@1 accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.762804+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.762804+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/o1-pro/model.json",
    "content": "{\n  \"model_id\": \"o1-pro\",\n  \"name\": \"o1-pro\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"o1-pro is OpenAI's advanced language model optimized for complex reasoning and specialized professional tasks, offering enhanced capabilities while maintaining high efficiency.\",\n  \"release_date\": \"2024-12-17\",\n  \"announcement_date\": \"2024-12-17\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2023-09-30\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://openai.com/api\",\n  \"source_playground\": \"https://platform.openai.com/playground\",\n  \"source_paper\": \"https://cdn.openai.com/o1-system-card-20240917.pdf\",\n  \"source_scorecard_blog_link\": \"https://openai.com/index/introducing-chatgpt-pro/\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.844613+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.844613+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/o3-2025-04-16/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 666,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.813,\n    \"normalized_score\": 0.813,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (whole)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.380617+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.380617+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 481,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.916,\n    \"normalized_score\": 0.916,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (no tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.012342+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.012342+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 705,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.864,\n    \"normalized_score\": 0.864,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1 (no tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.475926+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.475926+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1832,\n    \"benchmark_id\": \"arc-agi\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.88,\n    \"normalized_score\": 0.88,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.youtube.com/live/SKBG1sqdyIU?si=lWccKHt8bnttuYta\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test set evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.190370+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.190370+00:00\",\n    \"benchmark_name\": \"ARC-AGI\"\n  },\n  {\n    \"model_benchmark_id\": 1389,\n    \"benchmark_id\": \"arc-agi-v2\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.065,\n    \"normalized_score\": 0.065,\n    \"is_self_reported\": false,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.925569+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.925569+00:00\",\n    \"benchmark_name\": \"ARC-AGI v2\"\n  },\n  {\n    \"model_benchmark_id\": 1842,\n    \"benchmark_id\": \"browsecomp\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.497,\n    \"normalized_score\": 0.497,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (with python + browsing)\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.215315+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.215315+00:00\",\n    \"benchmark_name\": \"BrowseComp\"\n  },\n  {\n    \"model_benchmark_id\": 1833,\n    \"benchmark_id\": \"charxiv-r\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.786,\n    \"normalized_score\": 0.786,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 with thinking mode - Scientific figure reasoning and interpretation.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.193874+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.193874+00:00\",\n    \"benchmark_name\": \"CharXiv-R\"\n  },\n  {\n    \"model_benchmark_id\": 1829,\n    \"benchmark_id\": \"frontiermath\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.158,\n    \"normalized_score\": 0.158,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://www.youtube.com/live/SKBG1sqdyIU?si=lWccKHt8bnttuYta\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.181554+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.181554+00:00\",\n    \"benchmark_name\": \"FrontierMath\"\n  },\n  {\n    \"model_benchmark_id\": 347,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.833,\n    \"normalized_score\": 0.833,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 - Diamond thinking no tools\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.750986+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.750986+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 725,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.202,\n    \"normalized_score\": 0.202,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (no tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.526631+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.526631+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 2001,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.243,\n    \"normalized_score\": 0.243,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 with thinking mode enabled (Python + browser tools) - Full set of expert-level questions across subjects.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    
\"model_benchmark_id\": 2002,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.147,\n    \"normalized_score\": 0.147,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 with thinking mode enabled (no tools) - Full set of expert-level questions across subjects.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 538,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.868,\n    \"normalized_score\": 0.868,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.112692+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.112692+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 589,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.829,\n    \"normalized_score\": 0.829,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 with thinking mode - College-level visual problem-solving with multimodal reasoning.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.211231+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.211231+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1840,\n    \"benchmark_id\": \"scale-multichallenge\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.565,\n    \"normalized_score\": 0.565,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.208929+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.208929+00:00\",\n    \"benchmark_name\": \"Scale MultiChallenge\"\n  },\n  {\n    \"model_benchmark_id\": 2004,\n    \"benchmark_id\": \"scale-multichallenge\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.604,\n    \"normalized_score\": 0.604,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 with thinking mode enabled - Multi-turn instruction following benchmark.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Scale MultiChallenge\"\n  },\n  {\n    \"model_benchmark_id\": 2006,\n    \"benchmark_id\": \"collie\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.984,\n    \"normalized_score\": 0.984,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 with thinking mode enabled - Instruction-following in freeform writing.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"COLLIE\"\n  },\n  {\n    \"model_benchmark_id\": 2007,\n    \"benchmark_id\": \"tau2-airline\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.648,\n    \"normalized_score\": 0.648,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 with thinking mode - Function calling benchmark (airline domain).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 airline\"\n  },\n  {\n    \"model_benchmark_id\": 2008,\n    \"benchmark_id\": \"tau2-retail\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.802,\n    \"normalized_score\": 0.802,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 with thinking mode - Function calling benchmark (retail domain).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 retail\"\n  },\n  {\n    \"model_benchmark_id\": 2009,\n    \"benchmark_id\": 
\"tau2-telecom\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.582,\n    \"normalized_score\": 0.582,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 with thinking mode - Function calling benchmark (telecom domain).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"Tau2 telecom\"\n  },\n  {\n    \"model_benchmark_id\": 2010,\n    \"benchmark_id\": \"mmmu-pro\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.764,\n    \"normalized_score\": 0.764,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 with thinking mode - Graduate-level visual problem-solving with advanced multimodal reasoning.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMMU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 2011,\n    \"benchmark_id\": \"videommmu\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.833,\n    \"normalized_score\": 0.833,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 with thinking mode - Video-based multimodal reasoning (max frame 256).\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"VideoMMMU\"\n  },\n  {\n    \"model_benchmark_id\": 2012,\n    \"benchmark_id\": \"erqa\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.64,\n    \"normalized_score\": 0.64,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenAI o3 with thinking mode - Multimodal spatial reasoning.\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"benchmark_name\": \"ERQA\"\n  },\n  {\n    \"model_benchmark_id\": 1354,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.691,\n    \"normalized_score\": 0.691,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.851256+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.851256+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1844,\n    \"benchmark_id\": \"tau-bench\",\n    \"model_id\": \"o3-2025-04-16\",\n    \"score\": 0.63,\n    \"normalized_score\": 0.63,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (avg 
Airline/Retail)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.221470+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.221470+00:00\",\n    \"benchmark_name\": \"Tau-bench\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/openai/models/o3-2025-04-16/model.json",
    "content": "{\n  \"model_id\": \"o3-2025-04-16\",\n  \"name\": \"o3\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"OpenAI's most powerful reasoning model. o3 is a well-rounded and powerful model across domains. It sets a new standard for math, science, coding, and visual reasoning tasks. It also excels at technical writing and instruction-following. Use it to think through multi-step problems that involve analysis across text, code, and images.\",\n  \"release_date\": \"2025-04-16\",\n  \"announcement_date\": \"2025-04-16\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-05-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/o3\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.818000+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.818000+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/o3-mini/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 670,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.667,\n    \"normalized_score\": 0.667,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.387419+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.387419+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 1334,\n    \"benchmark_id\": \"aider-polyglot-edit\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.604,\n    \"normalized_score\": 0.604,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.806560+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.806560+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot Edit\"\n  },\n  {\n    \"model_benchmark_id\": 485,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.873,\n    \"normalized_score\": 0.873,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o3-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"test set evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.018382+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.018382+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 1859,\n    \"benchmark_id\": \"collie\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.987,\n    \"normalized_score\": 0.987,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.259314+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.259314+00:00\",\n    \"benchmark_name\": \"COLLIE\"\n  },\n  {\n    \"model_benchmark_id\": 1894,\n    \"benchmark_id\": \"complexfuncbench\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.176,\n    \"normalized_score\": 0.176,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.344047+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.344047+00:00\",\n    \"benchmark_name\": \"ComplexFuncBench\"\n  },\n  {\n    \"model_benchmark_id\": 1830,\n    \"benchmark_id\": \"frontiermath\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.092,\n    \"normalized_score\": 0.092,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o3-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass @ 1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:15.183728+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.183728+00:00\",\n    \"benchmark_name\": \"FrontierMath\"\n  },\n  {\n    \"model_benchmark_id\": 351,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.772,\n    \"normalized_score\": 0.772,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"diamond\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.758026+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.758026+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1904,\n    \"benchmark_id\": \"graphwalks-bfs-<128k\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.51,\n    \"normalized_score\": 0.51,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.368369+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.368369+00:00\",\n    \"benchmark_name\": \"Graphwalks BFS <128k\"\n  },\n  {\n    \"model_benchmark_id\": 1880,\n    \"benchmark_id\": \"graphwalks-parents-<128k\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.583,\n    \"normalized_score\": 0.583,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.310391+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.310391+00:00\",\n    \"benchmark_name\": \"Graphwalks parents <128k\"\n  },\n  {\n    \"model_benchmark_id\": 634,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.939,\n    \"normalized_score\": 0.939,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.302770+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.302770+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 1847,\n    \"benchmark_id\": \"internal-api-instruction-following-(hard)\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.5,\n    \"normalized_score\": 0.5,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.228737+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.228737+00:00\",\n    \"benchmark_name\": \"Internal API instruction following (hard)\"\n  },\n  {\n    \"model_benchmark_id\": 754,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.846,\n    \"normalized_score\": 0.846,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o3-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"o3-mini high\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.585789+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.585789+00:00\",\n    \"benchmark_name\": \"LiveBench\"\n  },\n  {\n    \"model_benchmark_id\": 426,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.979,\n    \"normalized_score\": 0.979,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o3-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"o3-mini high\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.901889+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.901889+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1296,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.92,\n    \"normalized_score\": 0.92,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o3-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"o3-mini high\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.712633+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.712633+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 119,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.869,\n    \"normalized_score\": 0.869,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o3-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"o3-mini high\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.320589+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.320589+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 742,\n    \"benchmark_id\": \"multichallenge\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.399,\n    \"normalized_score\": 0.399,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.560158+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.560158+00:00\",\n    \"benchmark_name\": \"MultiChallenge\"\n  },\n  {\n    \"model_benchmark_id\": 1853,\n    \"benchmark_id\": \"multichallenge-(o3-mini-grader)\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.502,\n    \"normalized_score\": 0.502,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.243415+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.243415+00:00\",\n    \"benchmark_name\": \"MultiChallenge (o3-mini grader)\"\n  },\n  {\n    \"model_benchmark_id\": 1652,\n    \"benchmark_id\": \"multi-if\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.795,\n    \"normalized_score\": 0.795,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.646496+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.646496+00:00\",\n    \"benchmark_name\": \"Multi-IF\"\n  },\n  {\n    \"model_benchmark_id\": 1474,\n    \"benchmark_id\": \"multilingual-mmlu\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.807,\n    \"normalized_score\": 0.807,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.143822+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.143822+00:00\",\n    \"benchmark_name\": \"Multilingual MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1865,\n    \"benchmark_id\": \"openai-mrcr:-2-needle-128k\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.187,\n    \"normalized_score\": 0.187,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.274261+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.274261+00:00\",\n    \"benchmark_name\": \"OpenAI-MRCR: 2 needle 128k\"\n  },\n  {\n    \"model_benchmark_id\": 238,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.15,\n    \"normalized_score\": 0.15,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://openai.com/index/introducing-gpt-4-5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.554563+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.554563+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 1357,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.493,\n    \"normalized_score\": 0.493,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/openai-o3-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"verified\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.856039+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.856039+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1898,\n    \"benchmark_id\": \"swe-lancer\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.18,\n    \"normalized_score\": 0.18,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"percentage score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.355089+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.355089+00:00\",\n    \"benchmark_name\": \"SWE-Lancer\"\n  },\n  {\n    \"model_benchmark_id\": 1901,\n    \"benchmark_id\": \"swe-lancer-(ic-diamond-subset)\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.074,\n    
\"normalized_score\": 0.074,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"percentage score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.362026+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.362026+00:00\",\n    \"benchmark_name\": \"SWE-Lancer (IC-Diamond subset)\"\n  },\n  {\n    \"model_benchmark_id\": 1779,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.324,\n    \"normalized_score\": 0.324,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.013372+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.013372+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1765,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"o3-mini\",\n    \"score\": 0.576,\n    \"normalized_score\": 0.576,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/gpt-4-1/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"benchmark score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.984653+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.984653+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  }\n]"
  },
  {
    "path": "data/organizations/openai/models/o3-mini/model.json",
    "content": "{\n  \"model_id\": \"o3-mini\",\n  \"name\": \"o3-mini\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A smaller variant of O3, expected to offer enhanced multimodal capabilities, improved reasoning, and more efficient resource utilization compared to previous models while maintaining strong performance on core tasks.\",\n  \"release_date\": \"2025-01-30\",\n  \"announcement_date\": \"2025-01-30\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2023-09-30\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://cdn.openai.com/o3-mini-system-card.pdf\",\n  \"source_scorecard_blog_link\": \"https://openai.com/index/openai-o3-mini/\",\n  \"source_repo_link\": \"https://github.com/openai\",\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.835007+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.835007+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/o3-pro-2025-06-10/model.json",
    "content": "{\n  \"model_id\": \"o3-pro-2025-06-10\",\n  \"name\": \"o3-pro\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Version of o3 with more compute for better responses. The o3-pro model uses more compute to think harder and provide consistently better answers. Designed to tackle tough problems with advanced reasoning capabilities.\",\n  \"release_date\": \"2025-06-10\",\n  \"announcement_date\": \"2025-06-10\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-05-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/o3-pro\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.832229+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.832229+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/models/o4-mini/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 668,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.689,\n    \"normalized_score\": 0.689,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (whole, o4-mini-high)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.384371+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.384371+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 1332,\n    \"benchmark_id\": \"aider-polyglot-edit\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.582,\n    \"normalized_score\": 0.582,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (diff, o4-mini-high)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.803065+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.803065+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot Edit\"\n  },\n  {\n    \"model_benchmark_id\": 483,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.934,\n    \"normalized_score\": 0.934,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (no tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:12.015345+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.015345+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 706,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.927,\n    \"normalized_score\": 0.927,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (no tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.477657+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.477657+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1843,\n    \"benchmark_id\": \"browsecomp\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.515,\n    \"normalized_score\": 0.515,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (with python + browsing)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.217475+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.217475+00:00\",\n    \"benchmark_name\": \"BrowseComp\"\n  },\n  {\n    \"model_benchmark_id\": 1835,\n    \"benchmark_id\": \"charxiv-r\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.72,\n    \"normalized_score\": 0.72,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.197036+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.197036+00:00\",\n    \"benchmark_name\": \"CharXiv-R\"\n  },\n  {\n    \"model_benchmark_id\": 349,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.814,\n    \"normalized_score\": 0.814,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"diamond accuracy (no tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.754610+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.754610+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 726,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.147,\n    \"normalized_score\": 0.147,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (no tools)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.528160+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.528160+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 540,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.843,\n    \"normalized_score\": 0.843,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.115868+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.115868+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 591,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.816,\n    \"normalized_score\": 0.816,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.218993+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.218993+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1841,\n    \"benchmark_id\": \"scale-multichallenge\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.43,\n    \"normalized_score\": 0.43,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.211372+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.211372+00:00\",\n    \"benchmark_name\": \"Scale MultiChallenge\"\n  },\n  {\n    \"model_benchmark_id\": 1356,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.681,\n    \"normalized_score\": 0.681,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.854236+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.854236+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  },\n  {\n    \"model_benchmark_id\": 1777,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.492,\n    \"normalized_score\": 0.492,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (o4-mini-high)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.009611+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.009611+00:00\",\n    \"benchmark_name\": \"TAU-bench Airline\"\n  },\n  {\n    \"model_benchmark_id\": 1763,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"o4-mini\",\n    \"score\": 0.718,\n    \"normalized_score\": 0.718,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://openai.com/index/introducing-o3-and-o4-mini/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy (o4-mini-high)\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.980200+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.980200+00:00\",\n    \"benchmark_name\": \"TAU-bench Retail\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/openai/models/o4-mini/model.json",
    "content": "{\n  \"model_id\": \"o4-mini\",\n  \"name\": \"o4-mini\",\n  \"organization_id\": \"openai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"o4-mini is OpenAI's latest small o-series model, optimized for fast, effective reasoning with exceptionally efficient performance in coding and visual tasks. It is faster and more affordable than o3.\",\n  \"release_date\": \"2025-04-16\",\n  \"announcement_date\": \"2025-04-16\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-05-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://platform.openai.com/docs/models/o4-mini\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/openai\",\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.824485+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.824485+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/openai/organization.json",
    "content": "{\n  \"organization_id\": \"openai\",\n  \"name\": \"OpenAI\",\n  \"website\": \"https://openai.com\",\n  \"description\": \"Leading AI research company\",\n  \"country\": \"US\",\n  \"created_at\": \"2025-07-19T19:49:05.815252+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.815252+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/qwen/models/qvq-72b-preview/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1675,\n    \"benchmark_id\": \"mathvision\",\n    \"model_id\": \"qvq-72b-preview\",\n    \"score\": 0.359,\n    \"normalized_score\": 0.359,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/QVQ-72B-Preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"full\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.700746+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.700746+00:00\",\n    \"benchmark_name\": \"MathVision\"\n  },\n  {\n    \"model_benchmark_id\": 526,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"qvq-72b-preview\",\n    \"score\": 0.714,\n    \"normalized_score\": 0.714,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/QVQ-72B-Preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"mini\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.092107+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.092107+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 570,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"qvq-72b-preview\",\n    \"score\": 0.703,\n    \"normalized_score\": 0.703,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/QVQ-72B-Preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"val\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.173084+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.173084+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1716,\n    \"benchmark_id\": \"olympiadbench\",\n    \"model_id\": \"qvq-72b-preview\",\n    \"score\": 0.204,\n    \"normalized_score\": 0.204,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/QVQ-72B-Preview\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"full\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.824642+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.824642+00:00\",\n    \"benchmark_name\": \"OlympiadBench\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qvq-72b-preview/model.json",
    "content": "{\n  \"model_id\": \"qvq-72b-preview\",\n  \"name\": \"QvQ-72B-Preview\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": \"qwen2-vl-72b\",\n  \"description\": \"An experimental research model focusing on advanced visual reasoning and step-by-step cognitive capabilities. Achieves strong performance on multi-modal science and mathematics tasks, though exhibits some limitations such as potential language mixing and recursive reasoning loops.\",\n  \"release_date\": \"2024-12-25\",\n  \"announcement_date\": \"2024-12-25\",\n  \"license_id\": \"qwen\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 73400000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/Qwen/QVQ-72B-Preview\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qvq-72b-preview/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/QVQ-72B-Preview\",\n  \"created_at\": \"2025-07-19T19:49:05.895366+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.895366+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen-2.5-14b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 21,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.673,\n    \"normalized_score\": 0.673,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"ARC-C benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.127541+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.127541+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 971,\n    \"benchmark_id\": \"bbh\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.782,\n    \"normalized_score\": 0.782,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"BBH benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.042167+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.042167+00:00\",\n    \"benchmark_name\": \"BBH\"\n  },\n  {\n    \"model_benchmark_id\": 301,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.455,\n    \"normalized_score\": 0.455,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPQA benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.677954+00:00\",\n  
  \"updated_at\": \"2025-07-19T19:56:11.677954+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 994,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.948,\n    \"normalized_score\": 0.948,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GSM8K benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.082212+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.082212+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 786,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.835,\n    \"normalized_score\": 0.835,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"HumanEval benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.646500+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.646500+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1441,\n    \"benchmark_id\": \"humaneval+\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.512,\n    \"normalized_score\": 0.512,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"HumanEval+ benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.071967+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.071967+00:00\",\n    \"benchmark_name\": \"HumanEval+\"\n  },\n  {\n    \"model_benchmark_id\": 404,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.8,\n    \"normalized_score\": 0.8,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MATH benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.862254+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.862254+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1185,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.82,\n    \"normalized_score\": 0.82,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MBPP benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.497488+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.497488+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1602,\n    \"benchmark_id\": \"mbpp+\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.632,\n    \"normalized_score\": 0.632,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MBPP+ benchmark 
evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.507421+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.507421+00:00\",\n    \"benchmark_name\": \"MBPP+\"\n  },\n  {\n    \"model_benchmark_id\": 89,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.797,\n    \"normalized_score\": 0.797,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MMLU benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.269091+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.269091+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 194,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.637,\n    \"normalized_score\": 0.637,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MMLU-Pro benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.471047+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.471047+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 731,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.8,\n    \"normalized_score\": 0.8,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MMLU-redux benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.538944+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.538944+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 1600,\n    \"benchmark_id\": \"mmlu-stem\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.764,\n    \"normalized_score\": 0.764,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MMLU-STEM benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.500528+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.500528+00:00\",\n    \"benchmark_name\": \"MMLU-STEM\"\n  },\n  {\n    \"model_benchmark_id\": 642,\n    \"benchmark_id\": \"multipl-e\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.728,\n    \"normalized_score\": 0.728,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MultiPL-E benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.319213+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.319213+00:00\",\n    \"benchmark_name\": \"MultiPL-E\"\n  },\n  {\n    \"model_benchmark_id\": 1597,\n    \"benchmark_id\": \"theoremqa\",\n    \"model_id\": 
\"qwen-2.5-14b-instruct\",\n    \"score\": 0.43,\n    \"normalized_score\": 0.43,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"TheoremQA benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.492163+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.492163+00:00\",\n    \"benchmark_name\": \"TheoremQA\"\n  },\n  {\n    \"model_benchmark_id\": 138,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"qwen-2.5-14b-instruct\",\n    \"score\": 0.584,\n    \"normalized_score\": 0.584,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"TruthfulQA benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.355004+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.355004+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen-2.5-14b-instruct/model.json",
    "content": "{\n  \"model_id\": \"qwen-2.5-14b-instruct\",\n  \"name\": \"Qwen2.5 14B Instruct\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen2.5-14B-Instruct is an instruction-tuned 14.7B parameter language model, part of the Qwen2.5 series. It features significant improvements in instruction following, long text generation (8K+ tokens), structured data understanding, and JSON output generation. The model supports a 128K token context length and multilingual capabilities across 29+ languages including Chinese, English, French, Spanish, and more.\",\n  \"release_date\": \"2024-09-19\",\n  \"announcement_date\": \"2024-09-19\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 14700000000,\n  \"training_tokens\": 18000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.alibabacloud.com/help/en/model-studio/developer-reference/use-qwen-by-calling-api\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/abs/2407.10671\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2.5\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2.5-14B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.615575+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.615575+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen-2.5-32b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 18,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.704,\n    \"normalized_score\": 0.704,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"ARC-C benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.121747+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.121747+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 970,\n    \"benchmark_id\": \"bbh\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.845,\n    \"normalized_score\": 0.845,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"BBH benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.040428+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.040428+00:00\",\n    \"benchmark_name\": \"BBH\"\n  },\n  {\n    \"model_benchmark_id\": 297,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.495,\n    \"normalized_score\": 0.495,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPQA benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.671178+00:00\",\n  
  \"updated_at\": \"2025-07-19T19:56:11.671178+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 990,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.959,\n    \"normalized_score\": 0.959,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GSM8K benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.074870+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.074870+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 45,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.852,\n    \"normalized_score\": 0.852,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"HellaSwag benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.178158+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.178158+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 782,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.884,\n    \"normalized_score\": 0.884,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"HumanEval benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.639922+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.639922+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1440,\n    \"benchmark_id\": \"humaneval+\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.524,\n    \"normalized_score\": 0.524,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"HumanEval+ benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.070409+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.070409+00:00\",\n    \"benchmark_name\": \"HumanEval+\"\n  },\n  {\n    \"model_benchmark_id\": 400,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.831,\n    \"normalized_score\": 0.831,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MATH benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.856115+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.856115+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1181,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.84,\n    \"normalized_score\": 0.84,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MBPP 
benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.489427+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.489427+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1601,\n    \"benchmark_id\": \"mbpp+\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.672,\n    \"normalized_score\": 0.672,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MBPP+ benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.504915+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.504915+00:00\",\n    \"benchmark_name\": \"MBPP+\"\n  },\n  {\n    \"model_benchmark_id\": 85,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.833,\n    \"normalized_score\": 0.833,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MMLU benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.261705+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.261705+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 190,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.69,\n    \"normalized_score\": 0.69,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MMLU-Pro benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.465052+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.465052+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 728,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.839,\n    \"normalized_score\": 0.839,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MMLU-redux benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.533630+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.533630+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 1599,\n    \"benchmark_id\": \"mmlu-stem\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.809,\n    \"normalized_score\": 0.809,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MMLU-STEM benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.498255+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.498255+00:00\",\n    \"benchmark_name\": \"MMLU-STEM\"\n  },\n  {\n    \"model_benchmark_id\": 640,\n    \"benchmark_id\": \"multipl-e\",\n    \"model_id\": 
\"qwen-2.5-32b-instruct\",\n    \"score\": 0.754,\n    \"normalized_score\": 0.754,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MultiPL-E benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.316384+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.316384+00:00\",\n    \"benchmark_name\": \"MultiPL-E\"\n  },\n  {\n    \"model_benchmark_id\": 1593,\n    \"benchmark_id\": \"theoremqa\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.441,\n    \"normalized_score\": 0.441,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"TheoremQA benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.482526+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.482526+00:00\",\n    \"benchmark_name\": \"TheoremQA\"\n  },\n  {\n    \"model_benchmark_id\": 135,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.578,\n    \"normalized_score\": 0.578,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"TruthfulQA benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.349397+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.349397+00:00\",\n    
\"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 150,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"qwen-2.5-32b-instruct\",\n    \"score\": 0.82,\n    \"normalized_score\": 0.82,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Winogrande benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.384431+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.384431+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen-2.5-32b-instruct/model.json",
    "content": "{\n  \"model_id\": \"qwen-2.5-32b-instruct\",\n  \"name\": \"Qwen2.5 32B Instruct\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen2.5-32B-Instruct is an instruction-tuned 32-billion-parameter language model, part of the Qwen2.5 series. It is designed to follow instructions, generate long texts (over 8K tokens), understand structured data (e.g., tables), and generate structured outputs, especially JSON. The model supports over 29 languages.\",\n  \"release_date\": \"2024-09-19\",\n  \"announcement_date\": \"2024-09-19\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 32500000000,\n  \"training_tokens\": 18000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.alibabacloud.com/help/en/model-studio/developer-reference/use-qwen-by-calling-api\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2.5\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2.5-32B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.606261+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.606261+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen-2.5-72b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1617,\n    \"benchmark_id\": \"alignbench\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.816,\n    \"normalized_score\": 0.816,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"AlignBench v1.1 benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.546122+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.546122+00:00\",\n    \"benchmark_name\": \"AlignBench\"\n  },\n  {\n    \"model_benchmark_id\": 1453,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.812,\n    \"normalized_score\": 0.812,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Arena Hard benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.097075+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.097075+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 303,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.49,\n    \"normalized_score\": 0.49,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPQA benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:11.681073+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.681073+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 996,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.958,\n    \"normalized_score\": 0.958,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GSM8K benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.085236+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.085236+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 787,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.866,\n    \"normalized_score\": 0.866,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"HumanEval benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.648406+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.648406+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 620,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.841,\n    \"normalized_score\": 0.841,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"IFEval strict-prompt benchmark evaluation\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.277303+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.277303+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 750,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.523,\n    \"normalized_score\": 0.523,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"LiveBench benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.577555+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.577555+00:00\",\n    \"benchmark_name\": \"LiveBench\"\n  },\n  {\n    \"model_benchmark_id\": 1124,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.555,\n    \"normalized_score\": 0.555,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"LiveCodeBench benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.346315+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.346315+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 406,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.831,\n    \"normalized_score\": 0.831,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"MATH benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.865721+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.865721+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1187,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.882,\n    \"normalized_score\": 0.882,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MBPP benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.503069+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.503069+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 196,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.711,\n    \"normalized_score\": 0.711,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MMLU-Pro benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.475182+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.475182+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 733,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.868,\n    \"normalized_score\": 0.868,\n    \"is_self_reported\": 
true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MMLU-redux benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.542364+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.542364+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 1606,\n    \"benchmark_id\": \"mt-bench\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.935,\n    \"normalized_score\": 0.935,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MT-bench benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.521232+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.521232+00:00\",\n    \"benchmark_name\": \"MT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 644,\n    \"benchmark_id\": \"multipl-e\",\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"score\": 0.751,\n    \"normalized_score\": 0.751,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MultiPL-E benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.322800+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.322800+00:00\",\n    \"benchmark_name\": \"MultiPL-E\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen-2.5-72b-instruct/model.json",
    "content": "{\n  \"model_id\": \"qwen-2.5-72b-instruct\",\n  \"name\": \"Qwen2.5 72B Instruct\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen2.5-72B-Instruct is an instruction-tuned 72-billion-parameter language model, part of the Qwen2.5 series. It is designed to follow instructions, generate long texts (over 8K tokens), understand structured data (e.g., tables), and generate structured outputs, especially JSON. The model supports over 29 languages.\",\n  \"release_date\": \"2024-09-19\",\n  \"announcement_date\": \"2024-09-19\",\n  \"license_id\": \"qwen\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 72700000000,\n  \"training_tokens\": 18000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.alibabacloud.com/help/en/model-studio/developer-reference/use-qwen-by-calling-api\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen2.5/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2.5\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2.5-72B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.627855+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.627855+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen-2.5-7b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1618,\n    \"benchmark_id\": \"alignbench\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.733,\n    \"normalized_score\": 0.733,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"AlignBench v1.1 benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.548680+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.548680+00:00\",\n    \"benchmark_name\": \"AlignBench\"\n  },\n  {\n    \"model_benchmark_id\": 1455,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.52,\n    \"normalized_score\": 0.52,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Arena Hard benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.100766+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.100766+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 306,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.364,\n    \"normalized_score\": 0.364,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPQA benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:11.685965+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.685965+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 998,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.916,\n    \"normalized_score\": 0.916,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GSM8K benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.088027+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.088027+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 789,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.848,\n    \"normalized_score\": 0.848,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"HumanEval benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.651744+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.651744+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 621,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.712,\n    \"normalized_score\": 0.712,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"IFEval strict-prompt benchmark evaluation\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.278867+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.278867+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 753,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.359,\n    \"normalized_score\": 0.359,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"LiveBench 0831 benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.584018+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.584018+00:00\",\n    \"benchmark_name\": \"LiveBench\"\n  },\n  {\n    \"model_benchmark_id\": 1126,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.287,\n    \"normalized_score\": 0.287,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"LiveCodeBench 2305-2409 benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.352497+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.352497+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 408,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.755,\n    \"normalized_score\": 0.755,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"MATH benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.869960+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.869960+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1189,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.792,\n    \"normalized_score\": 0.792,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MBPP benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.506947+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.506947+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 198,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.563,\n    \"normalized_score\": 0.563,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MMLU-Pro benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.479104+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.479104+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 735,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.754,\n    \"normalized_score\": 0.754,\n    \"is_self_reported\": 
true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MMLU-redux benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.545338+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.545338+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 1607,\n    \"benchmark_id\": \"mt-bench\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.875,\n    \"normalized_score\": 0.875,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MT-bench benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.523567+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.523567+00:00\",\n    \"benchmark_name\": \"MT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 646,\n    \"benchmark_id\": \"multipl-e\",\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"score\": 0.704,\n    \"normalized_score\": 0.704,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"MultiPL-E benchmark evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.325846+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.325846+00:00\",\n    \"benchmark_name\": \"MultiPL-E\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen-2.5-7b-instruct/model.json",
    "content": "{\n  \"model_id\": \"qwen-2.5-7b-instruct\",\n  \"name\": \"Qwen2.5 7B Instruct\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen2.5-7B-Instruct is an instruction-tuned 7B parameter language model that excels at following instructions, generating long texts (over 8K tokens), understanding structured data, and generating structured outputs like JSON. The model features enhanced capabilities in mathematics, coding, and multilingual support across 29+ languages including Chinese, English, French, Spanish, and more.\",\n  \"release_date\": \"2024-09-19\",\n  \"announcement_date\": \"2024-09-19\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 7610000000,\n  \"training_tokens\": 18000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.alibabacloud.com/help/en/model-studio/developer-reference/use-qwen-by-calling-api\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/abs/2407.10671\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen2.5-llm/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2.5\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2.5-7B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.642960+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.642960+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen-2.5-coder-32b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 19,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.705,\n    \"normalized_score\": 0.705,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.123905+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.123905+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1603,\n    \"benchmark_id\": \"bigcodebench-full\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.496,\n    \"normalized_score\": 0.496,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.511653+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.511653+00:00\",\n    \"benchmark_name\": \"BigCodeBench-Full\"\n  },\n  {\n    \"model_benchmark_id\": 1604,\n    \"benchmark_id\": \"bigcodebench-hard\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.27,\n    \"normalized_score\": 0.27,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.515099+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:14.515099+00:00\",\n    \"benchmark_name\": \"BigCodeBench-Hard\"\n  },\n  {\n    \"model_benchmark_id\": 991,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.911,\n    \"normalized_score\": 0.911,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.076453+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.076453+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 46,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.83,\n    \"normalized_score\": 0.83,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.180700+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.180700+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 783,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.927,\n    \"normalized_score\": 0.927,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.641672+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.641672+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1117,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.314,\n    \"normalized_score\": 0.314,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.329968+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.329968+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 401,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.572,\n    \"normalized_score\": 0.572,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.857514+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.857514+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1182,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.902,\n    \"normalized_score\": 0.902,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.491369+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.491369+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 86,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.751,\n    \"normalized_score\": 0.751,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.263438+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.263438+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 191,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.504,\n    \"normalized_score\": 0.504,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.466410+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.466410+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 729,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.775,\n    \"normalized_score\": 0.775,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.535302+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.535302+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 1594,\n    \"benchmark_id\": \"theoremqa\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.431,\n    \"normalized_score\": 0.431,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.485084+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.485084+00:00\",\n    \"benchmark_name\": \"TheoremQA\"\n  },\n  {\n    \"model_benchmark_id\": 136,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.542,\n    \"normalized_score\": 0.542,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.351250+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.351250+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 1064,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"score\": 0.808,\n    \"normalized_score\": 0.808,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.219435+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.219435+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen-2.5-coder-32b-instruct/model.json",
    "content": "{\n  \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n  \"name\": \"Qwen2.5-Coder 32B Instruct\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": \"qwen-2.5-32b-instruct\",\n  \"description\": \"Qwen2.5-Coder is a specialized coding model trained on 5.5 trillion tokens of code data, supporting 92 programming languages with a 128K context window. It excels in code generation, completion, repair, and multi-programming tasks while maintaining strong performance in mathematics and general capabilities.\",\n  \"release_date\": \"2024-09-19\",\n  \"announcement_date\": \"2024-09-19\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 32000000000,\n  \"training_tokens\": 5500000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.alibabacloud.com/help/en/model-studio/developer-reference/use-qwen-by-calling-api\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/abs/2409.12186\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen2.5-coder/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2.5-Coder\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2.5-Coder-32B\",\n  \"created_at\": \"2025-07-19T19:49:05.882455+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.882455+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen-2.5-coder-7b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1624,\n    \"benchmark_id\": \"aider\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.556,\n    \"normalized_score\": 0.556,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.569369+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.569369+00:00\",\n    \"benchmark_name\": \"Aider\"\n  },\n  {\n    \"model_benchmark_id\": 20,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.609,\n    \"normalized_score\": 0.609,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.126002+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.126002+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 1434,\n    \"benchmark_id\": \"bigcodebench\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.41,\n    \"normalized_score\": 0.41,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.052666+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:14.052666+00:00\",\n    \"benchmark_name\": \"BigCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 1620,\n    \"benchmark_id\": \"cruxeval-input-cot\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.565,\n    \"normalized_score\": 0.565,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.554528+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.554528+00:00\",\n    \"benchmark_name\": \"CRUXEval-Input-CoT\"\n  },\n  {\n    \"model_benchmark_id\": 1621,\n    \"benchmark_id\": \"cruxeval-output-cot\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.56,\n    \"normalized_score\": 0.56,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.558251+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.558251+00:00\",\n    \"benchmark_name\": \"CRUXEval-Output-CoT\"\n  },\n  {\n    \"model_benchmark_id\": 993,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.839,\n    \"normalized_score\": 0.839,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:13.080381+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.080381+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 47,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.768,\n    \"normalized_score\": 0.768,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.182466+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.182466+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 785,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.884,\n    \"normalized_score\": 0.884,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.644936+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.644936+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1121,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.182,\n    \"normalized_score\": 0.182,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.340042+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.340042+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 403,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.466,\n    \"normalized_score\": 0.466,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.860821+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.860821+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1184,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.835,\n    \"normalized_score\": 0.835,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.495284+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.495284+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 88,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.676,\n    \"normalized_score\": 0.676,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.267319+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.267319+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1623,\n    \"benchmark_id\": \"mmlu-base\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.68,\n    \"normalized_score\": 0.68,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.565292+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.565292+00:00\",\n    \"benchmark_name\": \"MMLU-Base\"\n  },\n  {\n    \"model_benchmark_id\": 193,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.401,\n    \"normalized_score\": 0.401,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.469384+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.469384+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 730,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.666,\n    \"normalized_score\": 0.666,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.537049+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.537049+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 1622,\n    \"benchmark_id\": \"stem\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.34,\n    \"normalized_score\": 0.34,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.561469+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.561469+00:00\",\n    \"benchmark_name\": \"STEM\"\n  },\n  {\n    \"model_benchmark_id\": 1596,\n    \"benchmark_id\": \"theoremqa\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.34,\n    \"normalized_score\": 0.34,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.489921+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.489921+00:00\",\n    \"benchmark_name\": \"TheoremQA\"\n  },\n  {\n    \"model_benchmark_id\": 137,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.506,\n    \"normalized_score\": 0.506,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.353301+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.353301+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 1065,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n    \"score\": 0.729,\n    \"normalized_score\": 0.729,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://arxiv.org/abs/2409.12186\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.221874+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.221874+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen-2.5-coder-7b-instruct/model.json",
    "content": "{\n  \"model_id\": \"qwen-2.5-coder-7b-instruct\",\n  \"name\": \"Qwen2.5-Coder 7B Instruct\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": \"qwen-2.5-7b-instruct\",\n  \"description\": \"Qwen2.5-Coder is a specialized coding model trained on 5.5 trillion tokens of code data, supporting 92 programming languages with a 128K context window. It excels in code generation, completion, and repair while maintaining strong performance in math and general tasks. The model demonstrates exceptional capabilities in multi-programming language tasks and code reasoning.\",\n  \"release_date\": \"2024-09-19\",\n  \"announcement_date\": \"2024-09-19\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 7000000000,\n  \"training_tokens\": 5500000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://www.alibabacloud.com/help/en/model-studio/developer-reference/use-qwen-by-calling-api\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/abs/2409.12186\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen2.5-coder/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2.5-7B-Coder\",\n  \"created_at\": \"2025-07-19T19:49:05.890300+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.890300+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen2-72b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 22,\n    \"benchmark_id\": \"arc-c\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.689,\n    \"normalized_score\": 0.689,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.129146+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.129146+00:00\",\n    \"benchmark_name\": \"ARC-C\"\n  },\n  {\n    \"model_benchmark_id\": 973,\n    \"benchmark_id\": \"bbh\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.824,\n    \"normalized_score\": 0.824,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.045120+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.045120+00:00\",\n    \"benchmark_name\": \"BBH\"\n  },\n  {\n    \"model_benchmark_id\": 437,\n    \"benchmark_id\": \"c-eval\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.838,\n    \"normalized_score\": 0.838,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.926225+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.926225+00:00\",\n    
\"benchmark_name\": \"C-Eval\"\n  },\n  {\n    \"model_benchmark_id\": 1749,\n    \"benchmark_id\": \"cmmlu\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.901,\n    \"normalized_score\": 0.901,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.943893+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.943893+00:00\",\n    \"benchmark_name\": \"CMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 372,\n    \"benchmark_id\": \"evalplus\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.79,\n    \"normalized_score\": 0.79,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.802955+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.802955+00:00\",\n    \"benchmark_name\": \"EvalPlus\"\n  },\n  {\n    \"model_benchmark_id\": 307,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.424,\n    \"normalized_score\": 0.424,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.687633+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.687633+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 999,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.911,\n    \"normalized_score\": 0.911,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.089706+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.089706+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 48,\n    \"benchmark_id\": \"hellaswag\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.876,\n    \"normalized_score\": 0.876,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.184833+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.184833+00:00\",\n    \"benchmark_name\": \"HellaSwag\"\n  },\n  {\n    \"model_benchmark_id\": 790,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.86,\n    \"normalized_score\": 0.86,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.653267+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:12.653267+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 409,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.597,\n    \"normalized_score\": 0.597,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.871582+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.871582+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1190,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.802,\n    \"normalized_score\": 0.802,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.508406+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.508406+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 91,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.823,\n    \"normalized_score\": 0.823,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:11.272629+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.272629+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 199,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.644,\n    \"normalized_score\": 0.644,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.480879+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.480879+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 647,\n    \"benchmark_id\": \"multipl-e\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.692,\n    \"normalized_score\": 0.692,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.327331+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.327331+00:00\",\n    \"benchmark_name\": \"MultiPL-E\"\n  },\n  {\n    \"model_benchmark_id\": 1598,\n    \"benchmark_id\": \"theoremqa\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.444,\n    \"normalized_score\": 0.444,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:14.494165+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.494165+00:00\",\n    \"benchmark_name\": \"TheoremQA\"\n  },\n  {\n    \"model_benchmark_id\": 139,\n    \"benchmark_id\": \"truthfulqa\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.548,\n    \"normalized_score\": 0.548,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.356602+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.356602+00:00\",\n    \"benchmark_name\": \"TruthfulQA\"\n  },\n  {\n    \"model_benchmark_id\": 151,\n    \"benchmark_id\": \"winogrande\",\n    \"model_id\": \"qwen2-72b-instruct\",\n    \"score\": 0.851,\n    \"normalized_score\": 0.851,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.386216+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.386216+00:00\",\n    \"benchmark_name\": \"Winogrande\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen2-72b-instruct/model.json",
    "content": "{\n  \"model_id\": \"qwen2-72b-instruct\",\n  \"name\": \"Qwen2 72B Instruct\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen2-72B-Instruct is an instruction-tuned language model with 72 billion parameters, supporting a context length of up to 131,072 tokens. It's part of the new Qwen2 series, which has surpassed most open-source models and demonstrates competitiveness against proprietary models across various benchmarks.\",\n  \"release_date\": \"2024-07-23\",\n  \"announcement_date\": \"2024-07-23\",\n  \"license_id\": \"tongyi_qianwen\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 72000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n  \"source_playground\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n  \"source_paper\": \"https://arxiv.org/abs/2309.00071\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen2/\",\n  \"source_repo_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2-72B\",\n  \"created_at\": \"2025-07-19T19:49:05.650844+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.650844+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen2-7b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1616,\n    \"benchmark_id\": \"alignbench\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.721,\n    \"normalized_score\": 0.721,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.544441+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.544441+00:00\",\n    \"benchmark_name\": \"AlignBench\"\n  },\n  {\n    \"model_benchmark_id\": 436,\n    \"benchmark_id\": \"c-eval\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.772,\n    \"normalized_score\": 0.772,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.924104+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.924104+00:00\",\n    \"benchmark_name\": \"C-Eval\"\n  },\n  {\n    \"model_benchmark_id\": 370,\n    \"benchmark_id\": \"evalplus\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.703,\n    \"normalized_score\": 0.703,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.799094+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.799094+00:00\",\n    \"benchmark_name\": \"EvalPlus\"\n  },\n  {\n    \"model_benchmark_id\": 299,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.253,\n    \"normalized_score\": 0.253,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.674412+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.674412+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 992,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.823,\n    \"normalized_score\": 0.823,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.078833+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.078833+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 784,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.799,\n    \"normalized_score\": 0.799,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.643272+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.643272+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1119,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.266,\n    \"normalized_score\": 0.266,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.335377+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.335377+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 402,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.496,\n    \"normalized_score\": 0.496,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.859120+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.859120+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1183,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.672,\n    \"normalized_score\": 0.672,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.493272+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.493272+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 87,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.705,\n    \"normalized_score\": 0.705,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.265352+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.265352+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 192,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.441,\n    \"normalized_score\": 0.441,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.467957+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.467957+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1605,\n    \"benchmark_id\": \"mt-bench\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.841,\n    \"normalized_score\": 0.841,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.519120+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.519120+00:00\",\n    \"benchmark_name\": \"MT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 641,\n    \"benchmark_id\": \"multipl-e\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.591,\n    \"normalized_score\": 0.591,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.317803+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.317803+00:00\",\n    \"benchmark_name\": \"MultiPL-E\"\n  },\n  {\n    \"model_benchmark_id\": 1595,\n    \"benchmark_id\": \"theoremqa\",\n    \"model_id\": \"qwen2-7b-instruct\",\n    \"score\": 0.253,\n    \"normalized_score\": 0.253,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.487702+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.487702+00:00\",\n    \"benchmark_name\": \"TheoremQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen2-7b-instruct/model.json",
    "content": "{\n  \"model_id\": \"qwen2-7b-instruct\",\n  \"name\": \"Qwen2 7B Instruct\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen2-7B-Instruct is an instruction-tuned language model with 7 billion parameters, supporting a context length of up to 131,072 tokens.\",\n  \"release_date\": \"2024-07-23\",\n  \"announcement_date\": \"2024-07-23\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 7620000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n  \"source_playground\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n  \"source_paper\": \"https://arxiv.org/abs/2309.00071\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2-7B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.612662+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.612662+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen2-vl-72b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 864,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.883,\n    \"normalized_score\": 0.883,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.806635+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.806635+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 1629,\n    \"benchmark_id\": \"docvqatest\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.965,\n    \"normalized_score\": 0.965,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.582058+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.582058+00:00\",\n    \"benchmark_name\": \"DocVQAtest\"\n  },\n  {\n    \"model_benchmark_id\": 923,\n    \"benchmark_id\": \"egoschema\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.779,\n    \"normalized_score\": 0.779,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.928297+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.928297+00:00\",\n    \"benchmark_name\": 
\"EgoSchema\"\n  },\n  {\n    \"model_benchmark_id\": 1630,\n    \"benchmark_id\": \"infovqatest\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.845,\n    \"normalized_score\": 0.845,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.586477+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.586477+00:00\",\n    \"benchmark_name\": \"InfoVQAtest\"\n  },\n  {\n    \"model_benchmark_id\": 1269,\n    \"benchmark_id\": \"mathvista-mini\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.705,\n    \"normalized_score\": 0.705,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.662750+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.662750+00:00\",\n    \"benchmark_name\": \"MathVista-Mini\"\n  },\n  {\n    \"model_benchmark_id\": 1639,\n    \"benchmark_id\": \"mmbench-test\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.865,\n    \"normalized_score\": 0.865,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.610292+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.610292+00:00\",\n    
\"benchmark_name\": \"MMBench_test\"\n  },\n  {\n    \"model_benchmark_id\": 1532,\n    \"benchmark_id\": \"mmmu-pro\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.462,\n    \"normalized_score\": 0.462,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.292395+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.292395+00:00\",\n    \"benchmark_name\": \"MMMU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1628,\n    \"benchmark_id\": \"mmmuval\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.645,\n    \"normalized_score\": 0.645,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.578458+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.578458+00:00\",\n    \"benchmark_name\": \"MMMUval\"\n  },\n  {\n    \"model_benchmark_id\": 1640,\n    \"benchmark_id\": \"mmvetgpt4turbo\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.74,\n    \"normalized_score\": 0.74,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.613913+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.613913+00:00\",\n    
\"benchmark_name\": \"MMVetGPT4Turbo\"\n  },\n  {\n    \"model_benchmark_id\": 1631,\n    \"benchmark_id\": \"mtvqa\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.309,\n    \"normalized_score\": 0.309,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.590936+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.590936+00:00\",\n    \"benchmark_name\": \"MTVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1641,\n    \"benchmark_id\": \"mvbench\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.736,\n    \"normalized_score\": 0.736,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.618622+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.618622+00:00\",\n    \"benchmark_name\": \"MVBench\"\n  },\n  {\n    \"model_benchmark_id\": 1539,\n    \"benchmark_id\": \"ocrbench\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.877,\n    \"normalized_score\": 0.877,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.311748+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.311748+00:00\",\n    
\"benchmark_name\": \"OCRBench\"\n  },\n  {\n    \"model_benchmark_id\": 1633,\n    \"benchmark_id\": \"realworldqa\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.778,\n    \"normalized_score\": 0.778,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.597450+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.597450+00:00\",\n    \"benchmark_name\": \"RealWorldQA\"\n  },\n  {\n    \"model_benchmark_id\": 909,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.855,\n    \"normalized_score\": 0.855,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.894922+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.894922+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1632,\n    \"benchmark_id\": \"vcr-en-easy\",\n    \"model_id\": \"qwen2-vl-72b\",\n    \"score\": 0.9193,\n    \"normalized_score\": 0.9193,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.594379+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.594379+00:00\",\n    
\"benchmark_name\": \"VCR_en_easy\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen2-vl-72b/model.json",
    "content": "{\n  \"model_id\": \"qwen2-vl-72b\",\n  \"name\": \"Qwen2-VL-72B-Instruct\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"An instruction-tuned, large multimodal model that excels at visual understanding and step-by-step reasoning. It supports image and video input, with dynamic resolution handling and improved positional embeddings (M-ROPE), enabling advanced capabilities such as complex problem solving, multilingual text recognition in images, and agent-like interactions in video contexts.\",\n  \"release_date\": \"2024-08-29\",\n  \"announcement_date\": \"2024-08-29\",\n  \"license_id\": \"tongyi_qianwen\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2023-06-30\",\n  \"param_count\": 73400000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://arxiv.org/abs/2409.12191\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen2-vl/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2-VL\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.619575+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.619575+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen2.5-omni-7b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1254,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.832,\n    \"normalized_score\": 0.832,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.633399+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.633399+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 866,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.853,\n    \"normalized_score\": 0.853,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.809953+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.809953+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 1718,\n    \"benchmark_id\": \"common-voice-15\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.076,\n    \"normalized_score\": 0.076,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"WER\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.833534+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.833534+00:00\",\n    
\"benchmark_name\": \"Common Voice 15\"\n  },\n  {\n    \"model_benchmark_id\": 1717,\n    \"benchmark_id\": \"covost2-en-zh\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.414,\n    \"normalized_score\": 0.414,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"BLEU\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.828460+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.828460+00:00\",\n    \"benchmark_name\": \"CoVoST2 en-zh\"\n  },\n  {\n    \"model_benchmark_id\": 1719,\n    \"benchmark_id\": \"crperelation\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.765,\n    \"normalized_score\": 0.765,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.837425+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.837425+00:00\",\n    \"benchmark_name\": \"CRPErelation\"\n  },\n  {\n    \"model_benchmark_id\": 887,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.952,\n    \"normalized_score\": 0.952,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.846061+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.846061+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 924,\n    \"benchmark_id\": \"egoschema\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.686,\n    \"normalized_score\": 0.686,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.931056+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.931056+00:00\",\n    \"benchmark_name\": \"EgoSchema\"\n  },\n  {\n    \"model_benchmark_id\": 1401,\n    \"benchmark_id\": \"fleurs\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.041,\n    \"normalized_score\": 0.041,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"WER\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.953081+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.953081+00:00\",\n    \"benchmark_name\": \"FLEURS\"\n  },\n  {\n    \"model_benchmark_id\": 1720,\n    \"benchmark_id\": \"giantsteps-tempo\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.88,\n    \"normalized_score\": 0.88,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.841583+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:14.841583+00:00\",\n    \"benchmark_name\": \"GiantSteps Tempo\"\n  },\n  {\n    \"model_benchmark_id\": 305,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.308,\n    \"normalized_score\": 0.308,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.684328+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.684328+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 997,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.887,\n    \"normalized_score\": 0.887,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.086524+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.086524+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 788,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.787,\n    \"normalized_score\": 0.787,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:12.650243+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.650243+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 752,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.296,\n    \"normalized_score\": 0.296,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.581448+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.581448+00:00\",\n    \"benchmark_name\": \"LiveBench\"\n  },\n  {\n    \"model_benchmark_id\": 407,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.715,\n    \"normalized_score\": 0.715,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.867189+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.867189+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1676,\n    \"benchmark_id\": \"mathvision\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.25,\n    \"normalized_score\": 0.25,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:14.702750+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.702750+00:00\",\n    \"benchmark_name\": \"MathVision\"\n  },\n  {\n    \"model_benchmark_id\": 527,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.679,\n    \"normalized_score\": 0.679,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.094090+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.094090+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 1188,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.732,\n    \"normalized_score\": 0.732,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.504920+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.504920+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1721,\n    \"benchmark_id\": \"meld\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.57,\n    \"normalized_score\": 0.57,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:14.845437+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.845437+00:00\",\n    \"benchmark_name\": \"Meld\"\n  },\n  {\n    \"model_benchmark_id\": 1722,\n    \"benchmark_id\": \"mmau\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.656,\n    \"normalized_score\": 0.656,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.849392+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.849392+00:00\",\n    \"benchmark_name\": \"MMAU\"\n  },\n  {\n    \"model_benchmark_id\": 1723,\n    \"benchmark_id\": \"mmau-music\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.6916,\n    \"normalized_score\": 0.6916,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.854098+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.854098+00:00\",\n    \"benchmark_name\": \"MMAU Music\"\n  },\n  {\n    \"model_benchmark_id\": 1724,\n    \"benchmark_id\": \"mmau-sound\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.6787,\n    \"normalized_score\": 0.6787,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.862523+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.862523+00:00\",\n    \"benchmark_name\": \"MMAU Sound\"\n  },\n  {\n    \"model_benchmark_id\": 1725,\n    \"benchmark_id\": \"mmau-speech\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.5976,\n    \"normalized_score\": 0.5976,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.867393+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.867393+00:00\",\n    \"benchmark_name\": \"MMAU Speech\"\n  },\n  {\n    \"model_benchmark_id\": 1726,\n    \"benchmark_id\": \"mmbench-v1.1\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.818,\n    \"normalized_score\": 0.818,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.871500+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.871500+00:00\",\n    \"benchmark_name\": \"MMBench-V1.1\"\n  },\n  {\n    \"model_benchmark_id\": 1730,\n    \"benchmark_id\": \"mme-realworld\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.616,\n    \"normalized_score\": 0.616,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": 
null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.879804+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.879804+00:00\",\n    \"benchmark_name\": \"MME-RealWorld\"\n  },\n  {\n    \"model_benchmark_id\": 197,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.47,\n    \"normalized_score\": 0.47,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.477278+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.477278+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 734,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.71,\n    \"normalized_score\": 0.71,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.544013+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.544013+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 1731,\n    \"benchmark_id\": \"mm-mt-bench\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.06,\n    \"normalized_score\": 0.06,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    
\"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.883880+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.883880+00:00\",\n    \"benchmark_name\": \"MM-MT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 571,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.592,\n    \"normalized_score\": 0.592,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.175251+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.175251+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1534,\n    \"benchmark_id\": \"mmmu-pro\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.366,\n    \"normalized_score\": 0.366,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.296124+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.296124+00:00\",\n    \"benchmark_name\": \"MMMU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1660,\n    \"benchmark_id\": \"mmstar\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.64,\n    \"normalized_score\": 0.64,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": 
null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.664551+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.664551+00:00\",\n    \"benchmark_name\": \"MMStar\"\n  },\n  {\n    \"model_benchmark_id\": 1734,\n    \"benchmark_id\": \"muirbench\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.592,\n    \"normalized_score\": 0.592,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.891075+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.891075+00:00\",\n    \"benchmark_name\": \"MuirBench\"\n  },\n  {\n    \"model_benchmark_id\": 645,\n    \"benchmark_id\": \"multipl-e\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.658,\n    \"normalized_score\": 0.658,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.324318+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.324318+00:00\",\n    \"benchmark_name\": \"MultiPL-E\"\n  },\n  {\n    \"model_benchmark_id\": 1735,\n    \"benchmark_id\": \"musiccaps\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.328,\n    \"normalized_score\": 0.328,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.894342+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.894342+00:00\",\n    \"benchmark_name\": \"MusicCaps\"\n  },\n  {\n    \"model_benchmark_id\": 1643,\n    \"benchmark_id\": \"mvbench\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.703,\n    \"normalized_score\": 0.703,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.621841+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.621841+00:00\",\n    \"benchmark_name\": \"MVBench\"\n  },\n  {\n    \"model_benchmark_id\": 1736,\n    \"benchmark_id\": \"nmos\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.0451,\n    \"normalized_score\": 0.0451,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen2.5-omni/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"NMOS\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.897653+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.897653+00:00\",\n    \"benchmark_name\": \"NMOS\"\n  },\n  {\n    \"model_benchmark_id\": 1737,\n    \"benchmark_id\": \"ocrbench-v2\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.578,\n    \"normalized_score\": 0.578,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": 
\"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.901546+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.901546+00:00\",\n    \"benchmark_name\": \"OCRBench_V2\"\n  },\n  {\n    \"model_benchmark_id\": 1738,\n    \"benchmark_id\": \"odinw\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.424,\n    \"normalized_score\": 0.424,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.905294+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.905294+00:00\",\n    \"benchmark_name\": \"ODinW\"\n  },\n  {\n    \"model_benchmark_id\": 1739,\n    \"benchmark_id\": \"omnibench\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.5613,\n    \"normalized_score\": 0.5613,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.909979+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.909979+00:00\",\n    \"benchmark_name\": \"OmniBench\"\n  },\n  {\n    \"model_benchmark_id\": 1740,\n    \"benchmark_id\": \"omnibench-music\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.5283,\n    \"normalized_score\": 0.5283,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.913742+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.913742+00:00\",\n    \"benchmark_name\": \"OmniBench Music\"\n  },\n  {\n    \"model_benchmark_id\": 1741,\n    \"benchmark_id\": \"pointgrounding\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.665,\n    \"normalized_score\": 0.665,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.918183+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.918183+00:00\",\n    \"benchmark_name\": \"PointGrounding\"\n  },\n  {\n    \"model_benchmark_id\": 1634,\n    \"benchmark_id\": \"realworldqa\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.703,\n    \"normalized_score\": 0.703,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.599392+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.599392+00:00\",\n    \"benchmark_name\": \"RealWorldQA\"\n  },\n  {\n    \"model_benchmark_id\": 911,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.844,\n    \"normalized_score\": 0.844,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.899579+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.899579+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1685,\n    \"benchmark_id\": \"videomme-w-sub.\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.724,\n    \"normalized_score\": 0.724,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.727965+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.727965+00:00\",\n    \"benchmark_name\": \"VideoMME w sub.\"\n  },\n  {\n    \"model_benchmark_id\": 1742,\n    \"benchmark_id\": \"vocalsound\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.939,\n    \"normalized_score\": 0.939,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.921505+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.921505+00:00\",\n    \"benchmark_name\": \"VocalSound\"\n  },\n  {\n    \"model_benchmark_id\": 1743,\n    \"benchmark_id\": \"voicebench-avg\",\n    \"model_id\": \"qwen2.5-omni-7b\",\n    \"score\": 0.7412,\n    \"normalized_score\": 0.7412,\n    \"is_self_reported\": 
true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.925208+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.925208+00:00\",\n    \"benchmark_name\": \"VoiceBench Avg\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen2.5-omni-7b/model.json",
    "content": "{\n  \"model_id\": \"qwen2.5-omni-7b\",\n  \"name\": \"Qwen2.5-Omni-7B\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen2.5-Omni is the flagship end-to-end multimodal model in the Qwen series. It processes diverse inputs including text, images, audio, and video, delivering real-time streaming responses through text generation and natural speech synthesis using a novel Thinker-Talker architecture.\",\n  \"release_date\": \"2025-03-27\",\n  \"announcement_date\": \"2025-03-27\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 7000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://chat.qwen.ai/\",\n  \"source_paper\": \"https://arxiv.org/pdf/2503.20215\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen2.5-omni/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2.5-Omni\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2.5-Omni-7B\",\n  \"created_at\": \"2025-07-19T19:49:05.639433+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.639433+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen2.5-vl-32b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1704,\n    \"benchmark_id\": \"aitz-em\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.831,\n    \"normalized_score\": 0.831,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.791493+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.791493+00:00\",\n    \"benchmark_name\": \"AITZ_EM\"\n  },\n  {\n    \"model_benchmark_id\": 1707,\n    \"benchmark_id\": \"android-control-high-em\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.696,\n    \"normalized_score\": 0.696,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.798431+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.798431+00:00\",\n    \"benchmark_name\": \"Android Control High_EM\"\n  },\n  {\n    \"model_benchmark_id\": 1710,\n    \"benchmark_id\": \"android-control-low-em\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.933,\n    \"normalized_score\": 0.933,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:14.807428+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.807428+00:00\",\n    \"benchmark_name\": \"Android Control Low_EM\"\n  },\n  {\n    \"model_benchmark_id\": 1713,\n    \"benchmark_id\": \"androidworld-sr\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.22,\n    \"normalized_score\": 0.22,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"SR\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.815734+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.815734+00:00\",\n    \"benchmark_name\": \"AndroidWorld_SR\"\n  },\n  {\n    \"model_benchmark_id\": 1658,\n    \"benchmark_id\": \"cc-ocr\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.771,\n    \"normalized_score\": 0.771,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.659496+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.659496+00:00\",\n    \"benchmark_name\": \"CC-OCR\"\n  },\n  {\n    \"model_benchmark_id\": 1695,\n    \"benchmark_id\": \"charadessta\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.542,\n    \"normalized_score\": 0.542,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.765807+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.765807+00:00\",\n    \"benchmark_name\": \"CharadesSTA\"\n  },\n  {\n    \"model_benchmark_id\": 889,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.948,\n    \"normalized_score\": 0.948,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.850117+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.850117+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1751,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.46,\n    \"normalized_score\": 0.46,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.953480+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.953480+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 791,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.915,\n    \"normalized_score\": 0.915,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n  
  \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.655022+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.655022+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 1243,\n    \"benchmark_id\": \"infovqa\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.834,\n    \"normalized_score\": 0.834,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.612560+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.612560+00:00\",\n    \"benchmark_name\": \"InfoVQA\"\n  },\n  {\n    \"model_benchmark_id\": 830,\n    \"benchmark_id\": \"lvbench\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.49,\n    \"normalized_score\": 0.49,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.733525+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.733525+00:00\",\n    \"benchmark_name\": \"LVBench\"\n  },\n  {\n    \"model_benchmark_id\": 410,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.822,\n    \"normalized_score\": 0.822,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n  
  \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.873375+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.873375+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1678,\n    \"benchmark_id\": \"mathvision\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.384,\n    \"normalized_score\": 0.384,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-VL\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.707439+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.707439+00:00\",\n    \"benchmark_name\": \"MathVision\"\n  },\n  {\n    \"model_benchmark_id\": 1272,\n    \"benchmark_id\": \"mathvista-mini\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.747,\n    \"normalized_score\": 0.747,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.668155+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.668155+00:00\",\n    \"benchmark_name\": \"MathVista-Mini\"\n  },\n  {\n    \"model_benchmark_id\": 1191,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.84,\n    \"normalized_score\": 0.84,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.509907+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.509907+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1690,\n    \"benchmark_id\": \"mmbench-video\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.0193,\n    \"normalized_score\": 0.0193,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-VL\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.747059+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.747059+00:00\",\n    \"benchmark_name\": \"MMBench-Video\"\n  },\n  {\n    \"model_benchmark_id\": 92,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.784,\n    \"normalized_score\": 0.784,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.274441+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.274441+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 200,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.688,\n    \"normalized_score\": 0.688,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.482355+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.482355+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 573,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.7,\n    \"normalized_score\": 0.7,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.179390+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.179390+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1536,\n    \"benchmark_id\": \"mmmu-pro\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.495,\n    \"normalized_score\": 0.495,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.299391+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.299391+00:00\",\n    \"benchmark_name\": \"MMMU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1662,\n    \"benchmark_id\": \"mmstar\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.695,\n    \"normalized_score\": 0.695,\n    \"is_self_reported\": 
true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.668445+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.668445+00:00\",\n    \"benchmark_name\": \"MMStar\"\n  },\n  {\n    \"model_benchmark_id\": 1745,\n    \"benchmark_id\": \"ocrbench-v2-(en)\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.572,\n    \"normalized_score\": 0.572,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.930331+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.930331+00:00\",\n    \"benchmark_name\": \"OCRBench-V2 (en)\"\n  },\n  {\n    \"model_benchmark_id\": 1750,\n    \"benchmark_id\": \"ocrbench-v2-(zh)\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.591,\n    \"normalized_score\": 0.591,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.947420+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.947420+00:00\",\n    \"benchmark_name\": \"OCRBench-V2 (zh)\"\n  },\n  {\n    \"model_benchmark_id\": 1748,\n    \"benchmark_id\": \"osworld\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n  
  \"score\": 0.0592,\n    \"normalized_score\": 0.0592,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.939263+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.939263+00:00\",\n    \"benchmark_name\": \"OSWorld\"\n  },\n  {\n    \"model_benchmark_id\": 1698,\n    \"benchmark_id\": \"screenspot\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.885,\n    \"normalized_score\": 0.885,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.775538+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.775538+00:00\",\n    \"benchmark_name\": \"ScreenSpot\"\n  },\n  {\n    \"model_benchmark_id\": 1701,\n    \"benchmark_id\": \"screenspot-pro\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.394,\n    \"normalized_score\": 0.394,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.783897+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.783897+00:00\",\n    \"benchmark_name\": \"ScreenSpot Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1683,\n    
\"benchmark_id\": \"videomme-w-o-sub.\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.705,\n    \"normalized_score\": 0.705,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.722056+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.722056+00:00\",\n    \"benchmark_name\": \"VideoMME w/o sub.\"\n  },\n  {\n    \"model_benchmark_id\": 1686,\n    \"benchmark_id\": \"videomme-w-sub.\",\n    \"model_id\": \"qwen2.5-vl-32b\",\n    \"score\": 0.779,\n    \"normalized_score\": 0.779,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.729388+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.729388+00:00\",\n    \"benchmark_name\": \"VideoMME w sub.\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen2.5-vl-32b/model.json",
    "content": "{\n  \"model_id\": \"qwen2.5-vl-32b\",\n  \"name\": \"Qwen2.5 VL 32B Instruct\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen2.5-VL is a vision-language model from the Qwen family. Key enhancements include visual understanding (objects, text, charts, layouts), visual agent capabilities (tool use, computer/phone control), long video comprehension with event pinpointing, visual localization (bounding boxes/points), and structured output generation.\",\n  \"release_date\": \"2025-02-28\",\n  \"announcement_date\": \"2025-02-28\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 33500000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://chat.qwen.ai/\",\n  \"source_paper\": \"https://arxiv.org/pdf/2502.13923\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen2.5-vl/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2.5-VL\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.653921+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.653921+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen2.5-vl-72b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1255,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.884,\n    \"normalized_score\": 0.884,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.635049+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.635049+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 1703,\n    \"benchmark_id\": \"aitz-em\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.832,\n    \"normalized_score\": 0.832,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.789425+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.789425+00:00\",\n    \"benchmark_name\": \"AITZ_EM\"\n  },\n  {\n    \"model_benchmark_id\": 1706,\n    \"benchmark_id\": \"android-control-high-em\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.6736,\n    \"normalized_score\": 0.6736,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.796411+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:14.796411+00:00\",\n    \"benchmark_name\": \"Android Control High_EM\"\n  },\n  {\n    \"model_benchmark_id\": 1709,\n    \"benchmark_id\": \"android-control-low-em\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.937,\n    \"normalized_score\": 0.937,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.805303+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.805303+00:00\",\n    \"benchmark_name\": \"Android Control Low_EM\"\n  },\n  {\n    \"model_benchmark_id\": 1712,\n    \"benchmark_id\": \"androidworld-sr\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.35,\n    \"normalized_score\": 0.35,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"SR\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.813492+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.813492+00:00\",\n    \"benchmark_name\": \"AndroidWorld_SR\"\n  },\n  {\n    \"model_benchmark_id\": 1657,\n    \"benchmark_id\": \"cc-ocr\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.798,\n    \"normalized_score\": 0.798,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.657333+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.657333+00:00\",\n    \"benchmark_name\": \"CC-OCR\"\n  },\n  {\n    \"model_benchmark_id\": 867,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.895,\n    \"normalized_score\": 0.895,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.811401+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.811401+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 888,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.964,\n    \"normalized_score\": 0.964,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.848273+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.848273+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 925,\n    \"benchmark_id\": \"egoschema\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.762,\n    \"normalized_score\": 0.762,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": 
null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.933582+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.933582+00:00\",\n    \"benchmark_name\": \"EgoSchema\"\n  },\n  {\n    \"model_benchmark_id\": 1673,\n    \"benchmark_id\": \"hallusion-bench\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.5516,\n    \"normalized_score\": 0.5516,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.694733+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.694733+00:00\",\n    \"benchmark_name\": \"Hallusion Bench\"\n  },\n  {\n    \"model_benchmark_id\": 829,\n    \"benchmark_id\": \"lvbench\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.473,\n    \"normalized_score\": 0.473,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.731476+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.731476+00:00\",\n    \"benchmark_name\": \"LVBench\"\n  },\n  {\n    \"model_benchmark_id\": 1677,\n    \"benchmark_id\": \"mathvision\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.381,\n    \"normalized_score\": 0.381,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.705119+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.705119+00:00\",\n    \"benchmark_name\": \"MathVision\"\n  },\n  {\n    \"model_benchmark_id\": 1271,\n    \"benchmark_id\": \"mathvista-mini\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.748,\n    \"normalized_score\": 0.748,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.666379+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.666379+00:00\",\n    \"benchmark_name\": \"MathVista-Mini\"\n  },\n  {\n    \"model_benchmark_id\": 1746,\n    \"benchmark_id\": \"mlvu-m\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.746,\n    \"normalized_score\": 0.746,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.934328+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.934328+00:00\",\n    \"benchmark_name\": \"MLVU-M\"\n  },\n  {\n    \"model_benchmark_id\": 1512,\n    \"benchmark_id\": \"mmbench\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.88,\n    \"normalized_score\": 0.88,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.243543+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.243543+00:00\",\n    \"benchmark_name\": \"MMBench\"\n  },\n  {\n    \"model_benchmark_id\": 1689,\n    \"benchmark_id\": \"mmbench-video\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.0202,\n    \"normalized_score\": 0.0202,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-VL\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.744558+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.744558+00:00\",\n    \"benchmark_name\": \"MMBench-Video\"\n  },\n  {\n    \"model_benchmark_id\": 572,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.702,\n    \"normalized_score\": 0.702,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.177290+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.177290+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1535,\n    \"benchmark_id\": \"mmmu-pro\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.511,\n    \"normalized_score\": 0.511,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.297757+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.297757+00:00\",\n    \"benchmark_name\": \"MMMU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1661,\n    \"benchmark_id\": \"mmstar\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.708,\n    \"normalized_score\": 0.708,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.666719+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.666719+00:00\",\n    \"benchmark_name\": \"MMStar\"\n  },\n  {\n    \"model_benchmark_id\": 1671,\n    \"benchmark_id\": \"mmvet\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.7619,\n    \"normalized_score\": 0.7619,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.688513+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.688513+00:00\",\n    \"benchmark_name\": \"MMVet\"\n  },\n  {\n    \"model_benchmark_id\": 1715,\n    \"benchmark_id\": \"mobileminiwob++-sr\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.68,\n    \"normalized_score\": 0.68,\n    
\"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"SR\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.820961+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.820961+00:00\",\n    \"benchmark_name\": \"MobileMiniWob++_SR\"\n  },\n  {\n    \"model_benchmark_id\": 1644,\n    \"benchmark_id\": \"mvbench\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.704,\n    \"normalized_score\": 0.704,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.623550+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.623550+00:00\",\n    \"benchmark_name\": \"MVBench\"\n  },\n  {\n    \"model_benchmark_id\": 1541,\n    \"benchmark_id\": \"ocrbench\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.885,\n    \"normalized_score\": 0.885,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.318110+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.318110+00:00\",\n    \"benchmark_name\": \"OCRBench\"\n  },\n  {\n    \"model_benchmark_id\": 1744,\n    \"benchmark_id\": \"ocrbench-v2-(en)\",\n    \"model_id\": 
\"qwen2.5-vl-72b\",\n    \"score\": 0.615,\n    \"normalized_score\": 0.615,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.928710+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.928710+00:00\",\n    \"benchmark_name\": \"OCRBench-V2 (en)\"\n  },\n  {\n    \"model_benchmark_id\": 1747,\n    \"benchmark_id\": \"osworld\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.0883,\n    \"normalized_score\": 0.0883,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.937610+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.937610+00:00\",\n    \"benchmark_name\": \"OSWorld\"\n  },\n  {\n    \"model_benchmark_id\": 1680,\n    \"benchmark_id\": \"perceptiontest\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.732,\n    \"normalized_score\": 0.732,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.713944+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.713944+00:00\",\n    \"benchmark_name\": \"PerceptionTest\"\n  },\n  {\n    
\"model_benchmark_id\": 1697,\n    \"benchmark_id\": \"screenspot\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.871,\n    \"normalized_score\": 0.871,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.773284+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.773284+00:00\",\n    \"benchmark_name\": \"ScreenSpot\"\n  },\n  {\n    \"model_benchmark_id\": 1700,\n    \"benchmark_id\": \"screenspot-pro\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.436,\n    \"normalized_score\": 0.436,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.780898+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.780898+00:00\",\n    \"benchmark_name\": \"ScreenSpot Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1692,\n    \"benchmark_id\": \"tempcompass\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.748,\n    \"normalized_score\": 0.748,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.754032+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:14.754032+00:00\",\n    \"benchmark_name\": \"TempCompass\"\n  },\n  {\n    \"model_benchmark_id\": 1682,\n    \"benchmark_id\": \"videomme-w-o-sub.\",\n    \"model_id\": \"qwen2.5-vl-72b\",\n    \"score\": 0.733,\n    \"normalized_score\": 0.733,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.720259+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.720259+00:00\",\n    \"benchmark_name\": \"VideoMME w/o sub.\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen2.5-vl-72b/model.json",
    "content": "{\n  \"model_id\": \"qwen2.5-vl-72b\",\n  \"name\": \"Qwen2.5 VL 72B Instruct\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen2.5-VL is the new flagship vision-language model of Qwen, significantly improved from Qwen2-VL. It excels at recognizing objects, analyzing text/charts/layouts in images, acting as a visual agent, understanding long videos (over 1 hour) with event pinpointing, performing visual localization (bounding boxes/points), and generating structured outputs from documents.\",\n  \"release_date\": \"2025-01-26\",\n  \"announcement_date\": \"2025-01-26\",\n  \"license_id\": \"tongyi_qianwen\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 72000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://chat.qwen.ai/\",\n  \"source_paper\": \"https://arxiv.org/pdf/2502.13923\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen2.5-vl/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2.5-VL\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.647509+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.647509+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen2.5-vl-7b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1702,\n    \"benchmark_id\": \"aitz-em\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.819,\n    \"normalized_score\": 0.819,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.787781+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.787781+00:00\",\n    \"benchmark_name\": \"AITZ_EM\"\n  },\n  {\n    \"model_benchmark_id\": 1705,\n    \"benchmark_id\": \"android-control-high-em\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.601,\n    \"normalized_score\": 0.601,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.794879+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.794879+00:00\",\n    \"benchmark_name\": \"Android Control High_EM\"\n  },\n  {\n    \"model_benchmark_id\": 1708,\n    \"benchmark_id\": \"android-control-low-em\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.914,\n    \"normalized_score\": 0.914,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-VL\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"EM\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.803305+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:14.803305+00:00\",\n    \"benchmark_name\": \"Android Control Low_EM\"\n  },\n  {\n    \"model_benchmark_id\": 1711,\n    \"benchmark_id\": \"androidworld-sr\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.255,\n    \"normalized_score\": 0.255,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"SR\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.811782+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.811782+00:00\",\n    \"benchmark_name\": \"AndroidWorld_SR\"\n  },\n  {\n    \"model_benchmark_id\": 1656,\n    \"benchmark_id\": \"cc-ocr\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.778,\n    \"normalized_score\": 0.778,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.655251+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.655251+00:00\",\n    \"benchmark_name\": \"CC-OCR\"\n  },\n  {\n    \"model_benchmark_id\": 1694,\n    \"benchmark_id\": \"charadessta\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.436,\n    \"normalized_score\": 0.436,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"mIoU\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
null,\n    \"created_at\": \"2025-07-19T19:56:14.763802+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.763802+00:00\",\n    \"benchmark_name\": \"CharadesSTA\"\n  },\n  {\n    \"model_benchmark_id\": 865,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.873,\n    \"normalized_score\": 0.873,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.808329+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.808329+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 886,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.957,\n    \"normalized_score\": 0.957,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.844347+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.844347+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1672,\n    \"benchmark_id\": \"hallusion-bench\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.529,\n    \"normalized_score\": 0.529,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.693096+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.693096+00:00\",\n    \"benchmark_name\": \"Hallusion Bench\"\n  },\n  {\n    \"model_benchmark_id\": 1242,\n    \"benchmark_id\": \"infovqa\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.826,\n    \"normalized_score\": 0.826,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.610945+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.610945+00:00\",\n    \"benchmark_name\": \"InfoVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1687,\n    \"benchmark_id\": \"longvideobench\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.547,\n    \"normalized_score\": 0.547,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.737450+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.737450+00:00\",\n    \"benchmark_name\": \"LongVideoBench\"\n  },\n  {\n    \"model_benchmark_id\": 828,\n    \"benchmark_id\": \"lvbench\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.453,\n    \"normalized_score\": 0.453,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.729778+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.729778+00:00\",\n    \"benchmark_name\": \"LVBench\"\n  },\n  {\n    \"model_benchmark_id\": 1674,\n    \"benchmark_id\": \"mathvision\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.2507,\n    \"normalized_score\": 0.2507,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.698748+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.698748+00:00\",\n    \"benchmark_name\": \"MathVision\"\n  },\n  {\n    \"model_benchmark_id\": 1270,\n    \"benchmark_id\": \"mathvista-mini\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.682,\n    \"normalized_score\": 0.682,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.664381+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.664381+00:00\",\n    \"benchmark_name\": \"MathVista-Mini\"\n  },\n  {\n    \"model_benchmark_id\": 1693,\n    \"benchmark_id\": \"mlvu\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.702,\n    \"normalized_score\": 0.702,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    
\"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.758833+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.758833+00:00\",\n    \"benchmark_name\": \"MLVU\"\n  },\n  {\n    \"model_benchmark_id\": 1511,\n    \"benchmark_id\": \"mmbench\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.843,\n    \"normalized_score\": 0.843,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-VL\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.241869+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.241869+00:00\",\n    \"benchmark_name\": \"MMBench\"\n  },\n  {\n    \"model_benchmark_id\": 1688,\n    \"benchmark_id\": \"mmbench-video\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.0179,\n    \"normalized_score\": 0.0179,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-VL\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.742467+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.742467+00:00\",\n    \"benchmark_name\": \"MMBench-Video\"\n  },\n  {\n    \"model_benchmark_id\": 569,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.586,\n    \"normalized_score\": 0.586,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.170987+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.170987+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1533,\n    \"benchmark_id\": \"mmmu-pro\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.383,\n    \"normalized_score\": 0.383,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://github.com/QwenLM/Qwen2.5-VL\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.294582+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.294582+00:00\",\n    \"benchmark_name\": \"MMMU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1659,\n    \"benchmark_id\": \"mmstar\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.639,\n    \"normalized_score\": 0.639,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.662888+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.662888+00:00\",\n    \"benchmark_name\": \"MMStar\"\n  },\n  {\n    \"model_benchmark_id\": 1666,\n    \"benchmark_id\": \"mmt-bench\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.636,\n    \"normalized_score\": 0.636,\n    \"is_self_reported\": true,\n    
\"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.676869+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.676869+00:00\",\n    \"benchmark_name\": \"MMT-Bench\"\n  },\n  {\n    \"model_benchmark_id\": 1670,\n    \"benchmark_id\": \"mmvet\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.671,\n    \"normalized_score\": 0.671,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.687023+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.687023+00:00\",\n    \"benchmark_name\": \"MMVet\"\n  },\n  {\n    \"model_benchmark_id\": 1714,\n    \"benchmark_id\": \"mobileminiwob++-sr\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.914,\n    \"normalized_score\": 0.914,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"SR\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.819401+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.819401+00:00\",\n    \"benchmark_name\": \"MobileMiniWob++_SR\"\n  },\n  {\n    \"model_benchmark_id\": 1642,\n    \"benchmark_id\": \"mvbench\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.696,\n    
\"normalized_score\": 0.696,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.620310+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.620310+00:00\",\n    \"benchmark_name\": \"MVBench\"\n  },\n  {\n    \"model_benchmark_id\": 1540,\n    \"benchmark_id\": \"ocrbench\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.864,\n    \"normalized_score\": 0.864,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.315649+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.315649+00:00\",\n    \"benchmark_name\": \"OCRBench\"\n  },\n  {\n    \"model_benchmark_id\": 1679,\n    \"benchmark_id\": \"perceptiontest\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.705,\n    \"normalized_score\": 0.705,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.712010+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.712010+00:00\",\n    \"benchmark_name\": \"PerceptionTest\"\n  },\n  {\n    \"model_benchmark_id\": 1696,\n    \"benchmark_id\": \"screenspot\",\n    
\"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.847,\n    \"normalized_score\": 0.847,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.771516+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.771516+00:00\",\n    \"benchmark_name\": \"ScreenSpot\"\n  },\n  {\n    \"model_benchmark_id\": 1699,\n    \"benchmark_id\": \"screenspot-pro\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.29,\n    \"normalized_score\": 0.29,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.779312+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.779312+00:00\",\n    \"benchmark_name\": \"ScreenSpot Pro\"\n  },\n  {\n    \"model_benchmark_id\": 1691,\n    \"benchmark_id\": \"tempcompass\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.717,\n    \"normalized_score\": 0.717,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.752008+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.752008+00:00\",\n    \"benchmark_name\": \"TempCompass\"\n  },\n  {\n    
\"model_benchmark_id\": 910,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.849,\n    \"normalized_score\": 0.849,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.896871+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.896871+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  },\n  {\n    \"model_benchmark_id\": 1681,\n    \"benchmark_id\": \"videomme-w-o-sub.\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.651,\n    \"normalized_score\": 0.651,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.718319+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.718319+00:00\",\n    \"benchmark_name\": \"VideoMME w/o sub.\"\n  },\n  {\n    \"model_benchmark_id\": 1684,\n    \"benchmark_id\": \"videomme-w-sub.\",\n    \"model_id\": \"qwen2.5-vl-7b\",\n    \"score\": 0.716,\n    \"normalized_score\": 0.716,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.726358+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:14.726358+00:00\",\n    \"benchmark_name\": \"VideoMME w sub.\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen2.5-vl-7b/model.json",
    "content": "{\n  \"model_id\": \"qwen2.5-vl-7b\",\n  \"name\": \"Qwen2.5 VL 7B Instruct\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen2.5-VL is a vision-language model from the Qwen family. Key enhancements include visual understanding (objects, text, charts, layouts), visual agent capabilities (tool use, computer/phone control), long video comprehension with event pinpointing, visual localization (bounding boxes/points), and structured output generation.\",\n  \"release_date\": \"2025-01-26\",\n  \"announcement_date\": \"2025-01-26\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 8290000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://chat.qwen.ai/\",\n  \"source_paper\": \"https://arxiv.org/pdf/2502.13923\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen2.5-vl/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2.5-VL\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct\",\n  \"created_at\": \"2025-07-19T19:49:05.635630+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.635630+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-235b-a22b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1626,\n    \"benchmark_id\": \"aider\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.618,\n    \"normalized_score\": 0.618,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@2\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.572970+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.572970+00:00\",\n    \"benchmark_name\": \"Aider\"\n  },\n  {\n    \"model_benchmark_id\": 454,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.857,\n    \"normalized_score\": 0.857,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.963641+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.963641+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 690,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.815,\n    \"normalized_score\": 0.815,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.447678+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.447678+00:00\",\n    
\"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1452,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.956,\n    \"normalized_score\": 0.956,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.095282+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.095282+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 972,\n    \"benchmark_id\": \"bbh\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.8887,\n    \"normalized_score\": 0.8887,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.043683+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.043683+00:00\",\n    \"benchmark_name\": \"BBH\"\n  },\n  {\n    \"model_benchmark_id\": 851,\n    \"benchmark_id\": \"bfcl\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.708,\n    \"normalized_score\": 0.708,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"v3\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.780457+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:12.780457+00:00\",\n    \"benchmark_name\": \"BFCL\"\n  },\n  {\n    \"model_benchmark_id\": 1648,\n    \"benchmark_id\": \"crux-o\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.79,\n    \"normalized_score\": 0.79,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.637715+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.637715+00:00\",\n    \"benchmark_name\": \"CRUX-O\"\n  },\n  {\n    \"model_benchmark_id\": 371,\n    \"benchmark_id\": \"evalplus\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.776,\n    \"normalized_score\": 0.776,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.801301+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.801301+00:00\",\n    \"benchmark_name\": \"EvalPlus\"\n  },\n  {\n    \"model_benchmark_id\": 302,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.4747,\n    \"normalized_score\": 0.4747,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.679464+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:11.679464+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 995,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.9439,\n    \"normalized_score\": 0.9439,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.083824+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.083824+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 1308,\n    \"benchmark_id\": \"include\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.7346,\n    \"normalized_score\": 0.7346,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.737543+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.737543+00:00\",\n    \"benchmark_name\": \"Include\"\n  },\n  {\n    \"model_benchmark_id\": 749,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.771,\n    \"normalized_score\": 0.771,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.575629+00:00\",\n    
\"updated_at\": \"2025-07-19T19:56:12.575629+00:00\",\n    \"benchmark_name\": \"LiveBench\"\n  },\n  {\n    \"model_benchmark_id\": 1123,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.707,\n    \"normalized_score\": 0.707,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"v5\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.344206+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.344206+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 405,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.7184,\n    \"normalized_score\": 0.7184,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.863985+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.863985+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 1186,\n    \"benchmark_id\": \"mbpp\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.814,\n    \"normalized_score\": 0.814,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:13.500617+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.500617+00:00\",\n    \"benchmark_name\": \"MBPP\"\n  },\n  {\n    \"model_benchmark_id\": 1289,\n    \"benchmark_id\": \"mgsm\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.8353,\n    \"normalized_score\": 0.8353,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.700097+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.700097+00:00\",\n    \"benchmark_name\": \"MGSM\"\n  },\n  {\n    \"model_benchmark_id\": 90,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.8781,\n    \"normalized_score\": 0.8781,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.270963+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.270963+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 195,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.6818,\n    \"normalized_score\": 0.6818,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-19T19:56:11.472627+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.472627+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 732,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.874,\n    \"normalized_score\": 0.874,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.540685+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.540685+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 1477,\n    \"benchmark_id\": \"mmmlu\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.867,\n    \"normalized_score\": 0.867,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.150792+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.150792+00:00\",\n    \"benchmark_name\": \"MMMLU\"\n  },\n  {\n    \"model_benchmark_id\": 1647,\n    \"benchmark_id\": \"multilf\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.719,\n    \"normalized_score\": 0.719,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    
\"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.633963+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.633963+00:00\",\n    \"benchmark_name\": \"MultiLF\"\n  },\n  {\n    \"model_benchmark_id\": 643,\n    \"benchmark_id\": \"multipl-e\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.6594,\n    \"normalized_score\": 0.6594,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.320821+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.320821+00:00\",\n    \"benchmark_name\": \"MultiPL-E\"\n  },\n  {\n    \"model_benchmark_id\": 366,\n    \"benchmark_id\": \"supergpqa\",\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"score\": 0.4406,\n    \"normalized_score\": 0.4406,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.784624+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.784624+00:00\",\n    \"benchmark_name\": \"SuperGPQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-235b-a22b/model.json",
    "content": "{\n  \"model_id\": \"qwen3-235b-a22b\",\n  \"name\": \"Qwen3 235B A22B\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen3 235B A22B is a large language model developed by Alibaba, featuring a Mixture-of-Experts (MoE) architecture with 235 billion total parameters and 22 billion activated parameters. It achieves competitive results in benchmark evaluations of coding, math, general capabilities, and more, compared to other top-tier models.\",\n  \"release_date\": \"2025-04-29\",\n  \"announcement_date\": \"2025-04-29\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 235000000000,\n  \"training_tokens\": 36000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://qwenlm.github.io/blog/qwen3/\",\n  \"source_playground\": \"https://chat.qwen.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen3\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B\",\n  \"created_at\": \"2025-07-19T19:49:05.624683+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.624683+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-235b-a22b-instruct-2507/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 15972,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.573,\n    \"normalized_score\": 0.573,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.609026+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.609026+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 15973,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.703,\n    \"normalized_score\": 0.703,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.611021+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.611021+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 15974,\n    \"benchmark_id\": \"arc-agi\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.418,\n    \"normalized_score\": 0.418,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n   
 \"created_at\": \"2025-08-03T22:06:13.618116+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.618116+00:00\",\n    \"benchmark_name\": \"ARC-AGI\"\n  },\n  {\n    \"model_benchmark_id\": 15975,\n    \"benchmark_id\": \"arena-hard-v2\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.792,\n    \"normalized_score\": 0.792,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Win Rate\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.620187+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.620187+00:00\",\n    \"benchmark_name\": \"Arena-Hard v2\"\n  },\n  {\n    \"model_benchmark_id\": 15976,\n    \"benchmark_id\": \"bfcl-v3\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.709,\n    \"normalized_score\": 0.709,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.622144+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.622144+00:00\",\n    \"benchmark_name\": \"BFCL-v3\"\n  },\n  {\n    \"model_benchmark_id\": 15977,\n    \"benchmark_id\": \"creative-writing-v3\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.875,\n    \"normalized_score\": 0.875,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    
\"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.626065+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.626065+00:00\",\n    \"benchmark_name\": \"Creative Writing v3\"\n  },\n  {\n    \"model_benchmark_id\": 15978,\n    \"benchmark_id\": \"csimpleqa\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.843,\n    \"normalized_score\": 0.843,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.629696+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.629696+00:00\",\n    \"benchmark_name\": \"CSimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 15979,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.775,\n    \"normalized_score\": 0.775,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.631769+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.631769+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 15980,\n    \"benchmark_id\": \"hmmt25\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.554,\n    \"normalized_score\": 0.554,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.633387+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.633387+00:00\",\n    \"benchmark_name\": \"HMMT25\"\n  },\n  {\n    \"model_benchmark_id\": 15981,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.887,\n    \"normalized_score\": 0.887,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.635001+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.635001+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 15982,\n    \"benchmark_id\": \"include\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.795,\n    \"normalized_score\": 0.795,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.636605+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.636605+00:00\",\n    \"benchmark_name\": \"INCLUDE\"\n  },\n  {\n    \"model_benchmark_id\": 15983,\n    \"benchmark_id\": \"livebench-20241125\",\n    \"model_id\": 
\"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.754,\n    \"normalized_score\": 0.754,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.638166+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.638166+00:00\",\n    \"benchmark_name\": \"LiveBench 20241125\"\n  },\n  {\n    \"model_benchmark_id\": 15984,\n    \"benchmark_id\": \"livecodebench-v6\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.518,\n    \"normalized_score\": 0.518,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.639661+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.639661+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v6\"\n  },\n  {\n    \"model_benchmark_id\": 15985,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.83,\n    \"normalized_score\": 0.83,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.641236+00:00\",\n    \"updated_at\": 
\"2025-08-03T22:06:13.641236+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 15986,\n    \"benchmark_id\": \"mmlu-prox\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.794,\n    \"normalized_score\": 0.794,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.642908+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.642908+00:00\",\n    \"benchmark_name\": \"MMLU-ProX\"\n  },\n  {\n    \"model_benchmark_id\": 15987,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.931,\n    \"normalized_score\": 0.931,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.644630+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.644630+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 15988,\n    \"benchmark_id\": \"multi-if\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.775,\n    \"normalized_score\": 0.775,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    
\"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.646355+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.646355+00:00\",\n    \"benchmark_name\": \"Multi-IF\"\n  },\n  {\n    \"model_benchmark_id\": 15989,\n    \"benchmark_id\": \"multipl-e\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.879,\n    \"normalized_score\": 0.879,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Score\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.648211+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.648211+00:00\",\n    \"benchmark_name\": \"MultiPL-E\"\n  },\n  {\n    \"model_benchmark_id\": 15990,\n    \"benchmark_id\": \"polymath\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.502,\n    \"normalized_score\": 0.502,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.649756+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.649756+00:00\",\n    \"benchmark_name\": \"PolyMATH\"\n  },\n  {\n    \"model_benchmark_id\": 15991,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.543,\n    \"normalized_score\": 0.543,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": 
false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.651445+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.651445+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  },\n  {\n    \"model_benchmark_id\": 15992,\n    \"benchmark_id\": \"supergpqa\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.626,\n    \"normalized_score\": 0.626,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.652980+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.652980+00:00\",\n    \"benchmark_name\": \"SuperGPQA\"\n  },\n  {\n    \"model_benchmark_id\": 15993,\n    \"benchmark_id\": \"tau2-airline\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.44,\n    \"normalized_score\": 0.44,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.654737+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.654737+00:00\",\n    \"benchmark_name\": \"Tau2 airline\"\n  },\n  {\n    \"model_benchmark_id\": 15994,\n    \"benchmark_id\": \"tau2-retail\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.713,\n    \"normalized_score\": 0.713,\n    
\"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.656359+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.656359+00:00\",\n    \"benchmark_name\": \"Tau2 retail\"\n  },\n  {\n    \"model_benchmark_id\": 15995,\n    \"benchmark_id\": \"writingbench\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.852,\n    \"normalized_score\": 0.852,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.657968+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.657968+00:00\",\n    \"benchmark_name\": \"WritingBench\"\n  },\n  {\n    \"model_benchmark_id\": 15996,\n    \"benchmark_id\": \"zebralogic\",\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"score\": 0.95,\n    \"normalized_score\": 0.95,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-08-03T22:06:13.659618+00:00\",\n    \"updated_at\": \"2025-08-03T22:06:13.659618+00:00\",\n    \"benchmark_name\": \"ZebraLogic\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-235b-a22b-instruct-2507/model.json",
    "content": "{\n  \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n  \"name\": \"Qwen3-235B-A22B-Instruct-2507\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen3-235B-A22B-Instruct-2507 is the updated instruct version of Qwen3-235B-A22B featuring significant improvements in general capabilities including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage. It provides substantial gains in long-tail knowledge coverage across multiple languages and markedly better alignment with user preferences in subjective and open-ended tasks.\",\n  \"release_date\": \"2025-07-22\",\n  \"announcement_date\": \"2025-07-22\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 235000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://qwenlm.github.io/blog/qwen3/\",\n  \"source_playground\": \"https://chat.qwen.ai/\",\n  \"source_paper\": \"https://arxiv.org/abs/2505.09388\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen3\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507\",\n  \"created_at\": \"2025-08-03T22:06:11.701778+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:11.701778+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-235b-a22b-thinking-2507/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 9101,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.844,\n    \"normalized_score\": 0.844,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 9102,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.938,\n    \"normalized_score\": 0.938,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 9103,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.811,\n    \"normalized_score\": 0.811,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": 
\"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9104,\n    \"benchmark_id\": \"supergpqa\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.649,\n    \"normalized_score\": 0.649,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SuperGPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9105,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.923,\n    \"normalized_score\": 0.923,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 9106,\n    \"benchmark_id\": \"hmmt25\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.839,\n    \"normalized_score\": 0.839,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    
\"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"HMMT25\"\n  },\n  {\n    \"model_benchmark_id\": 9107,\n    \"benchmark_id\": \"livebench-20241125\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.784,\n    \"normalized_score\": 0.784,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveBench 20241125\"\n  },\n  {\n    \"model_benchmark_id\": 9108,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.182,\n    \"normalized_score\": 0.182,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"text-only subset\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Score refers to text-only subset as model is not multi-modal\",\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"HLE\"\n  },\n  {\n    \"model_benchmark_id\": 9109,\n    \"benchmark_id\": \"livecodebench-v6\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.741,\n    \"normalized_score\": 0.741,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    
\"analysis_method\": \"25.02-25.05\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v6\"\n  },\n  {\n    \"model_benchmark_id\": 9110,\n    \"benchmark_id\": \"cfeval\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 2134,\n    \"normalized_score\": 0.2134,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"Raw score: 2134\",\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"CFEval\"\n  },\n  {\n    \"model_benchmark_id\": 9111,\n    \"benchmark_id\": \"ojbench\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.325,\n    \"normalized_score\": 0.325,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"OJBench\"\n  },\n  {\n    \"model_benchmark_id\": 9112,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.878,\n    \"normalized_score\": 0.878,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 9113,\n    \"benchmark_id\": \"arena-hard-v2\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.797,\n    \"normalized_score\": 0.797,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4 evaluated win rates\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Arena-Hard v2\"\n  },\n  {\n    \"model_benchmark_id\": 9114,\n    \"benchmark_id\": \"creative-writing-v3\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.861,\n    \"normalized_score\": 0.861,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Creative Writing v3\"\n  },\n  {\n    \"model_benchmark_id\": 9115,\n    \"benchmark_id\": \"writingbench\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n  
  \"score\": 0.883,\n    \"normalized_score\": 0.883,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"WritingBench\"\n  },\n  {\n    \"model_benchmark_id\": 9116,\n    \"benchmark_id\": \"bfcl-v3\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.719,\n    \"normalized_score\": 0.719,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BFCL-v3\"\n  },\n  {\n    \"model_benchmark_id\": 9117,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.678,\n    \"normalized_score\": 0.678,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"TAU1-Retail\",\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU1-Retail\"\n  },\n  {\n    \"model_benchmark_id\": 9118,\n    
\"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.46,\n    \"normalized_score\": 0.46,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"TAU1-Airline\",\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU1-Airline\"\n  },\n  {\n    \"model_benchmark_id\": 9119,\n    \"benchmark_id\": \"tau2-retail\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.719,\n    \"normalized_score\": 0.719,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU2-Retail\"\n  },\n  {\n    \"model_benchmark_id\": 9120,\n    \"benchmark_id\": \"tau2-airline\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.58,\n    \"normalized_score\": 0.58,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": 
\"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU2-Airline\"\n  },\n  {\n    \"model_benchmark_id\": 9121,\n    \"benchmark_id\": \"tau2-telecom\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.456,\n    \"normalized_score\": 0.456,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU2-Telecom\"\n  },\n  {\n    \"model_benchmark_id\": 9122,\n    \"benchmark_id\": \"multi-if\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.806,\n    \"normalized_score\": 0.806,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"MultiIF\",\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MultiIF\"\n  },\n  {\n    \"model_benchmark_id\": 9123,\n    \"benchmark_id\": \"mmlu-prox\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.81,\n    \"normalized_score\": 0.81,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": 
\"MMLU-ProX\",\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-ProX\"\n  },\n  {\n    \"model_benchmark_id\": 9124,\n    \"benchmark_id\": \"include\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.81,\n    \"normalized_score\": 0.81,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"INCLUDE\"\n  },\n  {\n    \"model_benchmark_id\": 9125,\n    \"benchmark_id\": \"polymath\",\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"score\": 0.601,\n    \"normalized_score\": 0.601,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"PolyMATH\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-235b-a22b-thinking-2507/model.json",
    "content": "{\n  \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n  \"name\": \"Qwen3-235B-A22B-Thinking-2507\",\n  \"organization_id\": \"qwen\",\n  \"model_family_id\": null,\n  \"fine_tuned_from_model_id\": \"qwen3-235b-a22b\",\n  \"description\": \"Qwen3-235B-A22B-Thinking-2507 is a state-of-the-art thinking-enabled Mixture-of-Experts (MoE) model with 235B total parameters (22B activated). It features 94 layers, 128 experts (8 activated), and supports 262K native context length. This version delivers significantly improved reasoning performance, achieving state-of-the-art results among open-source thinking models on logical reasoning, mathematics, science, coding, and academic benchmarks. Key enhancements include markedly better general capabilities (instruction following, tool usage, text generation), enhanced 256K long-context understanding, and increased thinking depth. The model supports only thinking mode with automatic <think> tag inclusion.\",\n  \"release_date\": \"2025-07-25\",\n  \"announcement_date\": \"2025-07-25\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 235000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507\",\n  \"source_playground\": \"https://chat.qwen.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen3-thinking/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen3\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507\",\n  \"created_at\": \"2025-07-25T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-30b-a3b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 455,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"qwen3-30b-a3b\",\n    \"score\": 0.804,\n    \"normalized_score\": 0.804,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.965575+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.965575+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 691,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"qwen3-30b-a3b\",\n    \"score\": 0.709,\n    \"normalized_score\": 0.709,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.449947+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.449947+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1454,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"qwen3-30b-a3b\",\n    \"score\": 0.91,\n    \"normalized_score\": 0.91,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.098594+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.098594+00:00\",\n    
\"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 852,\n    \"benchmark_id\": \"bfcl\",\n    \"model_id\": \"qwen3-30b-a3b\",\n    \"score\": 0.691,\n    \"normalized_score\": 0.691,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"v3\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.782049+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.782049+00:00\",\n    \"benchmark_name\": \"BFCL\"\n  },\n  {\n    \"model_benchmark_id\": 304,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen3-30b-a3b\",\n    \"score\": 0.658,\n    \"normalized_score\": 0.658,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.682771+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.682771+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 751,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"qwen3-30b-a3b\",\n    \"score\": 0.743,\n    \"normalized_score\": 0.743,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.579527+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.579527+00:00\",\n    
\"benchmark_name\": \"LiveBench\"\n  },\n  {\n    \"model_benchmark_id\": 1125,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"qwen3-30b-a3b\",\n    \"score\": 0.626,\n    \"normalized_score\": 0.626,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"v5\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.349221+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.349221+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 1649,\n    \"benchmark_id\": \"multi-if\",\n    \"model_id\": \"qwen3-30b-a3b\",\n    \"score\": 0.722,\n    \"normalized_score\": 0.722,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.641584+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.641584+00:00\",\n    \"benchmark_name\": \"Multi-IF\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-30b-a3b/model.json",
    "content": "{\n  \"model_id\": \"qwen3-30b-a3b\",\n  \"name\": \"Qwen3 30B A3B\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen3-30B-A3B is a smaller Mixture-of-Experts (MoE) model from the Qwen3 series by Alibaba, with 30.5 billion total parameters and 3.3 billion activated parameters. Features hybrid thinking/non-thinking modes, support for 119 languages, and enhanced agent capabilities. It aims to outperform previous models like QwQ-32B while using significantly fewer activated parameters.\",\n  \"release_date\": \"2025-04-29\",\n  \"announcement_date\": \"2025-04-29\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 30500000000,\n  \"training_tokens\": 36000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://qwenlm.github.io/blog/qwen3/\",\n  \"source_playground\": \"https://chat.qwen.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen3\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen3-30B-A3B\",\n  \"created_at\": \"2025-07-19T19:49:05.631206+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.631206+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-32b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1625,\n    \"benchmark_id\": \"aider\",\n    \"model_id\": \"qwen3-32b\",\n    \"score\": 0.502,\n    \"normalized_score\": 0.502,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@2\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.571165+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.571165+00:00\",\n    \"benchmark_name\": \"Aider\"\n  },\n  {\n    \"model_benchmark_id\": 453,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"qwen3-32b\",\n    \"score\": 0.814,\n    \"normalized_score\": 0.814,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.961658+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.961658+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 689,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"qwen3-32b\",\n    \"score\": 0.729,\n    \"normalized_score\": 0.729,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@64\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.446075+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.446075+00:00\",\n    \"benchmark_name\": \"AIME 
2025\"\n  },\n  {\n    \"model_benchmark_id\": 1451,\n    \"benchmark_id\": \"arena-hard\",\n    \"model_id\": \"qwen3-32b\",\n    \"score\": 0.938,\n    \"normalized_score\": 0.938,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.093495+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.093495+00:00\",\n    \"benchmark_name\": \"Arena Hard\"\n  },\n  {\n    \"model_benchmark_id\": 850,\n    \"benchmark_id\": \"bfcl\",\n    \"model_id\": \"qwen3-32b\",\n    \"score\": 0.703,\n    \"normalized_score\": 0.703,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"v3\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.778924+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.778924+00:00\",\n    \"benchmark_name\": \"BFCL\"\n  },\n  {\n    \"model_benchmark_id\": 1645,\n    \"benchmark_id\": \"codeforces\",\n    \"model_id\": \"qwen3-32b\",\n    \"score\": 0.659,\n    \"normalized_score\": 0.659,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Elo Rating\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.627279+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.627279+00:00\",\n    \"benchmark_name\": 
\"CodeForces\"\n  },\n  {\n    \"model_benchmark_id\": 748,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"qwen3-32b\",\n    \"score\": 0.749,\n    \"normalized_score\": 0.749,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.573432+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.573432+00:00\",\n    \"benchmark_name\": \"LiveBench\"\n  },\n  {\n    \"model_benchmark_id\": 1122,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"qwen3-32b\",\n    \"score\": 0.657,\n    \"normalized_score\": 0.657,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"v5\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.342304+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.342304+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 1646,\n    \"benchmark_id\": \"multilf\",\n    \"model_id\": \"qwen3-32b\",\n    \"score\": 0.73,\n    \"normalized_score\": 0.73,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.630716+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.630716+00:00\",\n    
\"benchmark_name\": \"MultiLF\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-32b/model.json",
    "content": "{\n  \"model_id\": \"qwen3-32b\",\n  \"name\": \"Qwen3 32B\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen3-32B is a large language model from Alibaba's Qwen3 series. It features 32.8 billion parameters, a 128k token context window, support for 119 languages, and hybrid thinking modes allowing switching between deep reasoning and fast responses. It demonstrates strong performance in reasoning, instruction-following, and agent capabilities.\",\n  \"release_date\": \"2025-04-29\",\n  \"announcement_date\": \"2025-04-29\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 32800000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://chat.qwen.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen3/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen3\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen3-32B\",\n  \"created_at\": \"2025-07-19T19:49:05.621845+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.621845+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-next-80b-a3b-base/benchmarks.json",
    "content": "[]\n"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-next-80b-a3b-base/model.json",
    "content": "{\n  \"model_id\": \"qwen3-next-80b-a3b-base\",\n  \"name\": \"Qwen3-Next-80B-A3B-Base\",\n  \"organization_id\": \"qwen\",\n  \"model_family_id\": null,\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen3-Next-80B-A3B-Base is the foundation model in the Qwen3-Next series, featuring revolutionary architectural innovations for ultimate training and inference efficiency. It introduces Hybrid Attention combining Gated DeltaNet (75% layers) and Gated Attention (25% layers) for efficient ultra-long context modeling, Ultra-Sparse MoE with 512 total experts but only 10 routed + 1 shared expert activated (3.7% activation ratio), and native Multi-Token Prediction for faster inference. With 80B total parameters and only ~3B activated per inference step, it achieves performance comparable to Qwen3-32B while using less than 10% training cost and delivering 10x+ throughput for 32K+ contexts. Trained on 15T tokens with training-stability-friendly designs including Zero-Centered RMSNorm and normalized MoE router parameters. Supports 256K context length, extensible to 1M tokens with YaRN scaling.\",\n  \"release_date\": \"2025-09-10\",\n  \"announcement_date\": \"2025-09-10\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 80000000000,\n  \"training_tokens\": 15000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Base\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen3\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Base\",\n  \"created_at\": \"2025-09-10T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-next-80b-a3b-instruct/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 9301,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.806,\n    \"normalized_score\": 0.806,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 9302,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.909,\n    \"normalized_score\": 0.909,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 9303,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.729,\n    \"normalized_score\": 0.729,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": 
\"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9304,\n    \"benchmark_id\": \"supergpqa\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.588,\n    \"normalized_score\": 0.588,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SuperGPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9305,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.695,\n    \"normalized_score\": 0.695,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 9306,\n    \"benchmark_id\": \"hmmt25\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.541,\n    \"normalized_score\": 0.541,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"HMMT25\"\n  },\n  {\n    \"model_benchmark_id\": 9307,\n    \"benchmark_id\": \"livebench-20241125\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.758,\n    \"normalized_score\": 0.758,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveBench 20241125\"\n  },\n  {\n    \"model_benchmark_id\": 9308,\n    \"benchmark_id\": \"livecodebench-v6\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.566,\n    \"normalized_score\": 0.566,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"25.02-25.05\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v6\"\n  },\n  {\n    \"model_benchmark_id\": 9309,\n    \"benchmark_id\": \"multipl-e\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.878,\n    \"normalized_score\": 0.878,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": 
null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MultiPL-E\"\n  },\n  {\n    \"model_benchmark_id\": 9310,\n    \"benchmark_id\": \"aider-polyglot\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.498,\n    \"normalized_score\": 0.498,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Aider-Polyglot\"\n  },\n  {\n    \"model_benchmark_id\": 9311,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.876,\n    \"normalized_score\": 0.876,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 9312,\n    \"benchmark_id\": \"arena-hard-v2\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.827,\n    \"normalized_score\": 0.827,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4.1 
evaluated win rates\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Arena-Hard v2\"\n  },\n  {\n    \"model_benchmark_id\": 9313,\n    \"benchmark_id\": \"creative-writing-v3\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.853,\n    \"normalized_score\": 0.853,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Creative Writing v3\"\n  },\n  {\n    \"model_benchmark_id\": 9314,\n    \"benchmark_id\": \"writingbench\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.873,\n    \"normalized_score\": 0.873,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"WritingBench\"\n  },\n  {\n    \"model_benchmark_id\": 9315,\n    \"benchmark_id\": \"bfcl-v3\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.703,\n    \"normalized_score\": 0.703,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BFCL-v3\"\n  },\n  {\n    \"model_benchmark_id\": 9316,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.609,\n    \"normalized_score\": 0.609,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"TAU1-Retail\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU1-Retail\"\n  },\n  {\n    \"model_benchmark_id\": 9317,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.44,\n    \"normalized_score\": 0.44,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"TAU1-Airline\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU1-Airline\"\n  },\n  {\n    \"model_benchmark_id\": 9318,\n    \"benchmark_id\": \"tau2-retail\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.573,\n    
\"normalized_score\": 0.573,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU2-Retail\"\n  },\n  {\n    \"model_benchmark_id\": 9319,\n    \"benchmark_id\": \"tau2-airline\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.455,\n    \"normalized_score\": 0.455,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU2-Airline\"\n  },\n  {\n    \"model_benchmark_id\": 9320,\n    \"benchmark_id\": \"tau2-telecom\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.132,\n    \"normalized_score\": 0.132,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU2-Telecom\"\n  },\n  {\n    \"model_benchmark_id\": 9321,\n    \"benchmark_id\": \"multi-if\",\n    
\"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.758,\n    \"normalized_score\": 0.758,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"MultiIF\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MultiIF\"\n  },\n  {\n    \"model_benchmark_id\": 9322,\n    \"benchmark_id\": \"mmlu-prox\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.767,\n    \"normalized_score\": 0.767,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"MMLU-ProX\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-ProX\"\n  },\n  {\n    \"model_benchmark_id\": 9323,\n    \"benchmark_id\": \"include\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.789,\n    \"normalized_score\": 0.789,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"INCLUDE\"\n  },\n  {\n    
\"model_benchmark_id\": 9324,\n    \"benchmark_id\": \"polymath\",\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"score\": 0.459,\n    \"normalized_score\": 0.459,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"PolyMATH\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-next-80b-a3b-instruct/model.json",
    "content": "{\n  \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n  \"name\": \"Qwen3-Next-80B-A3B-Instruct\",\n  \"organization_id\": \"qwen\",\n  \"model_family_id\": null,\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen3-Next-80B-A3B-Instruct is the first in the Qwen3-Next series, featuring groundbreaking architectural innovations. It uses Hybrid Attention combining Gated DeltaNet and Gated Attention for efficient ultra-long context modeling, High-Sparsity MoE with 512 experts (10 activated + 1 shared) achieving extreme low activation ratio, and Multi-Token Prediction for improved performance and faster inference. With 80B total parameters and only 3B activated, it outperforms Qwen3-32B-Base with 10% training cost and 10x throughput for 32K+ contexts. The model performs on par with Qwen3-235B-A22B-Instruct-2507 while excelling at ultra-long-context tasks up to 256K tokens (extensible to 1M with YaRN). Architecture: 48 layers, 15T training tokens, hybrid layout of 12*(3*(Gated DeltaNet->MoE)->(Gated Attention->MoE)).\",\n  \"release_date\": \"2025-09-10\",\n  \"announcement_date\": \"2025-09-10\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 80000000000,\n  \"training_tokens\": 15000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct\",\n  \"source_playground\": \"https://chat.qwen.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen3\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct\",\n  \"created_at\": \"2025-09-10T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-next-80b-a3b-thinking/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 9201,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.827,\n    \"normalized_score\": 0.827,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 9202,\n    \"benchmark_id\": \"mmlu-redux\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.925,\n    \"normalized_score\": 0.925,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Redux\"\n  },\n  {\n    \"model_benchmark_id\": 9203,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.772,\n    \"normalized_score\": 0.772,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": 
\"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9204,\n    \"benchmark_id\": \"supergpqa\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.608,\n    \"normalized_score\": 0.608,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SuperGPQA\"\n  },\n  {\n    \"model_benchmark_id\": 9205,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.878,\n    \"normalized_score\": 0.878,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 9206,\n    \"benchmark_id\": \"hmmt25\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.739,\n    \"normalized_score\": 0.739,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"HMMT25\"\n  },\n  {\n    \"model_benchmark_id\": 9207,\n    \"benchmark_id\": \"livebench-20241125\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.766,\n    \"normalized_score\": 0.766,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveBench 241125\"\n  },\n  {\n    \"model_benchmark_id\": 9208,\n    \"benchmark_id\": \"livecodebench-v6\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.687,\n    \"normalized_score\": 0.687,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"25.02-25.05\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v6\"\n  },\n  {\n    \"model_benchmark_id\": 9209,\n    \"benchmark_id\": \"cfeval\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 2071,\n    \"normalized_score\": 0.2071,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n  
  \"verification_date\": null,\n    \"verification_notes\": \"Raw score: 2071\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"CFEval\"\n  },\n  {\n    \"model_benchmark_id\": 9210,\n    \"benchmark_id\": \"ojbench\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.297,\n    \"normalized_score\": 0.297,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"OJBench\"\n  },\n  {\n    \"model_benchmark_id\": 9211,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.889,\n    \"normalized_score\": 0.889,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 9212,\n    \"benchmark_id\": \"arena-hard-v2\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.623,\n    \"normalized_score\": 0.623,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"GPT-4.1 evaluated win 
rates\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Arena-Hard v2\"\n  },\n  {\n    \"model_benchmark_id\": 9213,\n    \"benchmark_id\": \"writingbench\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.846,\n    \"normalized_score\": 0.846,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"WritingBench\"\n  },\n  {\n    \"model_benchmark_id\": 9214,\n    \"benchmark_id\": \"bfcl-v3\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.72,\n    \"normalized_score\": 0.72,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BFCL-v3\"\n  },\n  {\n    \"model_benchmark_id\": 9215,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.696,\n    \"normalized_score\": 0.696,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": 
\"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"TAU1-Retail\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU1-Retail\"\n  },\n  {\n    \"model_benchmark_id\": 9216,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.49,\n    \"normalized_score\": 0.49,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"TAU1-Airline\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU1-Airline\"\n  },\n  {\n    \"model_benchmark_id\": 9217,\n    \"benchmark_id\": \"tau2-retail\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.678,\n    \"normalized_score\": 0.678,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU2-Retail\"\n  },\n  {\n    \"model_benchmark_id\": 9218,\n    \"benchmark_id\": \"tau2-airline\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.605,\n    
\"normalized_score\": 0.605,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU2-Airline\"\n  },\n  {\n    \"model_benchmark_id\": 9219,\n    \"benchmark_id\": \"tau2-telecom\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.439,\n    \"normalized_score\": 0.439,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU2-Telecom\"\n  },\n  {\n    \"model_benchmark_id\": 9220,\n    \"benchmark_id\": \"multi-if\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.778,\n    \"normalized_score\": 0.778,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"MultiIF\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MultiIF\"\n  },\n  {\n    \"model_benchmark_id\": 9221,\n    \"benchmark_id\": \"mmlu-prox\",\n    
\"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.787,\n    \"normalized_score\": 0.787,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": \"MMLU-ProX\",\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-ProX\"\n  },\n  {\n    \"model_benchmark_id\": 9222,\n    \"benchmark_id\": \"include\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.789,\n    \"normalized_score\": 0.789,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"INCLUDE\"\n  },\n  {\n    \"model_benchmark_id\": 9223,\n    \"benchmark_id\": \"polymath\",\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"score\": 0.563,\n    \"normalized_score\": 0.563,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": null,\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-01-10T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"PolyMATH\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/qwen/models/qwen3-next-80b-a3b-thinking/model.json",
    "content": "{\n  \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n  \"name\": \"Qwen3-Next-80B-A3B-Thinking\",\n  \"organization_id\": \"qwen\",\n  \"model_family_id\": null,\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Qwen3-Next-80B-A3B-Thinking is the thinking variant of the Qwen3-Next series, featuring the same groundbreaking architecture as the instruct model. Leveraging GSPO, it addresses the stability and efficiency challenges of hybrid attention + high-sparsity MoE in RL training. It uses Hybrid Attention combining Gated DeltaNet and Gated Attention for efficient ultra-long context modeling, High-Sparsity MoE with 512 experts (10 activated + 1 shared), and Multi-Token Prediction. With 80B total parameters and only 3B activated, it demonstrates outstanding performance on complex reasoning tasks, outperforming Qwen3-30B-A3B-Thinking-2507, Qwen3-32B-Thinking, and even the proprietary Gemini-2.5-Flash-Thinking across multiple benchmarks. Architecture: 48 layers, 15T training tokens, hybrid layout of 12*(3*(Gated DeltaNet->MoE)->(Gated Attention->MoE)). Supports only thinking mode with automatic <think> tag inclusion and may generate longer thinking content.\",\n  \"release_date\": \"2025-09-10\",\n  \"announcement_date\": \"2025-09-10\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 80000000000,\n  \"training_tokens\": 15000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking\",\n  \"source_playground\": \"https://chat.qwen.ai/\",\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwen3-next/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen3\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking\",\n  \"created_at\": \"2025-09-10T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/qwen/models/qwq-32b/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 451,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"qwq-32b\",\n    \"score\": 0.795,\n    \"normalized_score\": 0.795,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwq-32b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.957773+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.957773+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 849,\n    \"benchmark_id\": \"bfcl\",\n    \"model_id\": \"qwq-32b\",\n    \"score\": 0.664,\n    \"normalized_score\": 0.664,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwq-32b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.777209+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.777209+00:00\",\n    \"benchmark_name\": \"BFCL\"\n  },\n  {\n    \"model_benchmark_id\": 298,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwq-32b\",\n    \"score\": 0.652,\n    \"normalized_score\": 0.652,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwen-ai.com/qwq-32b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.672880+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.672880+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    
\"model_benchmark_id\": 619,\n    \"benchmark_id\": \"ifeval\",\n    \"model_id\": \"qwq-32b\",\n    \"score\": 0.839,\n    \"normalized_score\": 0.839,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwq-32b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.275723+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.275723+00:00\",\n    \"benchmark_name\": \"IFEval\"\n  },\n  {\n    \"model_benchmark_id\": 747,\n    \"benchmark_id\": \"livebench\",\n    \"model_id\": \"qwq-32b\",\n    \"score\": 0.731,\n    \"normalized_score\": 0.731,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwq-32b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.570952+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.570952+00:00\",\n    \"benchmark_name\": \"LiveBench\"\n  },\n  {\n    \"model_benchmark_id\": 1118,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"qwq-32b\",\n    \"score\": 0.634,\n    \"normalized_score\": 0.634,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwq-32b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.332752+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.332752+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  
{\n    \"model_benchmark_id\": 495,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"qwq-32b\",\n    \"score\": 0.906,\n    \"normalized_score\": 0.906,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwen-ai.com/qwq-32b/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.034467+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.034467+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwq-32b/model.json",
    "content": "{\n  \"model_id\": \"qwq-32b\",\n  \"name\": \"QwQ-32B\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A model focused on advancing AI reasoning capabilities, particularly excelling in mathematics and programming. Features deep introspection and self-questioning abilities while having some limitations in language mixing and recursive/endless reasoning patterns.\",\n  \"release_date\": \"2025-03-05\",\n  \"announcement_date\": \"2025-03-05\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2024-11-28\",\n  \"param_count\": 32500000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/Qwen/QwQ-32B\",\n  \"source_playground\": \"https://huggingface.co/playground?modelId=Qwen/QwQ-32B\",\n  \"source_paper\": \"https://arxiv.org/abs/2412.15115\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwq-32b/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/QwQ\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/QwQ-32B\",\n  \"created_at\": \"2025-07-19T19:49:05.609393+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.609393+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/models/qwq-32b-preview/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 452,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"qwq-32b-preview\",\n    \"score\": 0.5,\n    \"normalized_score\": 0.5,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwq-32b-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.959852+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.959852+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 300,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"qwq-32b-preview\",\n    \"score\": 0.652,\n    \"normalized_score\": 0.652,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwq-32b-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.675997+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.675997+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1120,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"qwq-32b-preview\",\n    \"score\": 0.5,\n    \"normalized_score\": 0.5,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwq-32b-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.337401+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:13.337401+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 496,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"qwq-32b-preview\",\n    \"score\": 0.906,\n    \"normalized_score\": 0.906,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://qwenlm.github.io/blog/qwq-32b-preview/\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.036449+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.036449+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  }\n]"
  },
  {
    "path": "data/organizations/qwen/models/qwq-32b-preview/model.json",
    "content": "{\n  \"model_id\": \"qwq-32b-preview\",\n  \"name\": \"QwQ-32B-Preview\",\n  \"organization_id\": \"qwen\",\n  \"fine_tuned_from_model_id\": \"qwen-2.5-32b-instruct\",\n  \"description\": \"An experimental research model focused on advancing AI reasoning capabilities, particularly excelling in mathematics and programming. Features deep introspection and self-questioning abilities while having some limitations in language mixing and recursive reasoning patterns.\",\n  \"release_date\": \"2024-11-28\",\n  \"announcement_date\": \"2024-11-28\",\n  \"license_id\": \"apache_2_0\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": \"2024-11-28\",\n  \"param_count\": 32500000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://huggingface.co/Qwen/QwQ-32B-Preview\",\n  \"source_playground\": \"https://huggingface.co/spaces/Qwen/QwQ-32B-Preview\",\n  \"source_paper\": \"https://arxiv.org/abs/2407.10671\",\n  \"source_scorecard_blog_link\": \"https://qwenlm.github.io/blog/qwq-32b-preview/\",\n  \"source_repo_link\": \"https://github.com/QwenLM/Qwen2\",\n  \"source_weights_link\": \"https://huggingface.co/Qwen/QwQ-32B-Preview\",\n  \"created_at\": \"2025-07-19T19:49:05.887027+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.887027+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/qwen/organization.json",
    "content": "{\n  \"organization_id\": \"qwen\",\n  \"name\": \"Alibaba Cloud / Qwen Team\",\n  \"website\": \"https://qwenlm.github.io\",\n  \"description\": \"The Qwen Team from Alibaba Cloud, developing the Qwen series of large language models including state-of-the-art mixture-of-experts and thinking-enabled models\",\n  \"country\": \"CN\",\n  \"created_at\": \"2025-07-19T19:49:05.604449+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/unknown/organization.json",
    "content": "{\n  \"organization_id\": \"unknown\",\n  \"name\": \"Unknown\",\n  \"website\": \"\",\n  \"description\": \"Default organization for missing data\",\n  \"country\": null,\n  \"created_at\": \"2025-08-03T22:06:10.791768+00:00\",\n  \"updated_at\": \"2025-08-03T22:06:10.791768+00:00\"\n}"
  },
  {
    "path": "data/organizations/xai/models/grok-1.5/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 894,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"grok-1.5\",\n    \"score\": 0.856,\n    \"normalized_score\": 0.856,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.861804+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.861804+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 322,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"grok-1.5\",\n    \"score\": 0.359,\n    \"normalized_score\": 0.359,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.711788+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.711788+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1001,\n    \"benchmark_id\": \"gsm8k\",\n    \"model_id\": \"grok-1.5\",\n    \"score\": 0.9,\n    \"normalized_score\": 0.9,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"8-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.092882+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.092882+00:00\",\n    \"benchmark_name\": \"GSM8k\"\n  },\n  {\n    \"model_benchmark_id\": 794,\n    
\"benchmark_id\": \"humaneval\",\n    \"model_id\": \"grok-1.5\",\n    \"score\": 0.741,\n    \"normalized_score\": 0.741,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.660557+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.660557+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 413,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"grok-1.5\",\n    \"score\": 0.506,\n    \"normalized_score\": 0.506,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"4-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.878054+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.878054+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 532,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"grok-1.5\",\n    \"score\": 0.528,\n    \"normalized_score\": 0.528,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.103226+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.103226+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 97,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": 
\"grok-1.5\",\n    \"score\": 0.813,\n    \"normalized_score\": 0.813,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-1.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"5-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.283997+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.283997+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 206,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"grok-1.5\",\n    \"score\": 0.51,\n    \"normalized_score\": 0.51,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.492470+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.492470+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 578,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"grok-1.5\",\n    \"score\": 0.536,\n    \"normalized_score\": 0.536,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"0-shot\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.189264+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.189264+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  }\n]"
  },
  {
    "path": "data/organizations/xai/models/grok-1.5/model.json",
    "content": "{\n  \"model_id\": \"grok-1.5\",\n  \"name\": \"Grok-1.5\",\n  \"organization_id\": \"xai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"An advanced language model with improved reasoning capabilities, particularly excelling in coding and mathematical tasks. Features a 128K token context window and enhanced problem-solving abilities compared to its predecessor.\",\n  \"release_date\": \"2024-03-28\",\n  \"announcement_date\": \"2024-03-28\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://x.ai/api\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://x.ai/blog/grok-1.5\",\n  \"source_repo_link\": \"https://github.com/xai-org/grok-1\",\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.705047+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.705047+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/xai/models/grok-1.5v/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 1259,\n    \"benchmark_id\": \"ai2d\",\n    \"model_id\": \"grok-1.5v\",\n    \"score\": 0.883,\n    \"normalized_score\": 0.883,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-1.5v\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"zero-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.641849+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.641849+00:00\",\n    \"benchmark_name\": \"AI2D\"\n  },\n  {\n    \"model_benchmark_id\": 871,\n    \"benchmark_id\": \"chartqa\",\n    \"model_id\": \"grok-1.5v\",\n    \"score\": 0.761,\n    \"normalized_score\": 0.761,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-1.5v\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"zero-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.817786+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.817786+00:00\",\n    \"benchmark_name\": \"ChartQA\"\n  },\n  {\n    \"model_benchmark_id\": 896,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"grok-1.5v\",\n    \"score\": 0.856,\n    \"normalized_score\": 0.856,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-1.5v\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"zero-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.865566+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.865566+00:00\",\n    \"benchmark_name\": 
\"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 534,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"grok-1.5v\",\n    \"score\": 0.528,\n    \"normalized_score\": 0.528,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-1.5v\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"zero-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.106344+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.106344+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 581,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"grok-1.5v\",\n    \"score\": 0.536,\n    \"normalized_score\": 0.536,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-1.5v\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"zero-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.195047+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.195047+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  },\n  {\n    \"model_benchmark_id\": 1638,\n    \"benchmark_id\": \"realworldqa\",\n    \"model_id\": \"grok-1.5v\",\n    \"score\": 0.687,\n    \"normalized_score\": 0.687,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-1.5v\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"zero-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:14.606610+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:14.606610+00:00\",\n    
\"benchmark_name\": \"RealWorldQA\"\n  },\n  {\n    \"model_benchmark_id\": 915,\n    \"benchmark_id\": \"textvqa\",\n    \"model_id\": \"grok-1.5v\",\n    \"score\": 0.781,\n    \"normalized_score\": 0.781,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-1.5v\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"zero-shot evaluation\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.908800+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.908800+00:00\",\n    \"benchmark_name\": \"TextVQA\"\n  }\n]"
  },
  {
    "path": "data/organizations/xai/models/grok-1.5v/model.json",
    "content": "{\n  \"model_id\": \"grok-1.5v\",\n  \"name\": \"Grok-1.5V\",\n  \"organization_id\": \"xai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"A multimodal model capable of processing text and visual information, including documents, diagrams, charts, screenshots, and photographs. Notable for strong real-world spatial understanding capabilities.\",\n  \"release_date\": \"2024-04-12\",\n  \"announcement_date\": \"2024-04-12\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://x.ai/api\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://x.ai/blog/grok-1.5v\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.717803+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.717803+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/xai/models/grok-2/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 895,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"grok-2\",\n    \"score\": 0.936,\n    \"normalized_score\": 0.936,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.863462+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.863462+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 325,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"grok-2\",\n    \"score\": 0.56,\n    \"normalized_score\": 0.56,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.716230+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.716230+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 795,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"grok-2\",\n    \"score\": 0.884,\n    \"normalized_score\": 0.884,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.662404+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.662404+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    \"model_benchmark_id\": 414,\n   
 \"benchmark_id\": \"math\",\n    \"model_id\": \"grok-2\",\n    \"score\": 0.761,\n    \"normalized_score\": 0.761,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"maj@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.880368+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.880368+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 533,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"grok-2\",\n    \"score\": 0.69,\n    \"normalized_score\": 0.69,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.104885+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.104885+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 98,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"grok-2\",\n    \"score\": 0.875,\n    \"normalized_score\": 0.875,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.285517+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.285517+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 207,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"grok-2\",\n  
  \"score\": 0.755,\n    \"normalized_score\": 0.755,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.494333+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.494333+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 580,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"grok-2\",\n    \"score\": 0.661,\n    \"normalized_score\": 0.661,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.193698+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.193698+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  }\n]"
  },
  {
    "path": "data/organizations/xai/models/grok-2/model.json",
    "content": "{\n  \"model_id\": \"grok-2\",\n  \"name\": \"Grok-2\",\n  \"organization_id\": \"xai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Grok-2 is a frontier language model with state-of-the-art reasoning capabilities, featuring advanced abilities in chat, coding, and reasoning. It demonstrates superior performance in visual math reasoning, document-based question answering, and excels across various academic benchmarks including reasoning, reading comprehension, math, and science.\",\n  \"release_date\": \"2024-08-13\",\n  \"announcement_date\": \"2024-08-13\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://x.ai/api\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://x.ai/blog/grok-2\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.715016+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.715016+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/xai/models/grok-2-mini/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 893,\n    \"benchmark_id\": \"docvqa\",\n    \"model_id\": \"grok-2-mini\",\n    \"score\": 0.932,\n    \"normalized_score\": 0.932,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.860093+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.860093+00:00\",\n    \"benchmark_name\": \"DocVQA\"\n  },\n  {\n    \"model_benchmark_id\": 321,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"grok-2-mini\",\n    \"score\": 0.51,\n    \"normalized_score\": 0.51,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.710285+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.710285+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 793,\n    \"benchmark_id\": \"humaneval\",\n    \"model_id\": \"grok-2-mini\",\n    \"score\": 0.857,\n    \"normalized_score\": 0.857,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"pass@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.658802+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.658802+00:00\",\n    \"benchmark_name\": \"HumanEval\"\n  },\n  {\n    
\"model_benchmark_id\": 412,\n    \"benchmark_id\": \"math\",\n    \"model_id\": \"grok-2-mini\",\n    \"score\": 0.73,\n    \"normalized_score\": 0.73,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"maj@1\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.876593+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.876593+00:00\",\n    \"benchmark_name\": \"MATH\"\n  },\n  {\n    \"model_benchmark_id\": 531,\n    \"benchmark_id\": \"mathvista\",\n    \"model_id\": \"grok-2-mini\",\n    \"score\": 0.681,\n    \"normalized_score\": 0.681,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.101817+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.101817+00:00\",\n    \"benchmark_name\": \"MathVista\"\n  },\n  {\n    \"model_benchmark_id\": 96,\n    \"benchmark_id\": \"mmlu\",\n    \"model_id\": \"grok-2-mini\",\n    \"score\": 0.862,\n    \"normalized_score\": 0.862,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.281643+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.281643+00:00\",\n    \"benchmark_name\": \"MMLU\"\n  },\n  {\n    \"model_benchmark_id\": 205,\n    \"benchmark_id\": 
\"mmlu-pro\",\n    \"model_id\": \"grok-2-mini\",\n    \"score\": 0.72,\n    \"normalized_score\": 0.72,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.490630+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.490630+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 577,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"grok-2-mini\",\n    \"score\": 0.632,\n    \"normalized_score\": 0.632,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-2\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.186961+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.186961+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  }\n]"
  },
  {
    "path": "data/organizations/xai/models/grok-2-mini/model.json",
    "content": "{\n  \"model_id\": \"grok-2-mini\",\n  \"name\": \"Grok-2 mini\",\n  \"organization_id\": \"xai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Grok-2 mini is a smaller, faster variant of Grok-2 that offers a balance between speed and answer quality. While more compact than its larger sibling, it maintains strong capabilities across various tasks including reasoning, coding, and chat interactions.\",\n  \"release_date\": \"2024-08-13\",\n  \"announcement_date\": \"2024-08-13\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://x.ai/api\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": \"https://x.ai/blog/grok-2\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.702680+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.702680+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/xai/models/grok-3/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 475,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"grok-3\",\n    \"score\": 0.933,\n    \"normalized_score\": 0.933,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.003392+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.003392+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 696,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"grok-3\",\n    \"score\": 0.933,\n    \"normalized_score\": 0.933,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.457788+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.457788+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 324,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"grok-3\",\n    \"score\": 0.846,\n    \"normalized_score\": 0.846,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.714708+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.714708+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    
\"model_benchmark_id\": 1142,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"grok-3\",\n    \"score\": 0.794,\n    \"normalized_score\": 0.794,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.402422+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.402422+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 579,\n    \"benchmark_id\": \"mmmu\",\n    \"model_id\": \"grok-3\",\n    \"score\": 0.78,\n    \"normalized_score\": 0.78,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.191844+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.191844+00:00\",\n    \"benchmark_name\": \"MMMU\"\n  }\n]"
  },
  {
    "path": "data/organizations/xai/models/grok-3/model.json",
    "content": "{\n  \"model_id\": \"grok-3\",\n  \"name\": \"Grok-3\",\n  \"organization_id\": \"xai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Grok 3, launched by xAI on February 17, 2025, is an advanced AI model with significantly enhanced capabilities compared to Grok 2, boasting an order of magnitude increase in performance. Trained on a vast dataset that includes legal documents among others, and utilizing a massive compute infrastructure with around 200,000 GPUs in a Memphis data center, Grok 3's training used ten times more compute than its predecessor. It features specialized models like Grok 3 Reasoning and Grok 3 Mini Reasoning for complex problem-solving, and it excels in benchmarks like AIME for mathematics and GPQA for PhD-level science.\",\n  \"release_date\": \"2025-02-17\",\n  \"announcement_date\": \"2025-02-17\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-11-17\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://x.ai/api\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.711845+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.711845+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/xai/models/grok-3-mini/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 474,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"grok-3-mini\",\n    \"score\": 0.958,\n    \"normalized_score\": 0.958,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.001587+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.001587+00:00\",\n    \"benchmark_name\": \"AIME 2024\"\n  },\n  {\n    \"model_benchmark_id\": 693,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"grok-3-mini\",\n    \"score\": 0.908,\n    \"normalized_score\": 0.908,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.452930+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.452930+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 319,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"grok-3-mini\",\n    \"score\": 0.84,\n    \"normalized_score\": 0.84,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.707259+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.707259+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    
\"model_benchmark_id\": 1139,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"grok-3-mini\",\n    \"score\": 0.804,\n    \"normalized_score\": 0.804,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-3\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.394024+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.394024+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  }\n]"
  },
  {
    "path": "data/organizations/xai/models/grok-3-mini/model.json",
    "content": "{\n  \"model_id\": \"grok-3-mini\",\n  \"name\": \"Grok-3 Mini\",\n  \"organization_id\": \"xai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Grok 3 Mini is a streamlined version of xAI's Grok 3 AI model, designed for quicker response times while maintaining utility. It's tailored for users who require speed over the comprehensive capabilities of the full Grok 3 model, making it suitable for tasks where rapid information retrieval is key. Grok 3 Mini still leverages the advanced training and data that Grok 3 was built on but offers a lighter, more efficient version for everyday use.\",\n  \"release_date\": \"2025-02-17\",\n  \"announcement_date\": \"2025-02-17\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-11-17\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://x.ai/api\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.697297+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.697297+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/xai/models/grok-4/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 695,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"grok-4\",\n    \"score\": 0.917,\n    \"normalized_score\": 0.917,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.456102+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.456102+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 1387,\n    \"benchmark_id\": \"arc-agi-v2\",\n    \"model_id\": \"grok-4\",\n    \"score\": 0.159,\n    \"normalized_score\": 0.159,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.922021+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.922021+00:00\",\n    \"benchmark_name\": \"ARC-AGI v2\"\n  },\n  {\n    \"model_benchmark_id\": 323,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"grok-4\",\n    \"score\": 0.875,\n    \"normalized_score\": 0.875,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.713248+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.713248+00:00\",\n    
\"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1799,\n    \"benchmark_id\": \"hmmt25\",\n    \"model_id\": \"grok-4\",\n    \"score\": 0.9,\n    \"normalized_score\": 0.9,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.065811+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.065811+00:00\",\n    \"benchmark_name\": \"HMMT25\"\n  },\n  {\n    \"model_benchmark_id\": 723,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"grok-4\",\n    \"score\": 0.4,\n    \"normalized_score\": 0.4,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.523105+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.523105+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 1141,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"grok-4\",\n    \"score\": 0.79,\n    \"normalized_score\": 0.79,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.399716+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:56:13.399716+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 1801,\n    \"benchmark_id\": \"usamo25\",\n    \"model_id\": \"grok-4\",\n    \"score\": 0.375,\n    \"normalized_score\": 0.375,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.071894+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.071894+00:00\",\n    \"benchmark_name\": \"USAMO25\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/xai/models/grok-4/model.json",
    "content": "{\n  \"model_id\": \"grok-4\",\n  \"name\": \"Grok-4\",\n  \"organization_id\": \"xai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Grok 4, announced by xAI in summer 2025, represents a major leap in AI capabilities, described as 'the smartest AI in the world.' Built on version 6 of xAI's foundation model, it uses 100x more training compute than Grok 2 and 10x more reinforcement learning compute than Grok 3. The model achieves PhD-level performance across all academic disciplines simultaneously, scoring perfect on standardized tests like the SAT and near-perfect on graduate exams like the GRE. Unlike Grok 3, tool usage is built into the training process rather than relying on generalization. Trained using 200,000 GPUs, Grok 4 excels at complex reasoning, mathematical problem-solving, and coding tasks, though it has acknowledged weaknesses in multimodal capabilities that are being addressed in the next version.\",\n  \"release_date\": \"2025-07-09\",\n  \"announcement_date\": \"2025-07-09\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-12-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://x.ai/api\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.707962+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.707962+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/xai/models/grok-4-fast/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 22228,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"grok-4-fast\",\n    \"score\": 0.857,\n    \"normalized_score\": 0.857,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/news/grok-4-fast\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 22229,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"grok-4-fast\",\n    \"score\": 0.920,\n    \"normalized_score\": 0.920,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/news/grok-4-fast\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 22230,\n    \"benchmark_id\": \"hmmt-2025\",\n    \"model_id\": \"grok-4-fast\",\n    \"score\": 0.933,\n    \"normalized_score\": 0.933,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/news/grok-4-fast\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"HMMT 2025\"\n  
},\n  {\n    \"model_benchmark_id\": 22231,\n    \"benchmark_id\": \"hle\",\n    \"model_id\": \"grok-4-fast\",\n    \"score\": 0.200,\n    \"normalized_score\": 0.200,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/news/grok-4-fast\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"HLE\"\n  },\n  {\n    \"model_benchmark_id\": 22232,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"grok-4-fast\",\n    \"score\": 0.800,\n    \"normalized_score\": 0.800,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/news/grok-4-fast\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 22233,\n    \"benchmark_id\": \"browsecomp\",\n    \"model_id\": \"grok-4-fast\",\n    \"score\": 0.449,\n    \"normalized_score\": 0.449,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/news/grok-4-fast\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp\"\n  },\n  
{\n    \"model_benchmark_id\": 22234,\n    \"benchmark_id\": \"simpleqa\",\n    \"model_id\": \"grok-4-fast\",\n    \"score\": 0.950,\n    \"normalized_score\": 0.950,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/news/grok-4-fast\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SimpleQA\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/xai/models/grok-4-fast/model.json",
    "content": "{\n  \"model_id\": \"grok-4-fast\",\n  \"name\": \"Grok 4 Fast\",\n  \"organization_id\": \"xai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Grok 4 Fast is a high-speed variant of Grok-4, optimized for faster inference while maintaining strong reasoning capabilities. It offers improved throughput and lower latency compared to the standard Grok-4 model.\",\n  \"release_date\": \"2025-08-28\",\n  \"announcement_date\": \"2025-08-28\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://x.ai/api\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://data.x.ai/2025-08-26-grok-code-fast-1-model-card.pdf\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-10-11T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-10-11T00:00:00.000000+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/xai/models/grok-4-heavy/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 694,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"grok-4-heavy\",\n    \"score\": 1.0,\n    \"normalized_score\": 1.0,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.454500+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.454500+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 320,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"grok-4-heavy\",\n    \"score\": 0.884,\n    \"normalized_score\": 0.884,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:11.708827+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:11.708827+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 1798,\n    \"benchmark_id\": \"hmmt25\",\n    \"model_id\": \"grok-4-heavy\",\n    \"score\": 0.967,\n    \"normalized_score\": 0.967,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:15.063588+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.063588+00:00\",\n    
\"benchmark_name\": \"HMMT25\"\n  },\n  {\n    \"model_benchmark_id\": 722,\n    \"benchmark_id\": \"humanity's-last-exam\",\n    \"model_id\": \"grok-4-heavy\",\n    \"score\": 0.507,\n    \"normalized_score\": 0.507,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:12.521361+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:12.521361+00:00\",\n    \"benchmark_name\": \"Humanity's Last Exam\"\n  },\n  {\n    \"model_benchmark_id\": 1140,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"grok-4-heavy\",\n    \"score\": 0.794,\n    \"normalized_score\": 0.794,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-19T19:56:13.396669+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:13.396669+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 1800,\n    \"benchmark_id\": \"usamo25\",\n    \"model_id\": \"grok-4-heavy\",\n    \"score\": 0.619,\n    \"normalized_score\": 0.619,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.com/xai/status/1943158495588815072\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"accuracy\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": 
\"2025-07-19T19:56:15.070427+00:00\",\n    \"updated_at\": \"2025-07-19T19:56:15.070427+00:00\",\n    \"benchmark_name\": \"USAMO25\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/xai/models/grok-4-heavy/model.json",
    "content": "{\n  \"model_id\": \"grok-4-heavy\",\n  \"name\": \"Grok-4 Heavy\",\n  \"organization_id\": \"xai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Grok 4 Heavy is the multi-agent version of Grok 4, released alongside the standard model in summer 2025. This system spawns multiple Grok 4 agents in parallel that work independently on problems and then collaborate by comparing their solutions, similar to a study group. The agents share insights and tricks they discover, with the system intelligently combining their work rather than simply using majority voting. Grok 4 Heavy uses approximately 10x more test-time compute than regular Grok 4, enabling it to solve significantly more complex problems. On the Humanities Last Exam, it achieves over 50% accuracy on text-only problems, and it scored a perfect result on the AIME 2025 mathematics competition. The system represents a major advancement in multi-agent AI collaboration and reasoning capabilities.\",\n  \"release_date\": \"2025-07-09\",\n  \"announcement_date\": \"2025-07-09\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": \"2024-12-31\",\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://x.ai/api\",\n  \"source_playground\": null,\n  \"source_paper\": null,\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-07-19T19:49:05.700416+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.700416+00:00\",\n  \"model_family_id\": null\n}"
  },
  {
    "path": "data/organizations/xai/models/grok-code-fast-1/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 22227,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"grok-code-fast-1\",\n    \"score\": 0.708,\n    \"normalized_score\": 0.708,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://x.ai/blog/grok-code-fast-1\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"full subset, internal harness\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-10-03T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-03T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-Bench Verified\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/xai/models/grok-code-fast-1/model.json",
    "content": "{\n  \"model_id\": \"grok-code-fast-1\",\n  \"name\": \"Grok Code Fast 1\",\n  \"organization_id\": \"xai\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"Grok Code Fast 1 is a speedy and economical reasoning model that excels at agentic coding. Built from scratch with a brand-new model architecture, it features a pre-training corpus rich with programming-related content and post-training datasets that reflect real-world pull requests and coding tasks. The model has mastered the use of common tools like grep, terminal, and file editing, making it ideal for integration with IDEs. It is exceptionally versatile across the full software development stack and is particularly adept at TypeScript, Python, Java, Rust, C++, and Go.\",\n  \"release_date\": \"2025-08-28\",\n  \"announcement_date\": \"2025-08-28\",\n  \"license_id\": \"proprietary\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": null,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://x.ai/api\",\n  \"source_playground\": null,\n  \"source_paper\": \"https://data.x.ai/2025-08-26-grok-code-fast-1-model-card.pdf\",\n  \"source_scorecard_blog_link\": \"https://x.ai/blog/grok-code-fast-1\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-10-03T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-10-03T00:00:00.000000+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/xai/organization.json",
    "content": "{\n  \"organization_id\": \"xai\",\n  \"name\": \"xAI\",\n  \"website\": \"https://x.ai\",\n  \"description\": \"Elon Musk AI company\",\n  \"country\": \"US\",\n  \"created_at\": \"2025-07-19T19:49:05.695344+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:05.695344+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/zai-org/models/glm-4.5/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 7001,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.846,\n    \"normalized_score\": 0.846,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 7002,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.91,\n    \"normalized_score\": 0.91,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@32\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME24\"\n  },\n  {\n    \"model_benchmark_id\": 7003,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.982,\n    \"normalized_score\": 0.982,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    
\"model_benchmark_id\": 7004,\n    \"benchmark_id\": \"scicode\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.417,\n    \"normalized_score\": 0.417,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SciCode\"\n  },\n  {\n    \"model_benchmark_id\": 7005,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.791,\n    \"normalized_score\": 0.791,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@8\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 7006,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.729,\n    \"normalized_score\": 0.729,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"2407-2501\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 7007,\n    
\"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.642,\n    \"normalized_score\": 0.642,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenHands v0.34.0\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-bench-Verified\"\n  },\n  {\n    \"model_benchmark_id\": 7008,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.797,\n    \"normalized_score\": 0.797,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"optimized user simulator\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU-bench-Retail\"\n  },\n  {\n    \"model_benchmark_id\": 7009,\n    \"benchmark_id\": \"bfcl-v3\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.778,\n    \"normalized_score\": 0.778,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Full\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BFCL-v3\"\n  },\n  {\n    
\"model_benchmark_id\": 7010,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.604,\n    \"normalized_score\": 0.604,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"optimized user simulator\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU-bench-Airline\"\n  },\n  {\n    \"model_benchmark_id\": 7011,\n    \"benchmark_id\": \"browsecomp\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.264,\n    \"normalized_score\": 0.264,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp\"\n  },\n  {\n    \"model_benchmark_id\": 7012,\n    \"benchmark_id\": \"hle\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.144,\n    \"normalized_score\": 0.144,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"text-based questions only\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"HLE\"\n  },\n  
{\n    \"model_benchmark_id\": 7013,\n    \"benchmark_id\": \"aa-index\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.677,\n    \"normalized_score\": 0.677,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Estimated\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AA-Index\"\n  },\n  {\n    \"model_benchmark_id\": 7014,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"glm-4.5\",\n    \"score\": 0.375,\n    \"normalized_score\": 0.375,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Terminus framework\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Terminal-Bench\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/zai-org/models/glm-4.5/model.json",
    "content": "{\n  \"model_id\": \"glm-4.5\",\n  \"name\": \"GLM-4.5\",\n  \"organization_id\": \"zai-org\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GLM-4.5 is an Agentic, Reasoning, and Coding (ARC) foundation model designed for intelligent agents, featuring 355 billion total parameters with 32 billion active parameters using MoE architecture. Trained on 23T tokens through multi-stage training, it is a hybrid reasoning model that provides two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. The model unifies agentic, reasoning, and coding capabilities with 128K context length support. It achieves exceptional performance with a score of 63.2 across 12 industry-standard benchmarks, placing 3rd among all proprietary and open-source models. Released under MIT open-source license allowing commercial use and secondary development.\",\n  \"release_date\": \"2025-07-28\",\n  \"announcement_date\": \"2025-07-28\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 355000000000,\n  \"training_tokens\": 23000000000000,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.z.ai/guides/llm/glm-4.5\",\n  \"source_playground\": \"https://chat.z.ai\",\n  \"source_paper\": \"https://arxiv.org/pdf/2508.06471\",\n  \"source_scorecard_blog_link\": \"https://z.ai/blog/glm-4.5\",\n  \"source_repo_link\": \"https://github.com/zai-org/GLM-4.5\",\n  \"source_weights_link\": \"https://huggingface.co/zai-org/GLM-4.5\",\n  \"created_at\": \"2025-09-15T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/zai-org/models/glm-4.5-air/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 7101,\n    \"benchmark_id\": \"mmlu-pro\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.814,\n    \"normalized_score\": 0.814,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MMLU-Pro\"\n  },\n  {\n    \"model_benchmark_id\": 7102,\n    \"benchmark_id\": \"aime-2024\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.894,\n    \"normalized_score\": 0.894,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@32\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME24\"\n  },\n  {\n    \"model_benchmark_id\": 7103,\n    \"benchmark_id\": \"math-500\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.981,\n    \"normalized_score\": 0.981,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"MATH-500\"\n  },\n  {\n    
\"model_benchmark_id\": 7104,\n    \"benchmark_id\": \"scicode\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.373,\n    \"normalized_score\": 0.373,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SciCode\"\n  },\n  {\n    \"model_benchmark_id\": 7105,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.75,\n    \"normalized_score\": 0.75,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Avg@8\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 7106,\n    \"benchmark_id\": \"livecodebench\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.707,\n    \"normalized_score\": 0.707,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"2407-2501\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveCodeBench\"\n  },\n  {\n    \"model_benchmark_id\": 
7107,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.576,\n    \"normalized_score\": 0.576,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenHands v0.34.0\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-bench-Verified\"\n  },\n  {\n    \"model_benchmark_id\": 7108,\n    \"benchmark_id\": \"tau-bench-retail\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.779,\n    \"normalized_score\": 0.779,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"optimized user simulator\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU-bench-Retail\"\n  },\n  {\n    \"model_benchmark_id\": 7109,\n    \"benchmark_id\": \"bfcl-v3\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.764,\n    \"normalized_score\": 0.764,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Full\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BFCL-v3\"\n  },\n  
{\n    \"model_benchmark_id\": 7110,\n    \"benchmark_id\": \"tau-bench-airline\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.608,\n    \"normalized_score\": 0.608,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"optimized user simulator\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"TAU-bench-Airline\"\n  },\n  {\n    \"model_benchmark_id\": 7111,\n    \"benchmark_id\": \"browsecomp\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.213,\n    \"normalized_score\": 0.213,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp\"\n  },\n  {\n    \"model_benchmark_id\": 7112,\n    \"benchmark_id\": \"hle\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.106,\n    \"normalized_score\": 0.106,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"text-based questions only\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    
\"benchmark_name\": \"HLE\"\n  },\n  {\n    \"model_benchmark_id\": 7113,\n    \"benchmark_id\": \"aa-index\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.648,\n    \"normalized_score\": 0.648,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Estimated\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AA-Index\"\n  },\n  {\n    \"model_benchmark_id\": 7114,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"glm-4.5-air\",\n    \"score\": 0.3,\n    \"normalized_score\": 0.3,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.5\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"Terminus framework\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-28T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Terminal-Bench\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/zai-org/models/glm-4.5-air/model.json",
    "content": "{\n  \"model_id\": \"glm-4.5-air\",\n  \"name\": \"GLM-4.5-Air\",\n  \"organization_id\": \"zai-org\",\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GLM-4.5-Air is a more compact variant of GLM-4.5 designed for efficient Agentic, Reasoning, and Coding (ARC) applications. It features 106 billion total parameters with 12 billion active parameters using MoE architecture. Like GLM-4.5, it is a hybrid reasoning model providing thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. Despite its compact design, GLM-4.5-Air delivers competitive performance with a score of 59.8 across 12 industry-standard benchmarks, ranking 6th overall while maintaining superior efficiency. It supports 128K context length and is released under MIT open-source license allowing commercial use.\",\n  \"release_date\": \"2025-07-28\",\n  \"announcement_date\": \"2025-07-28\",\n  \"license_id\": \"mit\",\n  \"multimodal\": false,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 106000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.z.ai/guides/llm/glm-4.5\",\n  \"source_playground\": \"https://chat.z.ai\",\n  \"source_paper\": \"https://arxiv.org/pdf/2508.06471\",\n  \"source_scorecard_blog_link\": \"https://z.ai/blog/glm-4.5\",\n  \"source_repo_link\": \"https://github.com/zai-org/GLM-4.5\",\n  \"source_weights_link\": \"https://huggingface.co/zai-org/GLM-4.5-Air\",\n  \"created_at\": \"2025-09-15T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n  \"model_family_id\": null\n}\n"
  },
  {
    "path": "data/organizations/zai-org/models/glm-4.5v/benchmarks.json",
    "content": "[]\n"
  },
  {
    "path": "data/organizations/zai-org/models/glm-4.5v/model.json",
    "content": "{\n  \"model_id\": \"glm-4.5v\",\n  \"name\": \"GLM-4.5V\",\n  \"organization_id\": \"zai-org\",\n  \"model_family_id\": null,\n  \"fine_tuned_from_model_id\": \"glm-4.5-air\",\n  \"description\": \"GLM-4.5V is a multimodal (vision-language) model based on GLM-4.5-Air (106B total, 12B active) that extends hybrid reasoning to images and video. It achieves state-of-the-art results across 40+ VLM benchmarks (image reasoning, video understanding, GUI tasks, chart/document parsing, grounding) while supporting a Thinking Mode switch for deep reasoning. Released under MIT with FP8/BF16 variants and tooling in Transformers, vLLM, and SGLang.\",\n  \"release_date\": \"2025-08-11\",\n  \"announcement_date\": \"2025-08-11\",\n  \"license_id\": \"mit\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 108000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": null,\n  \"source_playground\": \"https://chat.z.ai\",\n  \"source_paper\": \"https://arxiv.org/abs/2507.01006\",\n  \"source_scorecard_blog_link\": null,\n  \"source_repo_link\": \"https://github.com/zai-org/GLM-V/\",\n  \"source_weights_link\": \"https://huggingface.co/zai-org/GLM-4.5V\",\n  \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/zai-org/models/glm-4.6/benchmarks.json",
    "content": "[\n  {\n    \"model_benchmark_id\": 7002,\n    \"benchmark_id\": \"aime-2025\",\n    \"model_id\": \"glm-4.6\",\n    \"score\": 0.939,\n    \"normalized_score\": 0.939,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.6\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"AIME 2025\"\n  },\n  {\n    \"model_benchmark_id\": 7005,\n    \"benchmark_id\": \"gpqa\",\n    \"model_id\": \"glm-4.6\",\n    \"score\": 0.81,\n    \"normalized_score\": 0.81,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.6\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"GPQA\"\n  },\n  {\n    \"model_benchmark_id\": 7006,\n    \"benchmark_id\": \"livecodebench-v6\",\n    \"model_id\": \"glm-4.6\",\n    \"score\": 0.828,\n    \"normalized_score\": 0.828,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.6\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"LiveCodeBench v6\"\n  },\n  {\n    
\"model_benchmark_id\": 7007,\n    \"benchmark_id\": \"swe-bench-verified\",\n    \"model_id\": \"glm-4.6\",\n    \"score\": 0.68,\n    \"normalized_score\": 0.68,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.6\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"OpenHands v0.34.0\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"SWE-bench-Verified\"\n  },\n  {\n    \"model_benchmark_id\": 7011,\n    \"benchmark_id\": \"browsecomp\",\n    \"model_id\": \"glm-4.6\",\n    \"score\": 0.451,\n    \"normalized_score\": 0.451,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.6\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"BrowseComp\"\n  },\n  {\n    \"model_benchmark_id\": 7012,\n    \"benchmark_id\": \"hle\",\n    \"model_id\": \"glm-4.6\",\n    \"score\": 0.172,\n    \"normalized_score\": 0.172,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.6\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"HLE\"\n  },\n  {\n    
\"model_benchmark_id\": 7014,\n    \"benchmark_id\": \"terminal-bench\",\n    \"model_id\": \"glm-4.6\",\n    \"score\": 0.405,\n    \"normalized_score\": 0.405,\n    \"is_self_reported\": true,\n    \"self_reported_source_link\": \"https://z.ai/blog/glm-4.6\",\n    \"verified_by_llmstats\": false,\n    \"analysis_method\": \"standard\",\n    \"verification_provider_id\": null,\n    \"verification_hardware\": null,\n    \"verification_date\": null,\n    \"verification_notes\": null,\n    \"created_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-30T00:00:00.000000+00:00\",\n    \"benchmark_name\": \"Terminal-Bench\"\n  }\n]\n"
  },
  {
    "path": "data/organizations/zai-org/models/glm-4.6/model.json",
    "content": "{\n  \"model_id\": \"glm-4.6\",\n  \"name\": \"GLM-4.6\",\n  \"organization_id\": \"zai-org\",\n  \"model_family_id\": null,\n  \"fine_tuned_from_model_id\": null,\n  \"description\": \"GLM-4.6 is the latest version of Z.ai's flagship model, bringing significant improvements over GLM-4.5. Key features include: 200K token context window (expanded from 128K), superior coding performance with better real-world application in Claude Code/Cline/Roo Code/Kilo Code, advanced reasoning with tool use during inference, stronger agent capabilities, and refined writing aligned with human preferences. GLM-4.6 achieves competitive performance with DeepSeek-V3.2-Exp and Claude Sonnet 4, reaching near parity with Claude Sonnet 4 (48.6% win rate) on CC-Bench real-world coding tasks.\",\n  \"release_date\": \"2025-09-30\",\n  \"announcement_date\": \"2025-09-30\",\n  \"license_id\": \"mit\",\n  \"multimodal\": true,\n  \"knowledge_cutoff\": null,\n  \"param_count\": 357000000000,\n  \"training_tokens\": null,\n  \"available_in_zeroeval\": true,\n  \"source_api_ref\": \"https://docs.z.ai/guides/llm/glm-4.6\",\n  \"source_playground\": \"https://chat.z.ai\",\n  \"source_paper\": \"https://arxiv.org/pdf/2508.06471\",\n  \"source_scorecard_blog_link\": \"https://huggingface.co/zai-org/GLM-4.6\",\n  \"source_repo_link\": null,\n  \"source_weights_link\": null,\n  \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-30T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/organizations/zai-org/organization.json",
    "content": "{\n  \"organization_id\": \"zai-org\",\n  \"name\": \"Zhipu AI\",\n  \"website\": \"https://z.ai\",\n  \"description\": \"Zhipu AI is a Chinese AI company that provides a suite of AI tools and services.\",\n  \"country\": \"CN\",\n  \"created_at\": \"2025-09-15T00:00:00.000000+00:00\",\n  \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\"\n}\n"
  },
  {
    "path": "data/providers/anthropic/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 398,\n    \"model_id\": \"claude-3-5-haiku-20241022\",\n    \"provider_id\": \"anthropic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 100,\n    \"output_cents_per_million_tokens\": 500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 100.0,\n    \"latency\": 0.3,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.073101+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.073101+00:00\",\n    \"provider_model_id_used\": \"claude-3-5-haiku-20241022\",\n    \"model_name\": \"Claude 3.5 Haiku\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 397,\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"provider_id\": \"anthropic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    
\"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.071608+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.071608+00:00\",\n    \"provider_model_id_used\": \"claude-3-5-sonnet-20241022\",\n    \"model_name\": \"Claude 3.5 Sonnet\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 402,\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"provider_id\": \"anthropic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.082450+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.082450+00:00\",\n    \"provider_model_id_used\": \"claude-3-7-sonnet-20250219\",\n    \"model_name\": \"Claude 3.7 Sonnet\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 401,\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"provider_id\": \"anthropic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 25,\n    \"output_cents_per_million_tokens\": 125,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    
\"max_output_tokens\": 200000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.080579+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.080579+00:00\",\n    \"provider_model_id_used\": \"claude-3-haiku-20240307\",\n    \"model_name\": \"Claude 3 Haiku\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 399,\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"provider_id\": \"anthropic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 7500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.075485+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.075485+00:00\",\n    \"provider_model_id_used\": 
\"claude-3-opus-20240229\",\n    \"model_name\": \"Claude 3 Opus\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 400,\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"provider_id\": \"anthropic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.078602+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.078602+00:00\",\n    \"provider_model_id_used\": \"claude-3-sonnet-20240229\",\n    \"model_name\": \"Claude 3 Sonnet\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 404,\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"provider_id\": \"anthropic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 7500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    
\"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.086661+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.086661+00:00\",\n    \"provider_model_id_used\": \"claude-opus-4-20250514\",\n    \"model_name\": \"Claude Opus 4\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 405,\n    \"model_id\": \"claude-opus-4-1-20250805\",\n    \"provider_id\": \"anthropic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 7500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 32000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"claude-opus-4-1-20250805\",\n    \"model_name\": \"Claude Opus 4.1\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 403,\n    \"model_id\": \"claude-sonnet-4-20250514\",\n    \"provider_id\": \"anthropic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    
\"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.084616+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.084616+00:00\",\n    \"provider_model_id_used\": \"claude-sonnet-4-20250514\",\n    \"model_name\": \"Claude Sonnet 4\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 406,\n    \"model_id\": \"claude-sonnet-4-5-20250929\",\n    \"provider_id\": \"anthropic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 64000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": true,\n    \"input_modality_video\": true,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.084616+00:00\",\n    \"updated_at\": 
\"2025-07-19T19:49:17.084616+00:00\",\n    \"provider_model_id_used\": \"claude-sonnet-4-5-20250929\",\n    \"model_name\": \"Claude Sonnet 4.5\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 407,\n    \"model_id\": \"claude-haiku-4-5-20251015\",\n    \"provider_id\": \"anthropic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 100,\n    \"output_cents_per_million_tokens\": 500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 100.0,\n    \"latency\": 0.3,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-15T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"claude-haiku-4-5-20251015\",\n    \"model_name\": \"Claude Haiku 4.5\",\n    \"organization_id\": \"anthropic\"\n  }\n]\n"
  },
  {
    "path": "data/providers/anthropic/provider.json",
    "content": "{\n  \"provider_id\": \"anthropic\",\n  \"name\": \"Anthropic\",\n  \"website\": \"https://anthropic.com\",\n  \"created_at\": \"2025-07-19T19:49:17.069874+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:17.069874+00:00\"\n}\n"
  },
  {
    "path": "data/providers/azure/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 261,\n    \"model_id\": \"gpt-3.5-turbo-0125\",\n    \"provider_id\": \"azure\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 50,\n    \"output_cents_per_million_tokens\": 150,\n    \"quantization\": null,\n    \"max_input_tokens\": 16385,\n    \"max_output_tokens\": 4096,\n    \"throughput\": 90.0,\n    \"latency\": 0.8,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.759540+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.759540+00:00\",\n    \"provider_model_id_used\": \"gpt-3.5-turbo-0125\",\n    \"model_name\": \"GPT-3.5 Turbo\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 259,\n    \"model_id\": \"gpt-4-0613\",\n    \"provider_id\": \"azure\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 3000,\n    \"output_cents_per_million_tokens\": 6000,\n    \"quantization\": null,\n    \"max_input_tokens\": 32768,\n    \"max_output_tokens\": 32768,\n    \"throughput\": 104.0,\n    \"latency\": 0.3,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": 
true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.751649+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.751649+00:00\",\n    \"provider_model_id_used\": \"gpt-4-0613\",\n    \"model_name\": \"GPT-4\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 264,\n    \"model_id\": \"gpt-4o-2024-05-13\",\n    \"provider_id\": \"azure\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 250,\n    \"output_cents_per_million_tokens\": 1000,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 4096,\n    \"throughput\": 92.0,\n    \"latency\": 0.54,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.767540+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.767540+00:00\",\n    \"provider_model_id_used\": \"gpt-4o-2024-05-13\",\n    \"model_name\": \"GPT-4o\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 263,\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"provider_id\": \"azure\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 250,\n    \"output_cents_per_million_tokens\": 1000,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 16384,\n    \"throughput\": 99.0,\n    \"latency\": 0.53,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": 
true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.765163+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.765163+00:00\",\n    \"provider_model_id_used\": \"gpt-4o-2024-08-06\",\n    \"model_name\": \"GPT-4o\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 262,\n    \"model_id\": \"gpt-4o-mini-2024-07-18\",\n    \"provider_id\": \"azure\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 16384,\n    \"throughput\": 92.0,\n    \"latency\": 0.52,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.762692+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.762692+00:00\",\n    \"provider_model_id_used\": \"gpt-4o-mini-2024-07-18\",\n    \"model_name\": \"GPT-4o mini\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 260,\n    \"model_id\": \"gpt-4-turbo-2024-04-09\",\n    
\"provider_id\": \"azure\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1000,\n    \"output_cents_per_million_tokens\": 3000,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 4096,\n    \"throughput\": 97.0,\n    \"latency\": 0.6,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.755438+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.755438+00:00\",\n    \"provider_model_id_used\": \"gpt-4-turbo-2024-04-09\",\n    \"model_name\": \"GPT-4 Turbo\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 266,\n    \"model_id\": \"o1-2024-12-17\",\n    \"provider_id\": \"azure\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 6000,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 100000,\n    \"throughput\": 16.0,\n    \"latency\": 0.54,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    
\"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.772502+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.772502+00:00\",\n    \"provider_model_id_used\": \"o1-2024-12-17\",\n    \"model_name\": \"o1\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 267,\n    \"model_id\": \"o1-mini\",\n    \"provider_id\": \"azure\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 330,\n    \"output_cents_per_million_tokens\": 1320,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 65536,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.774395+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.774395+00:00\",\n    \"provider_model_id_used\": \"o1-mini\",\n    \"model_name\": \"o1-mini\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 265,\n    \"model_id\": \"o1-preview\",\n    \"provider_id\": \"azure\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1650,\n    \"output_cents_per_million_tokens\": 6600,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 32768,\n    \"throughput\": 16.0,\n    \"latency\": 0.54,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    
\"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.770395+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.770395+00:00\",\n    \"provider_model_id_used\": \"o1-preview\",\n    \"model_name\": \"o1-preview\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 268,\n    \"model_id\": \"o3-mini\",\n    \"provider_id\": \"azure\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 110,\n    \"output_cents_per_million_tokens\": 440,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 100000,\n    \"throughput\": 115.0,\n    \"latency\": 5.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.776480+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.776480+00:00\",\n    \"provider_model_id_used\": \"o3-mini\",\n    \"model_name\": \"o3-mini\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 269,\n    \"model_id\": \"phi-3.5-mini-instruct\",\n    \"provider_id\": \"azure\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    
\"output_cents_per_million_tokens\": 10,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 23.0,\n    \"latency\": 0.52,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.778852+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.778852+00:00\",\n    \"provider_model_id_used\": \"phi-3.5-mini-instruct\",\n    \"model_name\": \"Phi-3.5-mini-instruct\",\n    \"organization_id\": \"microsoft\"\n  }\n]"
  },
  {
    "path": "data/providers/azure/provider.json",
    "content": "{\n  \"provider_id\": \"azure\",\n  \"name\": \"Azure\",\n  \"website\": \"https://azure.microsoft.com\",\n  \"created_at\": \"2025-07-19T19:49:16.749000+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:16.749000+00:00\"\n}"
  },
  {
    "path": "data/providers/bedrock/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 369,\n    \"model_id\": \"claude-3-5-haiku-20241022\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 80,\n    \"output_cents_per_million_tokens\": 400,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 104.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.009862+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.009862+00:00\",\n    \"provider_model_id_used\": \"claude-3-5-haiku-20241022\",\n    \"model_name\": \"Claude 3.5 Haiku\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 368,\n    \"model_id\": \"claude-3-5-sonnet-20240620\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 101.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": 
false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.007722+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.007722+00:00\",\n    \"provider_model_id_used\": \"claude-3-5-sonnet-20240620\",\n    \"model_name\": \"Claude 3.5 Sonnet\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 367,\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 101.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.005765+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.005765+00:00\",\n    \"provider_model_id_used\": \"claude-3-5-sonnet-20241022\",\n    \"model_name\": \"Claude 3.5 Sonnet\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 385,\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 128000,\n    
\"throughput\": 101.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.041625+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.041625+00:00\",\n    \"provider_model_id_used\": \"claude-3-7-sonnet-20250219\",\n    \"model_name\": \"Claude 3.7 Sonnet\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 372,\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 25,\n    \"output_cents_per_million_tokens\": 125,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 104.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.016542+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.016542+00:00\",\n    \"provider_model_id_used\": \"claude-3-haiku-20240307\",\n    \"model_name\": \"Claude 
3 Haiku\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 370,\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 7500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 120.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.011523+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.011523+00:00\",\n    \"provider_model_id_used\": \"claude-3-opus-20240229\",\n    \"model_name\": \"Claude 3 Opus\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 371,\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 120.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": 
false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.014573+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.014573+00:00\",\n    \"provider_model_id_used\": \"claude-3-sonnet-20240229\",\n    \"model_name\": \"Claude 3 Sonnet\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 387,\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 7500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 120.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.046935+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.046935+00:00\",\n    \"provider_model_id_used\": \"claude-opus-4-20250514\",\n    \"model_name\": \"Claude Opus 4\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 388,\n    \"model_id\": \"claude-opus-4-1-20250805\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 7500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    
\"max_output_tokens\": 32000,\n    \"throughput\": 120.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"claude-opus-4-1-20250805\",\n    \"model_name\": \"Claude Opus 4.1\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 386,\n    \"model_id\": \"claude-sonnet-4-20250514\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 101.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.044184+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.044184+00:00\",\n    \"provider_model_id_used\": 
\"claude-sonnet-4-20250514\",\n    \"model_name\": \"Claude Sonnet 4\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 381,\n    \"model_id\": \"command-r-plus-04-2024\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.034365+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.034365+00:00\",\n    \"provider_model_id_used\": \"command-r-plus-04-2024\",\n    \"model_name\": \"Command R+\",\n    \"organization_id\": \"cohere\"\n  },\n  {\n    \"model_provider_id\": 374,\n    \"model_id\": \"jamba-1.5-large\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 200,\n    \"output_cents_per_million_tokens\": 800,\n    \"quantization\": null,\n    \"max_input_tokens\": 256000,\n    \"max_output_tokens\": 256000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": 
false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.020432+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.020432+00:00\",\n    \"provider_model_id_used\": \"jamba-1.5-large\",\n    \"model_name\": \"Jamba 1.5 Large\",\n    \"organization_id\": \"ai21\"\n  },\n  {\n    \"model_provider_id\": 373,\n    \"model_id\": \"jamba-1.5-mini\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 256144,\n    \"max_output_tokens\": 256144,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.018357+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.018357+00:00\",\n    \"provider_model_id_used\": \"jamba-1.5-mini\",\n    \"model_name\": \"Jamba 1.5 Mini\",\n    \"organization_id\": \"ai21\"\n  },\n  {\n    \"model_provider_id\": 376,\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 300,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    
\"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.024314+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.024314+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-405b-instruct\",\n    \"model_name\": \"Llama 3.1 405B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 375,\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 89,\n    \"output_cents_per_million_tokens\": 89,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.022256+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.022256+00:00\",\n    \"provider_model_id_used\": 
\"llama-3.1-70b-instruct\",\n    \"model_name\": \"Llama 3.1 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 377,\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 22,\n    \"output_cents_per_million_tokens\": 22,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.026582+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.026582+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-8b-instruct\",\n    \"model_name\": \"Llama 3.1 8B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 378,\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 16,\n    \"output_cents_per_million_tokens\": 16,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    
\"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.028853+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.028853+00:00\",\n    \"provider_model_id_used\": \"llama-3.2-11b-instruct\",\n    \"model_name\": \"Llama 3.2 11B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 379,\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 72,\n    \"output_cents_per_million_tokens\": 72,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.030727+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.030727+00:00\",\n    \"provider_model_id_used\": \"llama-3.2-90b-instruct\",\n    \"model_name\": \"Llama 3.2 90B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 380,\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 72,\n    \"output_cents_per_million_tokens\": 72,\n    \"quantization\": 
null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.032478+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.032478+00:00\",\n    \"provider_model_id_used\": \"llama-3.3-70b-instruct\",\n    \"model_name\": \"Llama 3.3 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 383,\n    \"model_id\": \"nova-lite\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 6,\n    \"output_cents_per_million_tokens\": 24,\n    \"quantization\": null,\n    \"max_input_tokens\": 300000,\n    \"max_output_tokens\": 2048,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.037841+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.037841+00:00\",\n    
\"provider_model_id_used\": \"nova-lite\",\n    \"model_name\": \"Nova Lite\",\n    \"organization_id\": \"amazon\"\n  },\n  {\n    \"model_provider_id\": 382,\n    \"model_id\": \"nova-micro\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 3,\n    \"output_cents_per_million_tokens\": 14,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.036065+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.036065+00:00\",\n    \"provider_model_id_used\": \"nova-micro\",\n    \"model_name\": \"Nova Micro\",\n    \"organization_id\": \"amazon\"\n  },\n  {\n    \"model_provider_id\": 384,\n    \"model_id\": \"nova-pro\",\n    \"provider_id\": \"bedrock\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 80,\n    \"output_cents_per_million_tokens\": 320,\n    \"quantization\": null,\n    \"max_input_tokens\": 300000,\n    \"max_output_tokens\": 300000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    
\"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.039606+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.039606+00:00\",\n    \"provider_model_id_used\": \"nova-pro\",\n    \"model_name\": \"Nova Pro\",\n    \"organization_id\": \"amazon\"\n  }\n]\n"
  },
  {
    "path": "data/providers/bedrock/provider.json",
    "content": "{\n  \"provider_id\": \"bedrock\",\n  \"name\": \"Bedrock\",\n  \"website\": \"https://aws.amazon.com/bedrock/\",\n  \"created_at\": \"2025-07-19T19:49:17.004009+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:17.004009+00:00\"\n}"
  },
  {
    "path": "data/providers/cerebras/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 405,\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"provider_id\": \"cerebras\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 60,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 1204.0,\n    \"latency\": 0.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.090362+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.090362+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-70b-instruct\",\n    \"model_name\": \"Llama 3.1 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 406,\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"provider_id\": \"cerebras\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 10,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 2047.0,\n    \"latency\": 0.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n 
   \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.092709+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.092709+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-8b-instruct\",\n    \"model_name\": \"Llama 3.1 8B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 407,\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"provider_id\": \"cerebras\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 70,\n    \"output_cents_per_million_tokens\": 80,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 2220.0,\n    \"latency\": 0.65,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.095252+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.095252+00:00\",\n    \"provider_model_id_used\": \"llama-3.3-70b-instruct\",\n    \"model_name\": \"Llama 3.3 70B Instruct\",\n    \"organization_id\": \"meta\"\n  }\n]"
  },
  {
    "path": "data/providers/cerebras/provider.json",
    "content": "{\n  \"provider_id\": \"cerebras\",\n  \"name\": \"Cerebras\",\n  \"website\": \"https://cerebras.ai\",\n  \"created_at\": \"2025-07-19T19:49:17.088130+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:17.088130+00:00\"\n}"
  },
  {
    "path": "data/providers/cohere/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 238,\n    \"model_id\": \"command-r-plus-04-2024\",\n    \"provider_id\": \"cohere\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 25,\n    \"output_cents_per_million_tokens\": 100,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 59.0,\n    \"latency\": 0.65,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.693641+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.693641+00:00\",\n    \"provider_model_id_used\": \"command-r-plus-04-2024\",\n    \"model_name\": \"Command R+\",\n    \"organization_id\": \"cohere\"\n  }\n]"
  },
  {
    "path": "data/providers/cohere/provider.json",
    "content": "{\n  \"provider_id\": \"cohere\",\n  \"name\": \"Cohere\",\n  \"website\": \"https://cohere.ai\",\n  \"created_at\": \"2025-07-19T19:49:16.663117+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:16.663117+00:00\"\n}\n"
  },
  {
    "path": "data/providers/deepinfra/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 290,\n    \"model_id\": \"deepseek-r1\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 85,\n    \"output_cents_per_million_tokens\": 250,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 0.9,\n    \"latency\": 0.3,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.830887+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.830887+00:00\",\n    \"provider_model_id_used\": \"deepseek-r1\",\n    \"model_name\": \"DeepSeek-R1\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 304,\n    \"model_id\": \"deepseek-r1-0528\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 50,\n    \"output_cents_per_million_tokens\": 215,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 45.04,\n    \"latency\": 0.61,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": 
true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.862375+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.862375+00:00\",\n    \"provider_model_id_used\": \"deepseek-r1-0528\",\n    \"model_name\": \"DeepSeek-R1-0528\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 298,\n    \"model_id\": \"deepseek-r1-distill-llama-70b\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 37.0,\n    \"latency\": 0.65,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.847437+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.847437+00:00\",\n    \"provider_model_id_used\": \"deepseek-r1-distill-llama-70b\",\n    \"model_name\": \"DeepSeek R1 Distill Llama 70B\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 299,\n    \"model_id\": \"deepseek-r1-distill-qwen-32b\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 12,\n    \"output_cents_per_million_tokens\": 18,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 37.0,\n    
\"latency\": 0.65,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.849673+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.849673+00:00\",\n    \"provider_model_id_used\": \"deepseek-r1-distill-qwen-32b\",\n    \"model_name\": \"DeepSeek R1 Distill Qwen 32B\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 284,\n    \"model_id\": \"deepseek-v2.5\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 70,\n    \"output_cents_per_million_tokens\": 140,\n    \"quantization\": null,\n    \"max_input_tokens\": 8192,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 63.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.819006+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.819006+00:00\",\n    \"provider_model_id_used\": \"deepseek-v2.5\",\n    \"model_name\": \"DeepSeek-V2.5\",\n    \"organization_id\": 
\"deepseek\"\n  },\n  {\n    \"model_provider_id\": 305,\n    \"model_id\": \"deepseek-v3.1\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 27,\n    \"output_cents_per_million_tokens\": 100,\n    \"quantization\": \"int4\",\n    \"max_input_tokens\": 163840,\n    \"max_output_tokens\": 163840,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"deepseek-ai/DeepSeek-V3.1\",\n    \"model_name\": \"DeepSeek V3.1\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 306,\n    \"model_id\": \"glm-4.5\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 40,\n    \"output_cents_per_million_tokens\": 160,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": false,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    
\"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"zai-org/GLM-4.5\",\n    \"model_name\": \"GLM-4.5\",\n    \"organization_id\": \"zai-org\"\n  },\n  {\n    \"model_provider_id\": 307,\n    \"model_id\": \"gpt-oss-120b\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 9,\n    \"output_cents_per_million_tokens\": 45,\n    \"quantization\": \"int4\",\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-15T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"openai/gpt-oss-120b\",\n    \"model_name\": \"GPT-OSS-120B\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 294,\n    \"model_id\": \"gemma-3-12b-it\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 5,\n    \"output_cents_per_million_tokens\": 10,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 33.0,\n    \"latency\": 0.2,\n    \"feature_web_search\": 
false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.839147+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.839147+00:00\",\n    \"provider_model_id_used\": \"gemma-3-12b-it\",\n    \"model_name\": \"Gemma 3 12B\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 295,\n    \"model_id\": \"gemma-3-27b-it\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 20,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 33.0,\n    \"latency\": 0.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.841300+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.841300+00:00\",\n    \"provider_model_id_used\": \"gemma-3-27b-it\",\n    \"model_name\": \"Gemma 3 27B\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 293,\n    \"model_id\": 
\"gemma-3-4b-it\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 2,\n    \"output_cents_per_million_tokens\": 4,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 33.0,\n    \"latency\": 0.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.837297+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.837297+00:00\",\n    \"provider_model_id_used\": \"gemma-3-4b-it\",\n    \"model_name\": \"Gemma 3 4B\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 281,\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 179,\n    \"output_cents_per_million_tokens\": 179,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 27.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    
\"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.812645+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.812645+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-405b-instruct\",\n    \"model_name\": \"Llama 3.1 405B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 279,\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 35,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 25.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.808506+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.808506+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-70b-instruct\",\n    \"model_name\": \"Llama 3.1 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 280,\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 5,\n    \"output_cents_per_million_tokens\": 5,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 118.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    
\"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.810724+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.810724+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-8b-instruct\",\n    \"model_name\": \"Llama 3.1 8B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 283,\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 5,\n    \"output_cents_per_million_tokens\": 5,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 108.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.817103+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.817103+00:00\",\n    \"provider_model_id_used\": \"llama-3.2-11b-instruct\",\n    \"model_name\": \"Llama 3.2 11B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 
289,\n    \"model_id\": \"llama-3.2-3b-instruct\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1,\n    \"output_cents_per_million_tokens\": 2,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 171.5,\n    \"latency\": 0.24,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.828875+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.828875+00:00\",\n    \"provider_model_id_used\": \"llama-3.2-3b-instruct\",\n    \"model_name\": \"Llama 3.2 3B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 282,\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 35,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 24.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    
\"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.814472+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.814472+00:00\",\n    \"provider_model_id_used\": \"llama-3.2-90b-instruct\",\n    \"model_name\": \"Llama 3.2 90B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 288,\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 23,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 37.0,\n    \"latency\": 0.65,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.827019+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.827019+00:00\",\n    \"provider_model_id_used\": \"llama-3.3-70b-instruct\",\n    \"model_name\": \"Llama 3.3 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 296,\n    \"model_id\": \"llama-4-maverick\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 17,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 1000000,\n    \"max_output_tokens\": 1000000,\n    \"throughput\": 83.59,\n    \"latency\": 0.38,\n    
\"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.843444+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.843444+00:00\",\n    \"provider_model_id_used\": \"llama-4-maverick\",\n    \"model_name\": \"Llama 4 Maverick\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 297,\n    \"model_id\": \"llama-4-scout\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 8,\n    \"output_cents_per_million_tokens\": 30,\n    \"quantization\": null,\n    \"max_input_tokens\": 10000000,\n    \"max_output_tokens\": 10000000,\n    \"throughput\": 76.1,\n    \"latency\": 0.31,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.845085+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.845085+00:00\",\n    \"provider_model_id_used\": \"llama-4-scout\",\n    \"model_name\": \"Llama 4 Scout\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 
291,\n    \"model_id\": \"mistral-small-24b-instruct-2501\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 7,\n    \"output_cents_per_million_tokens\": 14,\n    \"quantization\": null,\n    \"max_input_tokens\": 32000,\n    \"max_output_tokens\": 32000,\n    \"throughput\": 49.0,\n    \"latency\": 0.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.832954+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.832954+00:00\",\n    \"provider_model_id_used\": \"mistral-small-24b-instruct-2501\",\n    \"model_name\": \"Mistral Small 3 24B Instruct\",\n    \"organization_id\": \"mistral\"\n  },\n  {\n    \"model_provider_id\": 292,\n    \"model_id\": \"phi-4\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 7,\n    \"output_cents_per_million_tokens\": 14,\n    \"quantization\": null,\n    \"max_input_tokens\": 16000,\n    \"max_output_tokens\": 16000,\n    \"throughput\": 33.0,\n    \"latency\": 0.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    
\"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.835314+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.835314+00:00\",\n    \"provider_model_id_used\": \"phi-4\",\n    \"model_name\": \"Phi 4\",\n    \"organization_id\": \"microsoft\"\n  },\n  {\n    \"model_provider_id\": 300,\n    \"model_id\": \"phi-4-multimodal-instruct\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 5,\n    \"output_cents_per_million_tokens\": 10,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 25.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.852868+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.852868+00:00\",\n    \"provider_model_id_used\": \"phi-4-multimodal-instruct\",\n    \"model_name\": \"Phi-4-multimodal-instruct\",\n    \"organization_id\": \"microsoft\"\n  },\n  {\n    \"model_provider_id\": 286,\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 35,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 10.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    
\"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.822329+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.822329+00:00\",\n    \"provider_model_id_used\": \"qwen-2.5-72b-instruct\",\n    \"model_name\": \"Qwen2.5 72B Instruct\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 285,\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 18,\n    \"output_cents_per_million_tokens\": 18,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 44.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.820492+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.820492+00:00\",\n    \"provider_model_id_used\": \"qwen-2.5-coder-32b-instruct\",\n    \"model_name\": \"Qwen2.5-Coder 32B Instruct\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    
\"model_provider_id\": 301,\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 21.74,\n    \"latency\": 1.23,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.855452+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.855452+00:00\",\n    \"provider_model_id_used\": \"qwen3-235b-a22b\",\n    \"model_name\": \"Qwen3 235B A22B\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 303,\n    \"model_id\": \"qwen3-30b-a3b\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 30,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 82.57,\n    \"latency\": 0.84,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    
\"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.859780+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.859780+00:00\",\n    \"provider_model_id_used\": \"qwen3-30b-a3b\",\n    \"model_name\": \"Qwen3 30B A3B\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 302,\n    \"model_id\": \"qwen3-32b\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 30,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 26.95,\n    \"latency\": 1.19,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.857468+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.857468+00:00\",\n    \"provider_model_id_used\": \"qwen3-32b\",\n    \"model_name\": \"Qwen3 32B\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 287,\n    \"model_id\": \"qwq-32b-preview\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 32768,\n    \"max_output_tokens\": 32768,\n    \"throughput\": 76.04,\n    \"latency\": 0.44,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    
\"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.825039+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.825039+00:00\",\n    \"provider_model_id_used\": \"qwq-32b-preview\",\n    \"model_name\": \"QwQ-32B-Preview\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 288,\n    \"model_id\": \"glm-4.6\",\n    \"provider_id\": \"deepinfra\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 60,\n    \"output_cents_per_million_tokens\": 200,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 65536,\n    \"throughput\": 85.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": true,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-30T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-30T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"zai-org/GLM-4.6\",\n    \"model_name\": \"GLM-4.6\",\n    \"organization_id\": \"zai-org\"\n  }\n]\n"
  },
  {
    "path": "data/providers/deepinfra/provider.json",
    "content": "{\n  \"provider_id\": \"deepinfra\",\n  \"name\": \"DeepInfra\",\n  \"website\": \"https://deepinfra.com/\",\n  \"created_at\": \"2025-07-19T19:49:16.806529+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:16.806529+00:00\"\n}"
  },
  {
    "path": "data/providers/deepseek/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 361,\n    \"model_id\": \"deepseek-r1\",\n    \"provider_id\": \"deepseek\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 55,\n    \"output_cents_per_million_tokens\": 219,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 9.0,\n    \"latency\": 0.3,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.991378+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.991378+00:00\",\n    \"provider_model_id_used\": \"deepseek-r1\",\n    \"model_name\": \"DeepSeek-R1\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 362,\n    \"model_id\": \"deepseek-r1-0528\",\n    \"provider_id\": \"deepseek\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 55,\n    \"output_cents_per_million_tokens\": 219,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 9.0,\n    \"latency\": 0.3,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n  
  \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.993656+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.993656+00:00\",\n    \"provider_model_id_used\": \"deepseek-r1-0528\",\n    \"model_name\": \"DeepSeek-R1-0528\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 359,\n    \"model_id\": \"deepseek-v2.5\",\n    \"provider_id\": \"deepseek\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 14,\n    \"output_cents_per_million_tokens\": 28,\n    \"quantization\": null,\n    \"max_input_tokens\": 8192,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.987664+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.987664+00:00\",\n    \"provider_model_id_used\": \"deepseek-v2.5\",\n    \"model_name\": \"DeepSeek-V2.5\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 360,\n    \"model_id\": \"deepseek-v3\",\n    \"provider_id\": \"deepseek\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 27,\n    \"output_cents_per_million_tokens\": 110,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    
\"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.989355+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.989355+00:00\",\n    \"provider_model_id_used\": \"deepseek-v3\",\n    \"model_name\": \"DeepSeek-V3\",\n    \"organization_id\": \"deepseek\"\n  }\n]"
  },
  {
    "path": "data/providers/deepseek/provider.json",
    "content": "{\n  \"provider_id\": \"deepseek\",\n  \"name\": \"DeepSeek\",\n  \"website\": \"https://deepseek.com/\",\n  \"created_at\": \"2025-07-19T19:49:16.986078+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:16.986078+00:00\"\n}\n"
  },
  {
    "path": "data/providers/fireworks/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 340,\n    \"model_id\": \"deepseek-r1\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 800,\n    \"output_cents_per_million_tokens\": 800,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 2.1,\n    \"latency\": 0.3,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.942224+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.942224+00:00\",\n    \"provider_model_id_used\": \"deepseek-r1\",\n    \"model_name\": \"DeepSeek-R1\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 331,\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 300,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 78.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    
\"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.923810+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.923810+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-405b-instruct\",\n    \"model_name\": \"Llama 3.1 405B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 332,\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 89,\n    \"output_cents_per_million_tokens\": 89,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 32.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.926263+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.926263+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-70b-instruct\",\n    \"model_name\": \"Llama 3.1 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 333,\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 20,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 292.0,\n    
\"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.928500+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.928500+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-8b-instruct\",\n    \"model_name\": \"Llama 3.1 8B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 335,\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 20,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 125.0,\n    \"latency\": 0.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.932316+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.932316+00:00\",\n    \"provider_model_id_used\": \"llama-3.2-11b-instruct\",\n    \"model_name\": \"Llama 3.2 11B Instruct\",\n    
\"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 334,\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 89,\n    \"output_cents_per_million_tokens\": 89,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 50.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.930486+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.930486+00:00\",\n    \"provider_model_id_used\": \"llama-3.2-90b-instruct\",\n    \"model_name\": \"Llama 3.2 90B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 339,\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 89,\n    \"output_cents_per_million_tokens\": 89,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 197.0,\n    \"latency\": 0.65,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    
\"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.939993+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.939993+00:00\",\n    \"provider_model_id_used\": \"llama-3.3-70b-instruct\",\n    \"model_name\": \"Llama 3.3 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 341,\n    \"model_id\": \"llama-4-maverick\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 22,\n    \"output_cents_per_million_tokens\": 88,\n    \"quantization\": null,\n    \"max_input_tokens\": 1000000,\n    \"max_output_tokens\": 1000000,\n    \"throughput\": 63.03,\n    \"latency\": 0.62,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.944370+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.944370+00:00\",\n    \"provider_model_id_used\": \"llama-4-maverick\",\n    \"model_name\": \"Llama 4 Maverick\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 342,\n    \"model_id\": \"llama-4-scout\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 10000000,\n    \"max_output_tokens\": 10000000,\n    
\"throughput\": 116.1,\n    \"latency\": 0.53,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.946725+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.946725+00:00\",\n    \"provider_model_id_used\": \"llama-4-scout\",\n    \"model_name\": \"Llama 4 Scout\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 337,\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 89,\n    \"output_cents_per_million_tokens\": 89,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 59.0,\n    \"latency\": 0.37,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.936092+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.936092+00:00\",\n    \"provider_model_id_used\": \"qwen-2.5-72b-instruct\",\n    \"model_name\": \"Qwen2.5 72B Instruct\",\n    
\"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 336,\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 89,\n    \"output_cents_per_million_tokens\": 89,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 110.0,\n    \"latency\": 0.26,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.934183+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.934183+00:00\",\n    \"provider_model_id_used\": \"qwen-2.5-coder-32b-instruct\",\n    \"model_name\": \"Qwen2.5-Coder 32B Instruct\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 343,\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 10,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 68.17,\n    \"latency\": 0.78,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    
\"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.949833+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.949833+00:00\",\n    \"provider_model_id_used\": \"qwen3-235b-a22b\",\n    \"model_name\": \"Qwen3 235B A22B\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 344,\n    \"model_id\": \"qwen3-30b-a3b\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 89,\n    \"output_cents_per_million_tokens\": 89,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 122.4,\n    \"latency\": 0.66,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.951886+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.951886+00:00\",\n    \"provider_model_id_used\": \"qwen3-30b-a3b\",\n    \"model_name\": \"Qwen3 30B A3B\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 338,\n    \"model_id\": \"qwq-32b-preview\",\n    \"provider_id\": \"fireworks\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 89,\n    \"output_cents_per_million_tokens\": 89,\n    \"quantization\": null,\n    \"max_input_tokens\": 32768,\n    \"max_output_tokens\": 32768,\n    \"throughput\": 99.15,\n    \"latency\": 
0.53,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.937841+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.937841+00:00\",\n    \"provider_model_id_used\": \"qwq-32b-preview\",\n    \"model_name\": \"QwQ-32B-Preview\",\n    \"organization_id\": \"qwen\"\n  }\n]"
  },
  {
    "path": "data/providers/fireworks/provider.json",
    "content": "{\n  \"provider_id\": \"fireworks\",\n  \"name\": \"Fireworks\",\n  \"website\": \"https://fireworks.ai/\",\n  \"created_at\": \"2025-07-19T19:49:16.921865+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:16.921865+00:00\"\n}"
  },
  {
    "path": "data/providers/google/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 318,\n    \"model_id\": \"claude-3-5-haiku-20241022\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 80,\n    \"output_cents_per_million_tokens\": 400,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.896052+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.896052+00:00\",\n    \"provider_model_id_used\": \"claude-3-5-haiku-20241022\",\n    \"model_name\": \"Claude 3.5 Haiku\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 320,\n    \"model_id\": \"claude-3-5-sonnet-20240620\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": 
false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.900161+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.900161+00:00\",\n    \"provider_model_id_used\": \"claude-3-5-sonnet-20240620\",\n    \"model_name\": \"Claude 3.5 Sonnet\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 319,\n    \"model_id\": \"claude-3-5-sonnet-20241022\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.898073+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.898073+00:00\",\n    \"provider_model_id_used\": \"claude-3-5-sonnet-20241022\",\n    \"model_name\": \"Claude 3.5 Sonnet\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 327,\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 128000,\n    
\"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.914565+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.914565+00:00\",\n    \"provider_model_id_used\": \"claude-3-7-sonnet-20250219\",\n    \"model_name\": \"Claude 3.7 Sonnet\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 328,\n    \"model_id\": \"claude-3-haiku-20240307\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 25,\n    \"output_cents_per_million_tokens\": 125,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.916491+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.916491+00:00\",\n    \"provider_model_id_used\": \"claude-3-haiku-20240307\",\n    \"model_name\": \"Claude 3 
Haiku\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 322,\n    \"model_id\": \"claude-3-opus-20240229\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 7500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.903705+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.903705+00:00\",\n    \"provider_model_id_used\": \"claude-3-opus-20240229\",\n    \"model_name\": \"Claude 3 Opus\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 321,\n    \"model_id\": \"claude-3-sonnet-20240229\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 200000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    
\"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.902100+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.902100+00:00\",\n    \"provider_model_id_used\": \"claude-3-sonnet-20240229\",\n    \"model_name\": \"Claude 3 Sonnet\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 330,\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 7500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.920504+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.920504+00:00\",\n    \"provider_model_id_used\": \"claude-opus-4-20250514\",\n    \"model_name\": \"Claude Opus 4\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 331,\n    \"model_id\": \"claude-opus-4-1-20250805\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 7500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 32000,\n 
   \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"claude-opus-4-1-20250805\",\n    \"model_name\": \"Claude Opus 4.1\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 329,\n    \"model_id\": \"claude-sonnet-4-20250514\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.918456+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.918456+00:00\",\n    \"provider_model_id_used\": \"claude-sonnet-4-20250514\",\n    \"model_name\": \"Claude 
Sonnet 4\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 312,\n    \"model_id\": \"gemini-1.0-pro\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 50,\n    \"output_cents_per_million_tokens\": 150,\n    \"quantization\": null,\n    \"max_input_tokens\": 32760,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 120.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.882424+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.882424+00:00\",\n    \"provider_model_id_used\": \"gemini-1.0-pro\",\n    \"model_name\": \"Gemini 1.0 Pro\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 313,\n    \"model_id\": \"gemini-1.5-flash\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 1048576,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 150.0,\n    \"latency\": 0.3,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": 
false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.885387+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.885387+00:00\",\n    \"provider_model_id_used\": \"gemini-1.5-flash\",\n    \"model_name\": \"Gemini 1.5 Flash\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 314,\n    \"model_id\": \"gemini-1.5-flash-8b\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 7,\n    \"output_cents_per_million_tokens\": 30,\n    \"quantization\": null,\n    \"max_input_tokens\": 1048576,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 150.0,\n    \"latency\": 0.3,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.887626+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.887626+00:00\",\n    \"provider_model_id_used\": \"gemini-1.5-flash-8b\",\n    \"model_name\": \"Gemini 1.5 Flash 8B\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 311,\n    \"model_id\": \"gemini-1.5-pro\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 250,\n    \"output_cents_per_million_tokens\": 1000,\n    \"quantization\": null,\n    \"max_input_tokens\": 2097152,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 85.0,\n    \"latency\": 0.7,\n    
\"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.880526+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.880526+00:00\",\n    \"provider_model_id_used\": \"gemini-1.5-pro\",\n    \"model_name\": \"Gemini 1.5 Pro\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 310,\n    \"model_id\": \"gemini-2.0-flash\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 1048576,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 183.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.878419+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.878419+00:00\",\n    \"provider_model_id_used\": \"gemini-2.0-flash\",\n    \"model_name\": \"Gemini 2.0 Flash\",\n    \"organization_id\": \"google\"\n  },\n  {\n    
\"model_provider_id\": 309,\n    \"model_id\": \"gemini-2.0-flash-lite\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 7,\n    \"output_cents_per_million_tokens\": 30,\n    \"quantization\": null,\n    \"max_input_tokens\": 1048576,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 85.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.876262+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.876262+00:00\",\n    \"provider_model_id_used\": \"gemini-2.0-flash-lite\",\n    \"model_name\": \"Gemini 2.0 Flash-Lite\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 306,\n    \"model_id\": \"gemini-2.5-flash\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 30,\n    \"output_cents_per_million_tokens\": 250,\n    \"quantization\": null,\n    \"max_input_tokens\": 1048576,\n    \"max_output_tokens\": 65536,\n    \"throughput\": 85.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    
\"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.868859+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.868859+00:00\",\n    \"provider_model_id_used\": \"gemini-2.5-flash\",\n    \"model_name\": \"Gemini 2.5 Flash\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 305,\n    \"model_id\": \"gemini-2.5-flash-lite\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 1048576,\n    \"max_output_tokens\": 65536,\n    \"throughput\": 5.69,\n    \"latency\": 0.44,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.866570+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.866570+00:00\",\n    \"provider_model_id_used\": \"gemini-2.5-flash-lite\",\n    \"model_name\": \"Gemini 2.5 Flash-Lite\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 307,\n    \"model_id\": \"gemini-2.5-pro\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 125,\n    \"output_cents_per_million_tokens\": 1000,\n    \"quantization\": null,\n    \"max_input_tokens\": 1048576,\n    \"max_output_tokens\": 65536,\n    \"throughput\": 85.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    
\"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.871063+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.871063+00:00\",\n    \"provider_model_id_used\": \"gemini-2.5-pro\",\n    \"model_name\": \"Gemini 2.5 Pro\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 308,\n    \"model_id\": \"gemini-2.5-pro-preview-06-05\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 125,\n    \"output_cents_per_million_tokens\": 1000,\n    \"quantization\": null,\n    \"max_input_tokens\": 1048576,\n    \"max_output_tokens\": 65535,\n    \"throughput\": 85.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.873667+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.873667+00:00\",\n    \"provider_model_id_used\": \"gemini-2.5-pro-preview-06-05\",\n    \"model_name\": \"Gemini 2.5 Pro Preview 06-05\",\n    \"organization_id\": \"google\"\n  },\n  {\n    
\"model_provider_id\": 316,\n    \"model_id\": \"jamba-1.5-large\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 200,\n    \"output_cents_per_million_tokens\": 800,\n    \"quantization\": null,\n    \"max_input_tokens\": 256000,\n    \"max_output_tokens\": 256000,\n    \"throughput\": 42.0,\n    \"latency\": 0.3,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.891518+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.891518+00:00\",\n    \"provider_model_id_used\": \"jamba-1.5-large\",\n    \"model_name\": \"Jamba 1.5 Large\",\n    \"organization_id\": \"ai21\"\n  },\n  {\n    \"model_provider_id\": 317,\n    \"model_id\": \"jamba-1.5-mini\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 256144,\n    \"max_output_tokens\": 256144,\n    \"throughput\": 100.0,\n    \"latency\": 0.3,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    
\"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.893779+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.893779+00:00\",\n    \"provider_model_id_used\": \"jamba-1.5-mini\",\n    \"model_name\": \"Jamba 1.5 Mini\",\n    \"organization_id\": \"ai21\"\n  },\n  {\n    \"model_provider_id\": 323,\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 500,\n    \"output_cents_per_million_tokens\": 1600,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.905332+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.905332+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-405b-instruct\",\n    \"model_name\": \"Llama 3.1 405B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 324,\n    \"model_id\": \"mistral-large-2-2407\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 200,\n    \"output_cents_per_million_tokens\": 600,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    
\"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.907260+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.907260+00:00\",\n    \"provider_model_id_used\": \"mistral-large-2-2407\",\n    \"model_name\": \"Mistral Large 2\",\n    \"organization_id\": \"mistral\"\n  },\n  {\n    \"model_provider_id\": 325,\n    \"model_id\": \"mistral-nemo-instruct-2407\",\n    \"provider_id\": \"google\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 15,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.909863+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.909863+00:00\",\n    \"provider_model_id_used\": \"mistral-nemo-instruct-2407\",\n    \"model_name\": \"Mistral NeMo Instruct\",\n    \"organization_id\": \"mistral\"\n  }\n]\n"
  },
  {
    "path": "data/providers/google/provider.json",
    "content": "{\n  \"provider_id\": \"google\",\n  \"name\": \"Google\",\n  \"website\": \"https://ai.google.dev\",\n  \"created_at\": \"2025-07-19T19:49:16.864633+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:16.864633+00:00\"\n}\n"
  },
  {
    "path": "data/providers/groq/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 345,\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"provider_id\": \"groq\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 59,\n    \"output_cents_per_million_tokens\": 78,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 250.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.955618+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.955618+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-70b-instruct\",\n    \"model_name\": \"Llama 3.1 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 346,\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"provider_id\": \"groq\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 5,\n    \"output_cents_per_million_tokens\": 8,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 750.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    
\"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.957463+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.957463+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-8b-instruct\",\n    \"model_name\": \"Llama 3.1 8B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 347,\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"provider_id\": \"groq\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 18,\n    \"output_cents_per_million_tokens\": 18,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.959974+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.959974+00:00\",\n    \"provider_model_id_used\": \"llama-3.2-11b-instruct\",\n    \"model_name\": \"Llama 3.2 11B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 348,\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"provider_id\": \"groq\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 59,\n    \"output_cents_per_million_tokens\": 79,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 268.0,\n    \"latency\":
0.65,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.962122+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.962122+00:00\",\n    \"provider_model_id_used\": \"llama-3.3-70b-instruct\",\n    \"model_name\": \"Llama 3.3 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 349,\n    \"model_id\": \"llama-4-maverick\",\n    \"provider_id\": \"groq\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 1000000,\n    \"max_output_tokens\": 1000000,\n    \"throughput\": 307.3,\n    \"latency\": 0.27,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.963701+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.963701+00:00\",\n    \"provider_model_id_used\": \"llama-4-maverick\",\n    \"model_name\": \"Llama 4 Maverick\",\n    \"organization_id\": \"meta\"\n  },\n  
{\n    \"model_provider_id\": 350,\n    \"model_id\": \"llama-4-scout\",\n    \"provider_id\": \"groq\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 11,\n    \"output_cents_per_million_tokens\": 34,\n    \"quantization\": null,\n    \"max_input_tokens\": 10000000,\n    \"max_output_tokens\": 10000000,\n    \"throughput\": 776.1,\n    \"latency\": 1.08,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.965756+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.965756+00:00\",\n    \"provider_model_id_used\": \"llama-4-scout\",\n    \"model_name\": \"Llama 4 Scout\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 1231,\n    \"model_id\": \"gpt-oss-120b\",\n    \"provider_id\": \"groq\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 131000,\n    \"max_output_tokens\": 30000,\n    \"throughput\": 500,\n    \"latency\": 0.5,\n    \"feature_web_search\": true,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": true,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": 
false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-08-05T19:49:16.965756+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:16.965756+00:00\",\n    \"provider_model_id_used\": \"gpt-oss-120b\",\n    \"model_name\": \"OpenAI OSS 120B\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 1232,\n    \"model_id\": \"gpt-oss-20b\",\n    \"provider_id\": \"groq\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 50,\n    \"quantization\": null,\n    \"max_input_tokens\": 131000,\n    \"max_output_tokens\": 30000,\n    \"throughput\": 1000,\n    \"latency\": 0.38,\n    \"feature_web_search\": true,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": true,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-08-05T19:49:16.965756+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:16.965756+00:00\",\n    \"provider_model_id_used\": \"gpt-oss-20b\",\n    \"model_name\": \"OpenAI OSS 20B\",\n    \"organization_id\": \"openai\"\n  }\n]\n"
  },
  {
    "path": "data/providers/groq/provider.json",
    "content": "{\n  \"provider_id\": \"groq\",\n  \"name\": \"Groq\",\n  \"website\": \"https://groq.com/\",\n  \"created_at\": \"2025-07-19T19:49:16.953587+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:16.953587+00:00\"\n}"
  },
  {
    "path": "data/providers/hyperbolic/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 276,\n    \"model_id\": \"deepseek-v2.5\",\n    \"provider_id\": \"hyperbolic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 200,\n    \"output_cents_per_million_tokens\": 200,\n    \"quantization\": null,\n    \"max_input_tokens\": 8192,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.801424+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.801424+00:00\",\n    \"provider_model_id_used\": \"deepseek-v2.5\",\n    \"model_name\": \"DeepSeek-V2.5\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 272,\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"provider_id\": \"hyperbolic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 400,\n    \"output_cents_per_million_tokens\": 400,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 40.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    
\"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.788610+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.788610+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-405b-instruct\",\n    \"model_name\": \"Llama 3.1 405B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 271,\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"provider_id\": \"hyperbolic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 40,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.785874+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.785874+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-70b-instruct\",\n    \"model_name\": \"Llama 3.1 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 270,\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"provider_id\": \"hyperbolic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 10,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 200.0,\n    
\"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.783230+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.783230+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-8b-instruct\",\n    \"model_name\": \"Llama 3.1 8B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 273,\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"provider_id\": \"hyperbolic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 200,\n    \"output_cents_per_million_tokens\": 200,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.791634+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.791634+00:00\",\n    \"provider_model_id_used\": \"llama-3.2-90b-instruct\",\n    \"model_name\": \"Llama 3.2 90B Instruct\",\n    
\"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 278,\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"provider_id\": \"hyperbolic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 40,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.65,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.805164+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.805164+00:00\",\n    \"provider_model_id_used\": \"llama-3.3-70b-instruct\",\n    \"model_name\": \"Llama 3.3 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 274,\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"provider_id\": \"hyperbolic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 40,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    
\"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.795011+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.795011+00:00\",\n    \"provider_model_id_used\": \"qwen-2.5-72b-instruct\",\n    \"model_name\": \"Qwen2.5 72B Instruct\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 275,\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"provider_id\": \"hyperbolic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 20,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.798904+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.798904+00:00\",\n    \"provider_model_id_used\": \"qwen-2.5-coder-32b-instruct\",\n    \"model_name\": \"Qwen2.5-Coder 32B Instruct\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 277,\n    \"model_id\": \"qwq-32b-preview\",\n    \"provider_id\": \"hyperbolic\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 20,\n    \"quantization\": null,\n    \"max_input_tokens\": 32768,\n    \"max_output_tokens\": 
32768,\n    \"throughput\": 31.9,\n    \"latency\": 1.05,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.803353+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.803353+00:00\",\n    \"provider_model_id_used\": \"qwq-32b-preview\",\n    \"model_name\": \"QwQ-32B-Preview\",\n    \"organization_id\": \"qwen\"\n  }\n]"
  },
  {
    "path": "data/providers/hyperbolic/provider.json",
    "content": "{\n  \"provider_id\": \"hyperbolic\",\n  \"name\": \"Hyperbolic\",\n  \"website\": \"https://hyperbolic.xyz\",\n  \"created_at\": \"2025-07-19T19:49:16.780946+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:16.780946+00:00\"\n}"
  },
  {
    "path": "data/providers/lambda/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 390,\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"provider_id\": \"lambda\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 89,\n    \"output_cents_per_million_tokens\": 89,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.054217+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.054217+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-405b-instruct\",\n    \"model_name\": \"Llama 3.1 405B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 389,\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"provider_id\": \"lambda\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 20,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    
\"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.051981+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.051981+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-70b-instruct\",\n    \"model_name\": \"Llama 3.1 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 388,\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"provider_id\": \"lambda\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 3,\n    \"output_cents_per_million_tokens\": 3,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 42.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.050200+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.050200+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-8b-instruct\",\n    \"model_name\": \"Llama 3.1 8B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 391,\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"provider_id\": \"lambda\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 20,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 
0.65,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.056567+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.056567+00:00\",\n    \"provider_model_id_used\": \"llama-3.3-70b-instruct\",\n    \"model_name\": \"Llama 3.3 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 393,\n    \"model_id\": \"llama-4-maverick\",\n    \"provider_id\": \"lambda\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 18,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 1000000,\n    \"max_output_tokens\": 1000000,\n    \"throughput\": 93.69,\n    \"latency\": 0.65,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.060734+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.060734+00:00\",\n    \"provider_model_id_used\": \"llama-4-maverick\",\n    \"model_name\": \"Llama 4 Maverick\",\n    \"organization_id\": \"meta\"\n  },\n  
{\n    \"model_provider_id\": 394,\n    \"model_id\": \"llama-4-scout\",\n    \"provider_id\": \"lambda\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 8,\n    \"output_cents_per_million_tokens\": 30,\n    \"quantization\": null,\n    \"max_input_tokens\": 10000000,\n    \"max_output_tokens\": 10000000,\n    \"throughput\": 139.7,\n    \"latency\": 0.43,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.062783+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.062783+00:00\",\n    \"provider_model_id_used\": \"llama-4-scout\",\n    \"model_name\": \"Llama 4 Scout\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 392,\n    \"model_id\": \"qwen-2.5-coder-32b-instruct\",\n    \"provider_id\": \"lambda\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 9,\n    \"output_cents_per_million_tokens\": 9,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    
\"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.058608+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.058608+00:00\",\n    \"provider_model_id_used\": \"qwen-2.5-coder-32b-instruct\",\n    \"model_name\": \"Qwen2.5-Coder 32B Instruct\",\n    \"organization_id\": \"qwen\"\n  }\n]"
  },
  {
    "path": "data/providers/lambda/provider.json",
    "content": "{\n  \"provider_id\": \"lambda\",\n  \"name\": \"Lambda\",\n  \"website\": \"https://lambdalabs.com\",\n  \"created_at\": \"2025-07-19T19:49:17.048564+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:17.048564+00:00\"\n}"
  },
  {
    "path": "data/providers/mistral/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 408,\n    \"model_id\": \"devstral-medium-2507\",\n    \"provider_id\": \"mistral\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 40,\n    \"output_cents_per_million_tokens\": 200,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 137.1,\n    \"latency\": 0.23,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.098942+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.098942+00:00\",\n    \"provider_model_id_used\": \"devstral-medium-2507\",\n    \"model_name\": \"Devstral Medium\",\n    \"organization_id\": \"mistral\"\n  },\n  {\n    \"model_provider_id\": 409,\n    \"model_id\": \"devstral-small-2507\",\n    \"provider_id\": \"mistral\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 30,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 137.1,\n    \"latency\": 0.23,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    
\"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.100512+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.100512+00:00\",\n    \"provider_model_id_used\": \"devstral-small-2507\",\n    \"model_name\": \"Devstral Small 1.1\",\n    \"organization_id\": \"mistral\"\n  },\n  {\n    \"model_provider_id\": 415,\n    \"model_id\": \"ministral-8b-instruct-2410\",\n    \"provider_id\": \"mistral\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 10,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 0.1,\n    \"latency\": 0.18,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.113059+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.113059+00:00\",\n    \"provider_model_id_used\": \"ministral-8b-instruct-2410\",\n    \"model_name\": \"Ministral 8B Instruct\",\n    \"organization_id\": \"mistral\"\n  },\n  {\n    \"model_provider_id\": 412,\n    \"model_id\": \"mistral-large-2-2407\",\n    \"provider_id\": \"mistral\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 200,\n    \"output_cents_per_million_tokens\": 600,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 0.1,\n    
\"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.106626+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.106626+00:00\",\n    \"provider_model_id_used\": \"mistral-large-2-2407\",\n    \"model_name\": \"Mistral Large 2\",\n    \"organization_id\": \"mistral\"\n  },\n  {\n    \"model_provider_id\": 417,\n    \"model_id\": \"mistral-nemo-instruct-2407\",\n    \"provider_id\": \"mistral\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 15,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 0.1,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.116560+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.116560+00:00\",\n    \"provider_model_id_used\": \"mistral-nemo-instruct-2407\",\n    \"model_name\": \"Mistral NeMo Instruct\",\n    
\"organization_id\": \"mistral\"\n  },\n  {\n    \"model_provider_id\": 414,\n    \"model_id\": \"mistral-small-2409\",\n    \"provider_id\": \"mistral\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 32768,\n    \"max_output_tokens\": 32768,\n    \"throughput\": 0.1,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.111268+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.111268+00:00\",\n    \"provider_model_id_used\": \"mistral-small-2409\",\n    \"model_name\": \"Mistral Small\",\n    \"organization_id\": \"mistral\"\n  },\n  {\n    \"model_provider_id\": 419,\n    \"model_id\": \"mistral-small-24b-instruct-2501\",\n    \"provider_id\": \"mistral\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 30,\n    \"quantization\": null,\n    \"max_input_tokens\": 32000,\n    \"max_output_tokens\": 32000,\n    \"throughput\": 134.0,\n    \"latency\": 0.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": 
false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.120575+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.120575+00:00\",\n    \"provider_model_id_used\": \"mistral-small-24b-instruct-2501\",\n    \"model_name\": \"Mistral Small 3 24B Instruct\",\n    \"organization_id\": \"mistral\"\n  },\n  {\n    \"model_provider_id\": 410,\n    \"model_id\": \"mistral-small-3.1-24b-base-2503\",\n    \"provider_id\": \"mistral\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 30,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 137.1,\n    \"latency\": 0.23,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.102773+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.102773+00:00\",\n    \"provider_model_id_used\": \"mistral-small-3.1-24b-base-2503\",\n    \"model_name\": \"Mistral Small 3.1 24B Base\",\n    \"organization_id\": \"mistral\"\n  },\n  {\n    \"model_provider_id\": 416,\n    \"model_id\": \"pixtral-12b-2409\",\n    \"provider_id\": \"mistral\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 15,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 
8192,\n    \"throughput\": 0.1,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.114646+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.114646+00:00\",\n    \"provider_model_id_used\": \"pixtral-12b-2409\",\n    \"model_name\": \"Pixtral-12B\",\n    \"organization_id\": \"mistral\"\n  },\n  {\n    \"model_provider_id\": 413,\n    \"model_id\": \"pixtral-large\",\n    \"provider_id\": \"mistral\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 200,\n    \"output_cents_per_million_tokens\": 600,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 0.1,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.108807+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.108807+00:00\",\n    \"provider_model_id_used\": \"pixtral-large\",\n    \"model_name\": \"Pixtral Large\",\n    
\"organization_id\": \"mistral\"\n  }\n]\n"
  },
  {
    "path": "data/providers/mistral/provider.json",
    "content": "{\n  \"provider_id\": \"mistral\",\n  \"name\": \"Mistral AI\",\n  \"website\": \"https://mistral.ai\",\n  \"created_at\": \"2025-07-19T19:49:17.096952+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:17.096952+00:00\"\n}\n"
  },
  {
    "path": "data/providers/novita/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 359,\n    \"model_id\": \"qwen3-235b-a22b-instruct-2507\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 80,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 16384,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"qwen/qwen3-235b-a22b-instruct-2507\",\n    \"model_name\": \"Qwen3-235B-A22B-Instruct-2507\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 360,\n    \"model_id\": \"gpt-oss-20b\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 5,\n    \"output_cents_per_million_tokens\": 20,\n    \"quantization\": \"bf16\",\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 32768,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    
\"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"openai/gpt-oss-20b\",\n    \"model_name\": \"GPT-OSS-20B\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 364,\n    \"model_id\": \"gpt-oss-120b\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 50,\n    \"quantization\": \"bf16\",\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"openai/gpt-oss-120b\",\n    \"model_name\": \"GPT-OSS-120B\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 361,\n    \"model_id\": \"qwen3-235b-a22b-thinking-2507\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 30,\n    \"output_cents_per_million_tokens\": 300,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 256000,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 
null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"qwen/qwen3-235b-a22b-thinking-2507\",\n    \"model_name\": \"Qwen3-235B-A22B-Thinking-2507\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 362,\n    \"model_id\": \"deepseek-v3-0324\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 28,\n    \"output_cents_per_million_tokens\": 114,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 163840,\n    \"max_output_tokens\": 163840,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"deepseek/deepseek-v3-0324\",\n    \"model_name\": 
\"DeepSeek-V3-0324\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 363,\n    \"model_id\": \"deepseek-v3.1\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 27,\n    \"output_cents_per_million_tokens\": 100,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 163840,\n    \"max_output_tokens\": 163840,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"deepseek/deepseek-v3.1\",\n    \"model_name\": \"DeepSeek V3.1\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 357,\n    \"model_id\": \"deepseek-r1-0528\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 70,\n    \"output_cents_per_million_tokens\": 250,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 37.96,\n    \"latency\": 1.18,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    
\"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.982118+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.982118+00:00\",\n    \"provider_model_id_used\": \"deepseek-r1-0528\",\n    \"model_name\": \"DeepSeek-R1-0528\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 351,\n    \"model_id\": \"gemma-3-27b-it\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 11,\n    \"output_cents_per_million_tokens\": 20,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 33.0,\n    \"latency\": 0.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.969199+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.969199+00:00\",\n    \"provider_model_id_used\": \"gemma-3-27b-it\",\n    \"model_name\": \"Gemma 3 27B\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 358,\n    \"model_id\": \"kimi-k2-instruct\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 57,\n    \"output_cents_per_million_tokens\": 230,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 45.0,\n    
\"latency\": 0.95,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.984536+00:00\",\n    \"updated_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"moonshotai/kimi-k2-instruct\",\n    \"model_name\": \"Kimi K2 Instruct\",\n    \"organization_id\": \"moonshotai\"\n  },\n  {\n    \"model_provider_id\": 365,\n    \"model_id\": \"kimi-k2-0905\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 60,\n    \"output_cents_per_million_tokens\": 250,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 262144,\n    \"max_output_tokens\": 262144,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"moonshotai/kimi-k2-0905\",\n    \"model_name\": \"Kimi K2 0905\",\n    
\"organization_id\": \"moonshotai\"\n  },\n  {\n    \"model_provider_id\": 366,\n    \"model_id\": \"qwen3-next-80b-a3b-thinking\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 150,\n    \"quantization\": \"bf16\",\n    \"max_input_tokens\": 65536,\n    \"max_output_tokens\": 65536,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"qwen/qwen3-next-80b-a3b-thinking\",\n    \"model_name\": \"Qwen3 Next 80B A3B Thinking\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 367,\n    \"model_id\": \"qwen3-next-80b-a3b-instruct\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 150,\n    \"quantization\": \"bf16\",\n    \"max_input_tokens\": 65536,\n    \"max_output_tokens\": 65536,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    
\"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-14T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"qwen/qwen3-next-80b-a3b-instruct\",\n    \"model_name\": \"Qwen3 Next 80B A3B Instruct\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 352,\n    \"model_id\": \"llama-4-maverick\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 17,\n    \"output_cents_per_million_tokens\": 85,\n    \"quantization\": null,\n    \"max_input_tokens\": 1000000,\n    \"max_output_tokens\": 1000000,\n    \"throughput\": 69.42,\n    \"latency\": 0.62,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.970871+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.970871+00:00\",\n    \"provider_model_id_used\": \"llama-4-maverick\",\n    \"model_name\": \"Llama 4 Maverick\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 353,\n    \"model_id\": \"llama-4-scout\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 50,\n    \"quantization\": null,\n    \"max_input_tokens\": 10000000,\n    
\"max_output_tokens\": 10000000,\n    \"throughput\": 69.82,\n    \"latency\": 0.85,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.972719+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.972719+00:00\",\n    \"provider_model_id_used\": \"llama-4-scout\",\n    \"model_name\": \"Llama 4 Scout\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 354,\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 80,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 38.51,\n    \"latency\": 1.02,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.975233+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.975233+00:00\",\n    \"provider_model_id_used\": \"qwen3-235b-a22b\",\n    \"model_name\": \"Qwen3 
235B A22B\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 356,\n    \"model_id\": \"qwen3-30b-a3b\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 44,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 88.84,\n    \"latency\": 0.73,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.980126+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.980126+00:00\",\n    \"provider_model_id_used\": \"qwen3-30b-a3b\",\n    \"model_name\": \"Qwen3 30B A3B\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 355,\n    \"model_id\": \"qwen3-32b\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 44,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 32.43,\n    \"latency\": 0.93,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    
\"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.977464+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.977464+00:00\",\n    \"provider_model_id_used\": \"qwen3-32b\",\n    \"model_name\": \"Qwen3 32B\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 368,\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 27,\n    \"output_cents_per_million_tokens\": 41,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 163840,\n    \"max_output_tokens\": 65536,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"deepseek/deepseek-v3.2-exp\",\n    \"model_name\": \"DeepSeek V3.2 Exp\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 369,\n    \"model_id\": \"glm-4.5\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 60,\n    \"output_cents_per_million_tokens\": 220,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 98304,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": 
false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"zai-org/glm-4.5\",\n    \"model_name\": \"GLM-4.5\",\n    \"organization_id\": \"zai-org\"\n  },\n  {\n    \"model_provider_id\": 370,\n    \"model_id\": \"glm-4.5v\",\n    \"provider_id\": \"novita\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 60,\n    \"output_cents_per_million_tokens\": 220,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 65536,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": true,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"zai-org/GLM-4.5V\",\n    \"model_name\": \"GLM-4.5V\",\n    \"organization_id\": \"zai-org\"\n  }\n]\n"
  },
  {
    "path": "data/providers/novita/provider.json",
    "content": "{\n  \"provider_id\": \"novita\",\n  \"name\": \"Novita\",\n  \"website\": \"https://novita.ai/\",\n  \"created_at\": \"2025-07-19T19:49:16.967182+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:16.967182+00:00\"\n}"
  },
  {
    "path": "data/providers/openai/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 422,\n    \"model_id\": \"gpt-3.5-turbo-0125\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 50,\n    \"output_cents_per_million_tokens\": 150,\n    \"quantization\": null,\n    \"max_input_tokens\": 16385,\n    \"max_output_tokens\": 4096,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.128446+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.128446+00:00\",\n    \"provider_model_id_used\": \"gpt-3.5-turbo-0125\",\n    \"model_name\": \"GPT-3.5 Turbo\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 420,\n    \"model_id\": \"gpt-4-0613\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 3000,\n    \"output_cents_per_million_tokens\": 6000,\n    \"quantization\": null,\n    \"max_input_tokens\": 32768,\n    \"max_output_tokens\": 32768,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": 
true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.123888+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.123888+00:00\",\n    \"provider_model_id_used\": \"gpt-4-0613\",\n    \"model_name\": \"GPT-4\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 430,\n    \"model_id\": \"gpt-4.1-2025-04-14\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 200,\n    \"output_cents_per_million_tokens\": 800,\n    \"quantization\": null,\n    \"max_input_tokens\": 1047576,\n    \"max_output_tokens\": 32768,\n    \"throughput\": 100.0,\n    \"latency\": 10.0,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.150851+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.150851+00:00\",\n    \"provider_model_id_used\": \"gpt-4.1-2025-04-14\",\n    \"model_name\": \"GPT-4.1\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 431,\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 40,\n    \"output_cents_per_million_tokens\": 160,\n    \"quantization\": null,\n    \"max_input_tokens\": 1047576,\n    \"max_output_tokens\": 32768,\n    \"throughput\": 150.0,\n    \"latency\": 5.0,\n    \"feature_web_search\": false,\n    
\"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.152948+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.152948+00:00\",\n    \"provider_model_id_used\": \"gpt-4.1-mini-2025-04-14\",\n    \"model_name\": \"GPT-4.1 mini\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 432,\n    \"model_id\": \"gpt-4.1-nano-2025-04-14\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 1047576,\n    \"max_output_tokens\": 32768,\n    \"throughput\": 200.0,\n    \"latency\": 2.0,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.154798+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.154798+00:00\",\n    \"provider_model_id_used\": \"gpt-4.1-nano-2025-04-14\",\n    \"model_name\": \"GPT-4.1 nano\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 429,\n    
\"model_id\": \"gpt-4.5\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 7500,\n    \"output_cents_per_million_tokens\": 15000,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 4096,\n    \"throughput\": 50.0,\n    \"latency\": 20.0,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.148982+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.148982+00:00\",\n    \"provider_model_id_used\": \"gpt-4.5\",\n    \"model_name\": \"GPT-4.5\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 424,\n    \"model_id\": \"gpt-4o-2024-05-13\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 250,\n    \"output_cents_per_million_tokens\": 1000,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 4096,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": 
false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.132398+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.132398+00:00\",\n    \"provider_model_id_used\": \"gpt-4o-2024-05-13\",\n    \"model_name\": \"GPT-4o\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 423,\n    \"model_id\": \"gpt-4o-2024-08-06\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 250,\n    \"output_cents_per_million_tokens\": 1000,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 16384,\n    \"throughput\": 132.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.130542+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.130542+00:00\",\n    \"provider_model_id_used\": \"gpt-4o-2024-08-06\",\n    \"model_name\": \"GPT-4o\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 421,\n    \"model_id\": \"gpt-4-turbo-2024-04-09\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1000,\n    \"output_cents_per_million_tokens\": 3000,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 4096,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    
\"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.126193+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.126193+00:00\",\n    \"provider_model_id_used\": \"gpt-4-turbo-2024-04-09\",\n    \"model_name\": \"GPT-4 Turbo\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 426,\n    \"model_id\": \"o1-2024-12-17\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 6000,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 100000,\n    \"throughput\": 66.0,\n    \"latency\": 16.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.136375+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.136375+00:00\",\n    \"provider_model_id_used\": \"o1-2024-12-17\",\n    \"model_name\": \"o1\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 427,\n    \"model_id\": \"o1-mini\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    
\"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1200,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 65536,\n    \"throughput\": 115.0,\n    \"latency\": 5.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.137957+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.137957+00:00\",\n    \"provider_model_id_used\": \"o1-mini\",\n    \"model_name\": \"o1-mini\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 425,\n    \"model_id\": \"o1-preview\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 6000,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 32768,\n    \"throughput\": 66.0,\n    \"latency\": 16.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": 
\"2025-07-19T19:49:17.134477+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.134477+00:00\",\n    \"provider_model_id_used\": \"o1-preview\",\n    \"model_name\": \"o1-preview\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 433,\n    \"model_id\": \"o3-2025-04-16\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 200,\n    \"output_cents_per_million_tokens\": 800,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 100000,\n    \"throughput\": 50.0,\n    \"latency\": 20.0,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.156370+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.156370+00:00\",\n    \"provider_model_id_used\": \"o3-2025-04-16\",\n    \"model_name\": \"o3\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 428,\n    \"model_id\": \"o3-mini\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 110,\n    \"output_cents_per_million_tokens\": 440,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 100000,\n    \"throughput\": 115.0,\n    \"latency\": 5.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": 
false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.147026+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.147026+00:00\",\n    \"provider_model_id_used\": \"o3-mini\",\n    \"model_name\": \"o3-mini\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 435,\n    \"model_id\": \"o3-pro-2025-06-10\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 2000,\n    \"output_cents_per_million_tokens\": 8000,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 100000,\n    \"throughput\": 25.0,\n    \"latency\": 30.0,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.161549+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.161549+00:00\",\n    \"provider_model_id_used\": \"o3-pro-2025-06-10\",\n    \"model_name\": \"o3-pro\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 434,\n    \"model_id\": \"o4-mini\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 110,\n    \"output_cents_per_million_tokens\": 440,\n    \"quantization\": null,\n    
\"max_input_tokens\": 200000,\n    \"max_output_tokens\": 100000,\n    \"throughput\": 115.0,\n    \"latency\": 5.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.159618+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.159618+00:00\",\n    \"provider_model_id_used\": \"o4-mini\",\n    \"model_name\": \"o4-mini\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 434,\n    \"model_id\": \"gpt-oss-120b\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 50,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 115.0,\n    \"latency\": 5.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.159618+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.159618+00:00\",\n    \"provider_model_id_used\": \"gpt-oss-120b\",\n    
\"model_name\": \"GPT OSS 120B\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 434,\n    \"model_id\": \"gpt-oss-20b\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 50,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 115.0,\n    \"latency\": 5.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.159618+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.159618+00:00\",\n    \"provider_model_id_used\": \"gpt-oss-20b\",\n    \"model_name\": \"GPT OSS 20B\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 436,\n    \"model_id\": \"gpt-5-2025-08-07\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 125,\n    \"output_cents_per_million_tokens\": 1000,\n    \"quantization\": null,\n    \"max_input_tokens\": 400000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 2.0,\n    \"feature_web_search\": true,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": true,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": true,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    
\"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"gpt-5\",\n    \"model_name\": \"GPT-5\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 437,\n    \"model_id\": \"gpt-5-mini-2025-08-07\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 25,\n    \"output_cents_per_million_tokens\": 200,\n    \"quantization\": null,\n    \"max_input_tokens\": 400000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 200.0,\n    \"latency\": 1.0,\n    \"feature_web_search\": true,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": true,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": true,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"gpt-5-mini\",\n    \"model_name\": \"GPT-5 mini\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 438,\n    \"model_id\": \"gpt-5-nano-2025-08-07\",\n    \"provider_id\": \"openai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 5,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 400000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 500.0,\n    \"latency\": 0.3,\n    
\"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": true,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": true,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"gpt-5-nano\",\n    \"model_name\": \"GPT-5 nano\",\n    \"organization_id\": \"openai\"\n  }\n]\n"
  },
  {
    "path": "data/providers/openai/provider.json",
    "content": "{\n  \"provider_id\": \"openai\",\n  \"name\": \"OpenAI\",\n  \"website\": \"https://openai.com\",\n  \"created_at\": \"2025-07-19T19:49:17.121876+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:17.121876+00:00\"\n}\n"
  },
  {
    "path": "data/providers/replicate/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 396,\n    \"model_id\": \"deepseek-vl2\",\n    \"provider_id\": \"replicate\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 950,\n    \"output_cents_per_million_tokens\": 480000,\n    \"quantization\": null,\n    \"max_input_tokens\": 129280,\n    \"max_output_tokens\": 129280,\n    \"throughput\": 22.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.068077+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.068077+00:00\",\n    \"provider_model_id_used\": \"deepseek-vl2\",\n    \"model_name\": \"DeepSeek VL2\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 395,\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"provider_id\": \"replicate\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 950,\n    \"output_cents_per_million_tokens\": 950,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 22.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    
\"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.066199+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.066199+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-405b-instruct\",\n    \"model_name\": \"Llama 3.1 405B Instruct\",\n    \"organization_id\": \"meta\"\n  }\n]"
  },
  {
    "path": "data/providers/replicate/provider.json",
    "content": "{\n  \"provider_id\": \"replicate\",\n  \"name\": \"Replicate\",\n  \"website\": \"https://replicate.com/\",\n  \"created_at\": \"2025-07-19T19:49:17.064218+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:17.064218+00:00\"\n}"
  },
  {
    "path": "data/providers/sambanova/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 240,\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"provider_id\": \"sambanova\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 500,\n    \"output_cents_per_million_tokens\": 1000,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 74.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.702554+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.702554+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-70b-instruct\",\n    \"model_name\": \"Llama 3.1 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 239,\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"provider_id\": \"sambanova\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 20,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 1050.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": 
false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.699627+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.699627+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-8b-instruct\",\n    \"model_name\": \"Llama 3.1 8B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 241,\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"provider_id\": \"sambanova\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 30,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.705086+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.705086+00:00\",\n    \"provider_model_id_used\": \"llama-3.2-11b-instruct\",\n    \"model_name\": \"Llama 3.2 11B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 242,\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"provider_id\": \"sambanova\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 60,\n    \"output_cents_per_million_tokens\": 120,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 
1096.0,\n    \"latency\": 0.65,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.707534+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.707534+00:00\",\n    \"provider_model_id_used\": \"llama-3.3-70b-instruct\",\n    \"model_name\": \"Llama 3.3 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 243,\n    \"model_id\": \"llama-4-maverick\",\n    \"provider_id\": \"sambanova\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 63,\n    \"output_cents_per_million_tokens\": 179,\n    \"quantization\": null,\n    \"max_input_tokens\": 1000000,\n    \"max_output_tokens\": 1000000,\n    \"throughput\": 638.7,\n    \"latency\": 2.04,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.710100+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.710100+00:00\",\n    \"provider_model_id_used\": \"llama-4-maverick\",\n    \"model_name\": \"Llama 4 Maverick\",\n    
\"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 244,\n    \"model_id\": \"qwen3-32b\",\n    \"provider_id\": \"sambanova\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 40,\n    \"output_cents_per_million_tokens\": 80,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 327.7,\n    \"latency\": 1.08,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.712669+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.712669+00:00\",\n    \"provider_model_id_used\": \"qwen3-32b\",\n    \"model_name\": \"Qwen3 32B\",\n    \"organization_id\": \"qwen\"\n  }\n]"
  },
  {
    "path": "data/providers/sambanova/provider.json",
    "content": "{\n  \"provider_id\": \"sambanova\",\n  \"name\": \"Sambanova\",\n  \"website\": \"https://sambanova.ai/\",\n  \"created_at\": \"2025-07-19T19:49:16.697204+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:16.697204+00:00\"\n}"
  },
  {
    "path": "data/providers/together/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 255,\n    \"model_id\": \"deepseek-r1\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 700,\n    \"output_cents_per_million_tokens\": 700,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 4.0,\n    \"latency\": 0.6,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.738387+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.738387+00:00\",\n    \"provider_model_id_used\": \"deepseek-r1\",\n    \"model_name\": \"DeepSeek-R1\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 245,\n    \"model_id\": \"gemma-3n-e4b-it\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 2000,\n    \"output_cents_per_million_tokens\": 4000,\n    \"quantization\": null,\n    \"max_input_tokens\": 32000,\n    \"max_output_tokens\": 32000,\n    \"throughput\": 42.09,\n    \"latency\": 0.43,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": 
true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.716616+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.716616+00:00\",\n    \"provider_model_id_used\": \"gemma-3n-e4b-it\",\n    \"model_name\": \"Gemma 3n E4B Instructed\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 248,\n    \"model_id\": \"llama-3.1-405b-instruct\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 350,\n    \"output_cents_per_million_tokens\": 350,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 35.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.722263+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.722263+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-405b-instruct\",\n    \"model_name\": \"Llama 3.1 405B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 247,\n    \"model_id\": \"llama-3.1-70b-instruct\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 89,\n    \"output_cents_per_million_tokens\": 89,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 94.0,\n    \"latency\": 0.5,\n    
\"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.720699+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.720699+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-70b-instruct\",\n    \"model_name\": \"Llama 3.1 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 246,\n    \"model_id\": \"llama-3.1-8b-instruct\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 20,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 131072,\n    \"throughput\": 194.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.718652+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.718652+00:00\",\n    \"provider_model_id_used\": \"llama-3.1-8b-instruct\",\n    \"model_name\": \"Llama 3.1 8B Instruct\",\n    \"organization_id\": \"meta\"\n  
},\n  {\n    \"model_provider_id\": 249,\n    \"model_id\": \"llama-3.2-11b-instruct\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 18,\n    \"output_cents_per_million_tokens\": 18,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 168.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.724215+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.724215+00:00\",\n    \"provider_model_id_used\": \"llama-3.2-11b-instruct\",\n    \"model_name\": \"Llama 3.2 11B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 250,\n    \"model_id\": \"llama-3.2-90b-instruct\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 120,\n    \"output_cents_per_million_tokens\": 120,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 57.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    
\"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.726568+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.726568+00:00\",\n    \"provider_model_id_used\": \"llama-3.2-90b-instruct\",\n    \"model_name\": \"Llama 3.2 90B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 254,\n    \"model_id\": \"llama-3.3-70b-instruct\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 88,\n    \"output_cents_per_million_tokens\": 88,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 65.0,\n    \"latency\": 0.65,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.735754+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.735754+00:00\",\n    \"provider_model_id_used\": \"llama-3.3-70b-instruct\",\n    \"model_name\": \"Llama 3.3 70B Instruct\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 256,\n    \"model_id\": \"llama-4-maverick\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 27,\n    \"output_cents_per_million_tokens\": 85,\n    \"quantization\": null,\n    \"max_input_tokens\": 1000000,\n    \"max_output_tokens\": 1000000,\n    \"throughput\": 97.93,\n    
\"latency\": 0.2,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.740112+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.740112+00:00\",\n    \"provider_model_id_used\": \"llama-4-maverick\",\n    \"model_name\": \"Llama 4 Maverick\",\n    \"organization_id\": \"meta\"\n  },\n  {\n    \"model_provider_id\": 257,\n    \"model_id\": \"llama-4-scout\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 18,\n    \"output_cents_per_million_tokens\": 59,\n    \"quantization\": null,\n    \"max_input_tokens\": 10000000,\n    \"max_output_tokens\": 10000000,\n    \"throughput\": 106.9,\n    \"latency\": 0.54,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.742126+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.742126+00:00\",\n    \"provider_model_id_used\": \"llama-4-scout\",\n    \"model_name\": \"Llama 4 Scout\",\n    \"organization_id\": \"meta\"\n  },\n  {\n   
 \"model_provider_id\": 252,\n    \"model_id\": \"qwen-2.5-72b-instruct\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 120,\n    \"output_cents_per_million_tokens\": 120,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 47.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.731610+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.731610+00:00\",\n    \"provider_model_id_used\": \"qwen-2.5-72b-instruct\",\n    \"model_name\": \"Qwen2.5 72B Instruct\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 251,\n    \"model_id\": \"qwen-2.5-7b-instruct\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 30,\n    \"output_cents_per_million_tokens\": 30,\n    \"quantization\": null,\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 8192,\n    \"throughput\": 138.0,\n    \"latency\": 0.5,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n 
   \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.728846+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.728846+00:00\",\n    \"provider_model_id_used\": \"qwen-2.5-7b-instruct\",\n    \"model_name\": \"Qwen2.5 7B Instruct\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 258,\n    \"model_id\": \"qwen3-235b-a22b\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 23.74,\n    \"latency\": 0.79,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.746014+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.746014+00:00\",\n    \"provider_model_id_used\": \"qwen3-235b-a22b\",\n    \"model_name\": \"Qwen3 235B A22B\",\n    \"organization_id\": \"qwen\"\n  },\n  {\n    \"model_provider_id\": 253,\n    \"model_id\": \"qwq-32b-preview\",\n    \"provider_id\": \"together\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 120,\n    \"output_cents_per_million_tokens\": 120,\n    \"quantization\": null,\n    \"max_input_tokens\": 32768,\n    \"max_output_tokens\": 32768,\n    \"throughput\": 62.14,\n    \"latency\": 0.74,\n    \"feature_web_search\": false,\n    
\"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.733822+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.733822+00:00\",\n    \"provider_model_id_used\": \"qwq-32b-preview\",\n    \"model_name\": \"QwQ-32B-Preview\",\n    \"organization_id\": \"qwen\"\n  }\n]"
  },
  {
    "path": "data/providers/together/provider.json",
    "content": "{\n  \"provider_id\": \"together\",\n  \"name\": \"Together\",\n  \"website\": \"https://together.ai/\",\n  \"created_at\": \"2025-07-19T19:49:16.714534+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:16.714534+00:00\"\n}"
  },
  {
    "path": "data/providers/xai/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 363,\n    \"model_id\": \"grok-2\",\n    \"provider_id\": \"xai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 200,\n    \"output_cents_per_million_tokens\": 1000,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 8000,\n    \"throughput\": 85.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.997220+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.997220+00:00\",\n    \"provider_model_id_used\": \"grok-2\",\n    \"model_name\": \"Grok-2\",\n    \"organization_id\": \"xai\"\n  },\n  {\n    \"model_provider_id\": 364,\n    \"model_id\": \"grok-3\",\n    \"provider_id\": \"xai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 8000,\n    \"throughput\": 100.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n  
  \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:16.998872+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:16.998872+00:00\",\n    \"provider_model_id_used\": \"grok-3\",\n    \"model_name\": \"Grok-3\",\n    \"organization_id\": \"xai\"\n  },\n  {\n    \"model_provider_id\": 365,\n    \"model_id\": \"grok-3-mini\",\n    \"provider_id\": \"xai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 30,\n    \"output_cents_per_million_tokens\": 50,\n    \"quantization\": null,\n    \"max_input_tokens\": 128000,\n    \"max_output_tokens\": 8000,\n    \"throughput\": 100.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.000676+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.000676+00:00\",\n    \"provider_model_id_used\": \"grok-3-mini\",\n    \"model_name\": \"Grok-3 Mini\",\n    \"organization_id\": \"xai\"\n  },\n  {\n    \"model_provider_id\": 366,\n    \"model_id\": \"grok-4\",\n    \"provider_id\": \"xai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 256000,\n    \"max_output_tokens\": 8000,\n    \"throughput\": 100.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": 
false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.002399+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.002399+00:00\",\n    \"provider_model_id_used\": \"grok-4\",\n    \"model_name\": \"Grok-4\",\n    \"organization_id\": \"xai\"\n  },\n  {\n    \"model_provider_id\": 367,\n    \"model_id\": \"grok-code-fast-1\",\n    \"provider_id\": \"xai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    \"output_cents_per_million_tokens\": 150,\n    \"quantization\": null,\n    \"max_input_tokens\": 256000,\n    \"max_output_tokens\": 10000,\n    \"throughput\": 76.41,\n    \"latency\": 1.38,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-10-03T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-03T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"grok-code-fast-1\",\n    \"model_name\": \"Grok Code Fast 1\",\n    \"organization_id\": \"xai\"\n  },\n  {\n    \"model_provider_id\": 444,\n    \"model_id\": \"grok-4-fast\",\n    \"provider_id\": \"xai\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 20,\n    
\"output_cents_per_million_tokens\": 50,\n    \"quantization\": null,\n    \"max_input_tokens\": 2000000,\n    \"max_output_tokens\": 30000,\n    \"throughput\": 90,\n    \"latency\": null,\n    \"feature_web_search\": true,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": false,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-10-11T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"grok-4-fast\",\n    \"model_name\": \"Grok 4 Fast\",\n    \"organization_id\": \"xai\"\n  }\n]"
  },
  {
    "path": "data/providers/xai/provider.json",
    "content": "{\n  \"provider_id\": \"xai\",\n  \"name\": \"xAI\",\n  \"website\": \"https://docs.x.ai\",\n  \"created_at\": \"2025-07-19T19:49:16.995303+00:00\",\n  \"updated_at\": \"2025-07-19T19:49:16.995303+00:00\"\n}\n"
  },
  {
    "path": "data/providers/zeroeval/models.json",
    "content": "[\n  {\n    \"model_provider_id\": 441,\n    \"model_id\": \"claude-3-7-sonnet-20250219\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.176639+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.176639+00:00\",\n    \"provider_model_id_used\": \"claude-3-7-sonnet-20250219\",\n    \"model_name\": \"Claude 3.7 Sonnet\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 436,\n    \"model_id\": \"claude-opus-4-20250514\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 7500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    
\"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.165236+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.165236+00:00\",\n    \"provider_model_id_used\": \"claude-opus-4-20250514\",\n    \"model_name\": \"Claude Opus 4\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 437,\n    \"model_id\": \"claude-opus-4-1-20250805\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 1500,\n    \"output_cents_per_million_tokens\": 7500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 32000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-08-05T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"claude-opus-4-1-20250805\",\n    \"model_name\": \"Claude Opus 4.1\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 438,\n    \"model_id\": \"claude-sonnet-4-20250514\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 
128000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.170880+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.170880+00:00\",\n    \"provider_model_id_used\": \"claude-sonnet-4-20250514\",\n    \"model_name\": \"Claude Sonnet 4\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 442,\n    \"model_id\": \"gemini-2.5-flash\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 30,\n    \"output_cents_per_million_tokens\": 250,\n    \"quantization\": null,\n    \"max_input_tokens\": 1048576,\n    \"max_output_tokens\": 65536,\n    \"throughput\": 85.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.179386+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.179386+00:00\",\n    \"provider_model_id_used\": \"gemini-2.5-flash\",\n    \"model_name\": \"Gemini 2.5 
Flash\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 437,\n    \"model_id\": \"gemini-2.5-pro\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 125,\n    \"output_cents_per_million_tokens\": 1000,\n    \"quantization\": null,\n    \"max_input_tokens\": 1048576,\n    \"max_output_tokens\": 65536,\n    \"throughput\": 85.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.168497+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.168497+00:00\",\n    \"provider_model_id_used\": \"gemini-2.5-pro\",\n    \"model_name\": \"Gemini 2.5 Pro\",\n    \"organization_id\": \"google\"\n  },\n  {\n    \"model_provider_id\": 440,\n    \"model_id\": \"gpt-4.1-mini-2025-04-14\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 40,\n    \"output_cents_per_million_tokens\": 160,\n    \"quantization\": null,\n    \"max_input_tokens\": 1047576,\n    \"max_output_tokens\": 32768,\n    \"throughput\": 150.0,\n    \"latency\": 5.0,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    
\"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.174218+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.174218+00:00\",\n    \"provider_model_id_used\": \"gpt-4.1-mini-2025-04-14\",\n    \"model_name\": \"GPT-4.1 mini\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 439,\n    \"model_id\": \"grok-4\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 256000,\n    \"max_output_tokens\": 8000,\n    \"throughput\": 100.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.172505+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.172505+00:00\",\n    \"provider_model_id_used\": \"grok-4\",\n    \"model_name\": \"Grok-4\",\n    \"organization_id\": \"xai\"\n  },\n  {\n    \"model_provider_id\": 1231,\n    \"model_id\": \"gpt-oss-120b\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 15,\n    \"output_cents_per_million_tokens\": 60,\n    \"quantization\": null,\n    \"max_input_tokens\": 131000,\n    \"max_output_tokens\": 30000,\n    \"throughput\": 500,\n    \"latency\": 0.5,\n    
\"feature_web_search\": true,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": true,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-08-05T19:49:16.965756+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:16.965756+00:00\",\n    \"provider_model_id_used\": \"gpt-oss-120b\",\n    \"model_name\": \"OpenAI OSS 120B\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 1232,\n    \"model_id\": \"gpt-oss-20b\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 10,\n    \"output_cents_per_million_tokens\": 50,\n    \"quantization\": null,\n    \"max_input_tokens\": 131000,\n    \"max_output_tokens\": 30000,\n    \"throughput\": 1000,\n    \"latency\": 0.38,\n    \"feature_web_search\": true,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": true,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-08-05T19:49:16.965756+00:00\",\n    \"updated_at\": \"2025-08-05T19:49:16.965756+00:00\",\n    \"provider_model_id_used\": \"gpt-oss-20b\",\n    \"model_name\": \"OpenAI OSS 20B\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 1233,\n   
 \"model_id\": \"gpt-5-2025-08-07\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 125,\n    \"output_cents_per_million_tokens\": 1000,\n    \"quantization\": null,\n    \"max_input_tokens\": 400000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 100.0,\n    \"latency\": 2.0,\n    \"feature_web_search\": true,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": true,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": true,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"gpt-5\",\n    \"model_name\": \"GPT-5\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 1234,\n    \"model_id\": \"gpt-5-mini-2025-08-07\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 25,\n    \"output_cents_per_million_tokens\": 200,\n    \"quantization\": null,\n    \"max_input_tokens\": 400000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 200.0,\n    \"latency\": 1.0,\n    \"feature_web_search\": true,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": true,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": true,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    
\"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"gpt-5-mini\",\n    \"model_name\": \"GPT-5 mini\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 1235,\n    \"model_id\": \"gpt-5-nano-2025-08-07\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 5,\n    \"output_cents_per_million_tokens\": 40,\n    \"quantization\": null,\n    \"max_input_tokens\": 400000,\n    \"max_output_tokens\": 128000,\n    \"throughput\": 500.0,\n    \"latency\": 0.3,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": true,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": true,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"updated_at\": \"2025-07-24T12:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"gpt-5-nano\",\n    \"model_name\": \"GPT-5 nano\",\n    \"organization_id\": \"openai\"\n  },\n  {\n    \"model_provider_id\": 1236,\n    \"model_id\": \"deepseek-v3.2-exp\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 27,\n    \"output_cents_per_million_tokens\": 41,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 163840,\n    \"max_output_tokens\": 65536,\n    \"throughput\": 100.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": 
true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"deepseek/deepseek-v3.2-exp\",\n    \"model_name\": \"DeepSeek V3.2 Exp\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 1237,\n    \"model_id\": \"glm-4.5\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 60,\n    \"output_cents_per_million_tokens\": 220,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 98304,\n    \"throughput\": 85.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"zai-org/glm-4.5\",\n    \"model_name\": \"GLM-4.5\",\n    \"organization_id\": \"zai-org\"\n  },\n  {\n    \"model_provider_id\": 1238,\n    \"model_id\": \"glm-4.5v\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": 
null,\n    \"input_cents_per_million_tokens\": 60,\n    \"output_cents_per_million_tokens\": 220,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 65536,\n    \"throughput\": 85.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": true,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"zai-org/GLM-4.5V\",\n    \"model_name\": \"GLM-4.5V\",\n    \"organization_id\": \"zai-org\"\n  },\n  {\n    \"model_provider_id\": 1239,\n    \"model_id\": \"kimi-k2-0905\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 60,\n    \"output_cents_per_million_tokens\": 250,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 262144,\n    \"max_output_tokens\": 262144,\n    \"throughput\": null,\n    \"latency\": null,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": 
\"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"moonshotai/kimi-k2-0905\",\n    \"model_name\": \"Kimi K2 0905\",\n    \"organization_id\": \"moonshotai\"\n  },\n  {\n    \"model_provider_id\": 1241,\n    \"model_id\": \"deepseek-r1\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 600,\n    \"quantization\": null,\n    \"max_input_tokens\": 65536,\n    \"max_output_tokens\": 65536,\n    \"throughput\": 189.0,\n    \"latency\": 0.067,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": false,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": false,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-29T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"deepseek-r1\",\n    \"model_name\": \"DeepSeek R1 671B\",\n    \"organization_id\": \"deepseek\"\n  },\n  {\n    \"model_provider_id\": 1242,\n    \"model_id\": \"claude-sonnet-4-5-20250929\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 300,\n    \"output_cents_per_million_tokens\": 1500,\n    \"quantization\": null,\n    \"max_input_tokens\": 200000,\n    \"max_output_tokens\": 64000,\n    \"throughput\": 42.0,\n    \"latency\": 0.4,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    
\"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": true,\n    \"input_modality_video\": true,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-07-19T19:49:17.084616+00:00\",\n    \"updated_at\": \"2025-07-19T19:49:17.084616+00:00\",\n    \"provider_model_id_used\": \"claude-sonnet-4-5-20250929\",\n    \"model_name\": \"Claude Sonnet 4.5\",\n    \"organization_id\": \"anthropic\"\n  },\n  {\n    \"model_provider_id\": 1243,\n    \"model_id\": \"glm-4.6\",\n    \"provider_id\": \"zeroeval\",\n    \"deprecated_at\": null,\n    \"input_cents_per_million_tokens\": 60,\n    \"output_cents_per_million_tokens\": 200,\n    \"quantization\": \"fp8\",\n    \"max_input_tokens\": 131072,\n    \"max_output_tokens\": 65536,\n    \"throughput\": 85.0,\n    \"latency\": 0.7,\n    \"feature_web_search\": false,\n    \"feature_function_calling\": true,\n    \"feature_structured_output\": true,\n    \"feature_code_execution\": false,\n    \"feature_batch_inference\": true,\n    \"feature_finetuning\": false,\n    \"input_modality_text\": true,\n    \"input_modality_image\": true,\n    \"input_modality_audio\": false,\n    \"input_modality_video\": true,\n    \"output_modality_text\": true,\n    \"output_modality_image\": false,\n    \"output_modality_audio\": false,\n    \"output_modality_video\": false,\n    \"created_at\": \"2025-09-30T00:00:00.000000+00:00\",\n    \"updated_at\": \"2025-09-30T00:00:00.000000+00:00\",\n    \"provider_model_id_used\": \"zai-org/GLM-4.6\",\n    \"model_name\": \"GLM-4.6\",\n    \"organization_id\": \"zai-org\"\n  }\n]\n"
  },
  {
    "path": "data/providers/zeroeval/provider.json",
    "content": "{\n  \"provider_id\": \"zeroeval\",\n  \"name\": \"ZeroEval\",\n  \"website\": \"https://zeroeval.com\",\n  \"created_at\": \"2025-07-15T06:36:02.543462+00:00\",\n  \"updated_at\": \"2025-07-15T06:36:02.543462+00:00\"\n}"
  },
  {
    "path": "package.json",
    "content": "{\n  \"scripts\": {\n    \"validate-schemas\": \"node scripts/validate-schemas.js\"\n  },\n  \"devDependencies\": {\n    \"glob\": \"^10.4.5\",\n    \"tv4\": \"^1.3.0\"\n  }\n}\n"
  },
  {
    "path": "schemas/README.md",
    "content": "# JSON Schemas for LLM Stats Data\n\nThis directory contains JSON Schema definitions for all data types used in the LLM Stats project. These schemas define the structure, types, and validation rules for data stored in the hierarchical file system under `data/`.\n\n## Schema Files\n\n### Core Entity Schemas\n\n- **`organization.schema.json`** - Schema for AI/ML organizations (e.g., OpenAI, Anthropic)\n- **`model.schema.json`** - Schema for model metadata\n- **`license.schema.json`** - Schema for software licenses governing model usage\n- **`benchmark.schema.json`** - Schema for evaluation benchmark definitions\n- **`provider.schema.json`** - Schema for model inference providers (e.g., AWS Bedrock, Google Vertex)\n\n### Relationship Schemas\n\n- **`benchmark-results.schema.json`** - Schema for model performance scores on benchmarks\n- **`provider-models.schema.json`** - Schema for provider-specific model configurations and pricing\n\n## Data Structure\n\nThe schemas correspond to data organized hierarchically:\n\n```\ndata/\n├── organizations/\n│   └── [org_id]/\n│       ├── organization.json    # Validates against organization.schema.json\n│       └── models/\n│           └── [model_id]/\n│               ├── model.json       # Validates against model.schema.json\n│               └── benchmarks.json  # Array validating against benchmark-results.schema.json\n├── providers/\n│   └── [provider_id]/\n│       ├── provider.json        # Validates against provider.schema.json\n│       └── models.json          # Array validating against provider-models.schema.json\n├── licenses/\n│   └── [license_id].json        # Validates against license.schema.json\n└── benchmarks/\n    └── [benchmark_id].json      # Validates against benchmark.schema.json\n```\n\n## Usage\n\nThese schemas can be used for:\n\n1. **Data Validation** - Ensure all data files conform to expected structure\n2. **Documentation** - Understand what fields are available and their meanings\n3. 
**Code Generation** - Generate TypeScript interfaces or other language types\n4. **API Contracts** - Define expected request/response formats\n\n## Validation Example\n\nTo validate a data file against its schema using Python:\n\n```python\nimport json\nimport jsonschema\n\n# Load the schema\nwith open('schemas/model.schema.json') as f:\n    schema = json.load(f)\n\n# Load the data file to check\nwith open('data/organizations/openai/models/gpt-4/model.json') as f:\n    data = json.load(f)\n\n# Validate; raises jsonschema.ValidationError if the data does not conform\njsonschema.validate(instance=data, schema=schema)\n```\n\n## Schema Features\n\nAll schemas use JSON Schema Draft 7 and include:\n\n- **Descriptions** - Every field has a human-readable description\n- **Types** - Strict type definitions with null handling\n- **Patterns** - Regular expressions for ID formats\n- **Examples** - Real-world examples for clarity\n- **Enums** - Restricted value sets where applicable\n- **Format Validators** - `format` annotations for dates, date-times, and URIs (whether they are enforced depends on the validator)\n- **Required Fields** - Each schema lists its required fields explicitly; all other fields are optional\n\n## Contributing\n\nWhen adding new fields or modifying schemas:\n\n1. Update the relevant schema file\n2. Add clear descriptions and examples\n3. Consider backward compatibility\n4. Update this README if adding new schemas\n5. Validate existing data against updated schemas\n\n## Schema Versioning\n\nCurrently, all schemas target JSON Schema Draft 7. Future versions may adopt newer drafts as tooling support improves.\n"
  },
  {
    "path": "schemas/benchmark-results.schema.json",
    "content": "{\n  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n  \"title\": \"ModelBenchmark\",\n  \"description\": \"Schema for model performance scores on benchmarks\",\n  \"type\": \"object\",\n  \"properties\": {\n    \"model_benchmark_id\": {\n      \"type\": \"integer\",\n      \"description\": \"Unique identifier for this model-benchmark result\",\n      \"minimum\": 1\n    },\n    \"benchmark_id\": {\n      \"type\": \"string\",\n      \"description\": \"ID of the benchmark\"\n    },\n    \"model_id\": {\n      \"type\": \"string\",\n      \"description\": \"ID of the model\"\n    },\n    \"score\": {\n      \"type\": \"number\",\n      \"description\": \"Raw score achieved on the benchmark\",\n      \"minimum\": 0\n    },\n    \"normalized_score\": {\n      \"type\": [\"number\", \"null\"],\n      \"description\": \"Score normalized to 0-1 range for cross-benchmark comparison\",\n      \"minimum\": 0,\n      \"maximum\": 1\n    },\n    \"is_self_reported\": {\n      \"type\": \"boolean\",\n      \"description\": \"Whether the score was self-reported by the model creator\",\n      \"default\": true\n    },\n    \"self_reported_source_link\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"uri\",\n      \"description\": \"URL to the source of self-reported scores\"\n    },\n    \"verified_by_llmstats\": {\n      \"type\": \"boolean\",\n      \"description\": \"Whether the score has been independently verified by llm-stats\",\n      \"default\": false\n    },\n    \"analysis_method\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"Method used for evaluation (e.g., '0-shot', '5-shot', 'CoT')\",\n      \"examples\": [\n        \"0-shot\",\n        \"5-shot\",\n        \"few-shot\",\n        \"chain-of-thought\",\n        \"zero-shot CoT\"\n      ]\n    },\n    \"verification_provider_id\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"Provider used for independent verification\"\n    
},\n    \"verification_hardware\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"Hardware used for verification\",\n      \"examples\": [\"H100 on Modal\", \"A100 on AWS\", \"4xA100 on GCP\"]\n    },\n    \"verification_date\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"date\",\n      \"description\": \"Date when the score was independently verified\"\n    },\n    \"verification_notes\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"Additional notes about the verification process\"\n    },\n    \"created_at\": {\n      \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was created\"\n    },\n    \"updated_at\": {\n      \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was last updated\"\n    },\n    \"benchmark_name\": {\n      \"type\": \"string\",\n      \"description\": \"Display name of the benchmark (denormalized for convenience)\"\n    }\n  },\n  \"required\": [\n    \"model_benchmark_id\",\n    \"benchmark_id\",\n    \"model_id\",\n    \"score\",\n    \"is_self_reported\",\n    \"verified_by_llmstats\",\n    \"created_at\",\n    \"updated_at\",\n    \"benchmark_name\"\n  ],\n  \"additionalProperties\": false\n}\n"
  },
  {
    "path": "schemas/benchmark.schema.json",
    "content": "{\n  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n  \"title\": \"Benchmark\",\n  \"description\": \"Schema for AI/ML evaluation benchmark definitions\",\n  \"type\": \"object\",\n  \"properties\": {\n    \"benchmark_id\": {\n      \"type\": \"string\",\n      \"description\": \"Unique identifier for the benchmark\",\n      \"examples\": [\n        \"mmlu\",\n        \"humaneval\",\n        \"arc-c\",\n        \"gsm8k\",\n        \"mbpp-pass@1\",\n        \"humanity's-last-exam\"\n      ]\n    },\n    \"name\": {\n      \"type\": \"string\",\n      \"description\": \"Display name of the benchmark\",\n      \"examples\": [\"MMLU\", \"HumanEval\", \"ARC-Challenge\", \"GSM8K\"]\n    },\n    \"parent_benchmark_id\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"ID of parent benchmark if this is a subset or variant\"\n    },\n    \"categories\": {\n      \"type\": \"array\",\n      \"description\": \"Array of categories that this benchmark belongs to\",\n      \"items\": {\n        \"type\": \"string\",\n        \"enum\": [\n          \"general\",\n          \"code\",\n          \"math\",\n          \"reasoning\",\n          \"language\",\n          \"multimodal\",\n          \"safety\",\n          \"long_context\",\n          \"roleplay\",\n          \"agents\",\n          \"factuality\",\n          \"vision\",\n          \"audio\",\n          \"video\",\n          \"text-to-image\",\n          \"image-to-text\",\n          \"text-to-speech\",\n          \"speech-to-text\",\n          \"text-to-video\",\n          \"video-to-text\",\n          \"legal\",\n          \"healthcare\",\n          \"finance\",\n          \"chemistry\",\n          \"economics\",\n          \"coding\",\n          \"creativity\",\n          \"psychology\",\n          \"games\",\n          \"communication\",\n          \"physics\",\n          \"spatial_reasoning\",\n          \"summarization\",\n          \"frontend_development\",\n          
\"writing\",\n          \"search\"\n        ]\n      },\n      \"minItems\": 1,\n      \"uniqueItems\": true,\n      \"examples\": [\n        [\"general\"],\n        [\"code\", \"reasoning\"],\n        [\"math\", \"reasoning\"],\n        [\"vision\", \"multimodal\"]\n      ]\n    },\n    \"modality\": {\n      \"type\": \"string\",\n      \"description\": \"Primary modality of the benchmark\",\n      \"enum\": [\"text\", \"image\", \"audio\", \"video\", \"multimodal\"]\n    },\n    \"multilingual\": {\n      \"type\": \"boolean\",\n      \"description\": \"Whether the benchmark tests multiple languages\",\n      \"default\": false\n    },\n    \"max_score\": {\n      \"type\": \"number\",\n      \"description\": \"Maximum possible score on the benchmark\",\n      \"minimum\": 0,\n      \"default\": 1.0,\n      \"examples\": [1.0, 100.0]\n    },\n    \"language\": {\n      \"type\": \"string\",\n      \"description\": \"Primary language of the benchmark (ISO 639-1 code)\",\n      \"default\": \"en\",\n      \"examples\": [\"en\", \"zh\", \"es\", \"fr\"]\n    },\n    \"description\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"Detailed description of what the benchmark measures\"\n    },\n    \"paper_link\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"uri\",\n      \"description\": \"URL to the research paper introducing the benchmark\"\n    },\n    \"implementation_link\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"uri\",\n      \"description\": \"URL to the official implementation or dataset\"\n    },\n    \"verified\": {\n      \"type\": \"boolean\",\n      \"description\": \"Whether the benchmark has been verified by llm-stats maintainers\",\n      \"default\": false\n    },\n    \"created_at\": {\n      \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was created\"\n    },\n    \"updated_at\": {\n      \"type\": \"string\",\n      
\"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was last updated\"\n    }\n  },\n  \"required\": [\n    \"benchmark_id\",\n    \"name\",\n    \"categories\",\n    \"modality\",\n    \"multilingual\",\n    \"max_score\",\n    \"language\",\n    \"verified\",\n    \"created_at\",\n    \"updated_at\"\n  ],\n  \"additionalProperties\": false\n}\n"
  },
  {
    "path": "schemas/integrity-validator.js",
    "content": "const fs = require(\"fs\");\nconst path = require(\"path\");\nconst glob = require(\"glob\");\n\nclass IntegrityValidator {\n  constructor(dataDir) {\n    this.dataDir = dataDir || path.join(__dirname, \"..\", \"data\");\n    this.errors = [];\n    this.warnings = [];\n\n    // Collections to store all entities\n    this.models = new Map();\n    this.benchmarks = new Map();\n    this.organizations = new Map();\n    this.licenses = new Map();\n    this.providers = new Map();\n\n    // Maps to check for duplicates\n    // Note: Model names can be duplicated (e.g., different versions), only IDs must be unique\n    this.benchmarkNames = new Map();\n  }\n\n  loadJSON(filePath) {\n    try {\n      const content = fs.readFileSync(filePath, \"utf8\");\n      return JSON.parse(content);\n    } catch (error) {\n      this.errors.push(`Failed to load ${filePath}: ${error.message}`);\n      return null;\n    }\n  }\n\n  // Load all data into memory\n  async loadAllData() {\n    console.log(\"\\n📂 Loading all data files...\\n\");\n\n    // Load organizations\n    const orgFiles = glob.sync(\n      path.join(this.dataDir, \"organizations/*/organization.json\")\n    );\n    for (const file of orgFiles) {\n      const data = this.loadJSON(file);\n      if (data) {\n        this.organizations.set(data.organization_id, data);\n      }\n    }\n    console.log(`✅ Loaded ${this.organizations.size} organizations`);\n\n    // Load models\n    const modelFiles = glob.sync(\n      path.join(this.dataDir, \"organizations/*/models/*/model.json\")\n    );\n    for (const file of modelFiles) {\n      const data = this.loadJSON(file);\n      if (data) {\n        // Check for duplicate model IDs\n        if (this.models.has(data.model_id)) {\n          const existing = this.models.get(data.model_id);\n          this.errors.push(\n            `❌ Duplicate model ID \"${data.model_id}\" found:\\n` +\n            `   - First occurrence: ${path.relative(this.dataDir, 
existing.file)}\\n` +\n            `   - Duplicate found: ${path.relative(this.dataDir, file)}`\n          );\n        }\n        this.models.set(data.model_id, { ...data, file });\n      }\n    }\n    console.log(`✅ Loaded ${this.models.size} models`);\n\n    // Load benchmarks\n    const benchmarkFiles = glob.sync(\n      path.join(this.dataDir, \"benchmarks/*.json\")\n    );\n    for (const file of benchmarkFiles) {\n      const data = this.loadJSON(file);\n      if (data) {\n        // Check for duplicate benchmark IDs\n        if (this.benchmarks.has(data.benchmark_id)) {\n          const existing = this.benchmarks.get(data.benchmark_id);\n          this.errors.push(\n            `❌ Duplicate benchmark ID \"${data.benchmark_id}\" found:\\n` +\n            `   - First occurrence: ${path.relative(this.dataDir, existing.file)}\\n` +\n            `   - Duplicate found: ${path.relative(this.dataDir, file)}`\n          );\n        }\n        this.benchmarks.set(data.benchmark_id, { ...data, file });\n\n        // Check for duplicate benchmark names\n        if (this.benchmarkNames.has(data.name)) {\n          this.benchmarkNames\n            .get(data.name)\n            .push({ id: data.benchmark_id, file });\n        } else {\n          this.benchmarkNames.set(data.name, [{ id: data.benchmark_id, file }]);\n        }\n      }\n    }\n    console.log(`✅ Loaded ${this.benchmarks.size} benchmarks`);\n\n    // Load licenses\n    const licenseFiles = glob.sync(path.join(this.dataDir, \"licenses/*.json\"));\n    for (const file of licenseFiles) {\n      const data = this.loadJSON(file);\n      if (data) {\n        this.licenses.set(data.license_id, data);\n      }\n    }\n    console.log(`✅ Loaded ${this.licenses.size} licenses`);\n\n    // Load providers\n    const providerFiles = glob.sync(\n      path.join(this.dataDir, \"providers/*/provider.json\")\n    );\n    for (const file of providerFiles) {\n      const data = this.loadJSON(file);\n      if (data) {\n        
this.providers.set(data.provider_id, data);\n      }\n    }\n    console.log(`✅ Loaded ${this.providers.size} providers`);\n  }\n\n  // Check for duplicate names\n  checkDuplicates() {\n    console.log(\"\\n🔍 Checking for duplicate names...\\n\");\n\n    let duplicatesFound = false;\n\n    // Check duplicate benchmark names (benchmark names should be unique)\n    for (const [name, instances] of this.benchmarkNames.entries()) {\n      if (instances.length > 1) {\n        duplicatesFound = true;\n        this.errors.push(\n          `❌ Duplicate benchmark name \"${name}\" found in ${instances.length} benchmarks:\\n` +\n            instances\n              .map(\n                (i) => `   - ${i.id} in ${path.relative(this.dataDir, i.file)}`\n              )\n              .join(\"\\n\")\n        );\n      }\n    }\n\n    if (!duplicatesFound) {\n      console.log(\"✅ No duplicate benchmark names found\");\n    }\n\n    // Note: Model names can be duplicated (e.g., different versions of the same model)\n    // IDs are checked during loading and must be unique\n  }\n\n  // Check all references\n  checkReferences() {\n    console.log(\"\\n🔗 Checking references...\\n\");\n\n    // Check model references\n    for (const [modelId, model] of this.models.entries()) {\n      const relPath = path.relative(this.dataDir, model.file);\n\n      // Check organization reference\n      if (\n        model.organization_id &&\n        !this.organizations.has(model.organization_id)\n      ) {\n        this.errors.push(\n          `❌ Model \"${modelId}\" references non-existent organization \"${model.organization_id}\"\\n` +\n            `   in ${relPath}`\n        );\n      }\n\n      // Check license reference\n      if (model.license_id && !this.licenses.has(model.license_id)) {\n        this.errors.push(\n          `❌ Model \"${modelId}\" references non-existent license \"${model.license_id}\"\\n` +\n            `   in ${relPath}`\n        );\n      }\n\n      // Check fine-tuned 
from reference\n      if (\n        model.fine_tuned_from_model_id &&\n        !this.models.has(model.fine_tuned_from_model_id)\n      ) {\n        this.errors.push(\n          `❌ Model \"${modelId}\" references non-existent base model \"${model.fine_tuned_from_model_id}\"\\n` +\n            `   in ${relPath}`\n        );\n      }\n\n      // Check model family reference\n      if (model.model_family_id && !this.models.has(model.model_family_id)) {\n        this.warnings.push(\n          `⚠️  Model \"${modelId}\" references model family \"${model.model_family_id}\" which doesn't exist as a model\\n` +\n            `   in ${relPath}`\n        );\n      }\n    }\n\n    // Check benchmark results references\n    const benchmarkResultFiles = glob.sync(\n      path.join(this.dataDir, \"organizations/*/models/*/benchmarks.json\")\n    );\n\n    for (const file of benchmarkResultFiles) {\n      const results = this.loadJSON(file);\n      if (results && Array.isArray(results)) {\n        const relPath = path.relative(this.dataDir, file);\n\n        for (let i = 0; i < results.length; i++) {\n          const result = results[i];\n\n          // Check model_id reference\n          if (result.model_id && !this.models.has(result.model_id)) {\n            this.errors.push(\n              `❌ Benchmark result [${i}] references non-existent model \"${result.model_id}\"\\n` +\n                `   in ${relPath}`\n            );\n          }\n\n          // Check benchmark_id reference\n          if (\n            result.benchmark_id &&\n            !this.benchmarks.has(result.benchmark_id)\n          ) {\n            this.errors.push(\n              `❌ Benchmark result [${i}] references non-existent benchmark \"${result.benchmark_id}\"\\n` +\n                `   in ${relPath}`\n            );\n          }\n\n          // Check verification_provider_id reference\n          if (\n            result.verification_provider_id &&\n            
!this.providers.has(result.verification_provider_id)\n          ) {\n            this.warnings.push(\n              `⚠️  Benchmark result [${i}] references non-existent verification provider \"${result.verification_provider_id}\"\\n` +\n                `   in ${relPath}`\n            );\n          }\n        }\n      }\n    }\n\n    // Check provider models references\n    const providerModelFiles = glob.sync(\n      path.join(this.dataDir, \"providers/*/models.json\")\n    );\n\n    for (const file of providerModelFiles) {\n      const models = this.loadJSON(file);\n      if (models && Array.isArray(models)) {\n        const relPath = path.relative(this.dataDir, file);\n\n        for (let i = 0; i < models.length; i++) {\n          const providerModel = models[i];\n\n          // Check model_id reference\n          if (\n            providerModel.model_id &&\n            !this.models.has(providerModel.model_id)\n          ) {\n            this.errors.push(\n              `❌ Provider model [${i}] references non-existent model \"${providerModel.model_id}\"\\n` +\n                `   in ${relPath}`\n            );\n          }\n\n          // Check provider_id reference\n          if (\n            providerModel.provider_id &&\n            !this.providers.has(providerModel.provider_id)\n          ) {\n            this.errors.push(\n              `❌ Provider model [${i}] references non-existent provider \"${providerModel.provider_id}\"\\n` +\n                `   in ${relPath}`\n            );\n          }\n        }\n      }\n    }\n\n    // Check benchmark parent references\n    for (const [benchmarkId, benchmark] of this.benchmarks.entries()) {\n      if (\n        benchmark.parent_benchmark_id &&\n        !this.benchmarks.has(benchmark.parent_benchmark_id)\n      ) {\n        const relPath = path.relative(this.dataDir, benchmark.file);\n        this.errors.push(\n          `❌ Benchmark \"${benchmarkId}\" references non-existent parent benchmark 
\"${benchmark.parent_benchmark_id}\"\\n` +\n            `   in ${relPath}`\n        );\n      }\n    }\n\n    if (this.errors.length === 0 && this.warnings.length === 0) {\n      console.log(\"✅ All references are valid\");\n    }\n  }\n\n  // Check for orphaned data\n  checkOrphans() {\n    console.log(\"\\n👻 Checking for orphaned data...\\n\");\n\n    // Check for models without benchmark results\n    const modelsWithBenchmarks = new Set();\n    const benchmarkResultFiles = glob.sync(\n      path.join(this.dataDir, \"organizations/*/models/*/benchmarks.json\")\n    );\n\n    for (const file of benchmarkResultFiles) {\n      const results = this.loadJSON(file);\n      if (results && Array.isArray(results)) {\n        results.forEach((r) => modelsWithBenchmarks.add(r.model_id));\n      }\n    }\n\n    let modelsWithoutBenchmarks = 0;\n    for (const modelId of this.models.keys()) {\n      if (!modelsWithBenchmarks.has(modelId)) {\n        modelsWithoutBenchmarks++;\n      }\n    }\n\n    if (modelsWithoutBenchmarks > 0) {\n      this.warnings.push(\n        `⚠️  ${modelsWithoutBenchmarks} models have no benchmark results`\n      );\n    }\n\n    // Check for unused benchmarks\n    const usedBenchmarks = new Set();\n    for (const file of benchmarkResultFiles) {\n      const results = this.loadJSON(file);\n      if (results && Array.isArray(results)) {\n        results.forEach((r) => usedBenchmarks.add(r.benchmark_id));\n      }\n    }\n\n    let unusedBenchmarks = 0;\n    for (const benchmarkId of this.benchmarks.keys()) {\n      if (!usedBenchmarks.has(benchmarkId)) {\n        unusedBenchmarks++;\n      }\n    }\n\n    if (unusedBenchmarks > 0) {\n      this.warnings.push(\n        `⚠️  ${unusedBenchmarks} benchmarks are not used by any model`\n      );\n    }\n\n    // Check for unused licenses\n    const usedLicenses = new Set();\n    for (const model of this.models.values()) {\n      if (model.license_id) {\n        usedLicenses.add(model.license_id);\n      
}\n    }\n\n    let unusedLicenses = 0;\n    for (const licenseId of this.licenses.keys()) {\n      if (!usedLicenses.has(licenseId)) {\n        unusedLicenses++;\n      }\n    }\n\n    if (unusedLicenses > 0) {\n      this.warnings.push(\n        `⚠️  ${unusedLicenses} licenses are not used by any model`\n      );\n    }\n  }\n\n  // Main validation function\n  async validate() {\n    console.log(\"🔍 Running Data Integrity Validation...\\n\");\n    console.log(`Data directory: ${this.dataDir}\\n`);\n\n    await this.loadAllData();\n    this.checkDuplicates();\n    this.checkReferences();\n    this.checkOrphans();\n\n    // Print summary\n    console.log(\"\\n\" + \"=\".repeat(60));\n    console.log(\"📊 Validation Summary\");\n    console.log(\"=\".repeat(60));\n\n    if (this.errors.length > 0) {\n      console.log(`\\n❌ Found ${this.errors.length} errors:\\n`);\n      this.errors.forEach((error) => console.log(error));\n    }\n\n    if (this.warnings.length > 0) {\n      console.log(`\\n⚠️  Found ${this.warnings.length} warnings:\\n`);\n      this.warnings.forEach((warning) => console.log(warning));\n    }\n\n    if (this.errors.length === 0 && this.warnings.length === 0) {\n      console.log(\"\\n✅ All integrity checks passed! 🎉\");\n      return true;\n    }\n\n    console.log(\"\\n\" + \"=\".repeat(60));\n\n    return this.errors.length === 0;\n  }\n}\n\n// Run validation if called directly\nif (require.main === module) {\n  const validator = new IntegrityValidator();\n  validator.validate().then((success) => {\n    process.exit(success ? 0 : 1);\n  });\n}\n\nmodule.exports = IntegrityValidator;\n"
  },
  {
    "path": "schemas/license.schema.json",
    "content": "{\n  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n  \"title\": \"License\",\n  \"description\": \"Schema for model license definitions\",\n  \"type\": \"object\",\n  \"properties\": {\n    \"license_id\": {\n      \"type\": \"string\",\n      \"description\": \"Unique identifier for the license\",\n      \"examples\": [\"apache_2_0\", \"mit\", \"proprietary\", \"cc_by_nc\"]\n    },\n    \"name\": {\n      \"type\": \"string\",\n      \"description\": \"Display name of the license\",\n      \"examples\": [\"Apache 2.0\", \"MIT License\", \"Proprietary\", \"CC BY-NC 4.0\"]\n    },\n    \"allow_commercial\": {\n      \"type\": \"boolean\",\n      \"description\": \"Whether the license allows commercial use of the model\"\n    },\n    \"description\": {\n      \"type\": \"string\",\n      \"description\": \"Brief description of the license terms and restrictions\",\n      \"examples\": [\n        \"Apache License 2.0 - allows commercial use\",\n        \"Non-commercial research use only\",\n        \"Proprietary license - contact vendor for terms\"\n      ]\n    },\n    \"created_at\": {\n      \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was created\"\n    },\n    \"updated_at\": {\n      \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was last updated\"\n    }\n  },\n  \"required\": [\n    \"license_id\",\n    \"name\",\n    \"allow_commercial\",\n    \"description\",\n    \"created_at\",\n    \"updated_at\"\n  ],\n  \"additionalProperties\": false\n}\n"
  },
  {
    "path": "schemas/model.schema.json",
    "content": "{\n  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n  \"title\": \"Model\",\n  \"description\": \"Schema for AI/ML model metadata\",\n  \"type\": \"object\",\n  \"properties\": {\n    \"model_id\": {\n      \"type\": \"string\",\n      \"description\": \"Unique identifier for the model\",\n      \"examples\": [\"gpt-4\", \"claude-3-opus\", \"llama-3.1-405b\"]\n    },\n    \"name\": {\n      \"type\": \"string\",\n      \"description\": \"Display name of the model\",\n      \"examples\": [\"GPT-4\", \"Claude 3 Opus\", \"Llama 3.1 405B\"]\n    },\n    \"organization_id\": {\n      \"type\": \"string\",\n      \"description\": \"ID of the organization that created the model\"\n    },\n    \"model_family_id\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"ID of the model family this model belongs to\",\n      \"examples\": [\"gpt-4\", \"claude-3\", \"llama-3-1\"]\n    },\n    \"fine_tuned_from_model_id\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"ID of the base model this was fine-tuned from\"\n    },\n    \"description\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"Detailed description of the model's capabilities and use cases\"\n    },\n    \"release_date\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"date\",\n      \"description\": \"Date when the model was released (YYYY-MM-DD)\",\n      \"examples\": [\"2024-11-20\", \"2023-03-14\"]\n    },\n    \"announcement_date\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"date\",\n      \"description\": \"Date when the model was first announced (YYYY-MM-DD)\"\n    },\n    \"license_id\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"ID of the license governing the model's use\"\n    },\n    \"multimodal\": {\n      \"type\": \"boolean\",\n      \"description\": \"Whether the model supports multiple input/output modalities\",\n      \"default\": false\n    },\n   
 \"knowledge_cutoff\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"date\",\n      \"description\": \"Date up to which the model has training data (YYYY-MM-DD)\"\n    },\n    \"param_count\": {\n      \"type\": [\"number\", \"null\"],\n      \"description\": \"Number of parameters in the model (in billions)\",\n      \"minimum\": 0,\n      \"examples\": [175, 405, 1.8]\n    },\n    \"training_tokens\": {\n      \"type\": [\"number\", \"null\"],\n      \"description\": \"Number of tokens the model was trained on (in trillions)\",\n      \"minimum\": 0\n    },\n    \"available_in_zeroeval\": {\n      \"type\": \"boolean\",\n      \"description\": \"Whether the model is available for evaluation in ZeroEval\",\n      \"default\": true\n    },\n    \"source_api_ref\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"uri\",\n      \"description\": \"URL to the official API documentation\"\n    },\n    \"source_playground\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"uri\",\n      \"description\": \"URL to an interactive playground or demo\"\n    },\n    \"source_paper\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"uri\",\n      \"description\": \"URL to the research paper or technical report\"\n    },\n    \"source_scorecard_blog_link\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"uri\",\n      \"description\": \"URL to scorecard or evaluation blog post\"\n    },\n    \"source_repo_link\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"uri\",\n      \"description\": \"URL to the model's code repository\"\n    },\n    \"source_weights_link\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"uri\",\n      \"description\": \"URL to download model weights\"\n    },\n    \"created_at\": {\n      \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was created\"\n    },\n    \"updated_at\": {\n     
 \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was last updated\"\n    }\n  },\n  \"required\": [\n    \"model_id\",\n    \"name\",\n    \"organization_id\",\n    \"multimodal\",\n    \"available_in_zeroeval\",\n    \"created_at\",\n    \"updated_at\"\n  ],\n  \"additionalProperties\": false\n}\n"
  },
  {
    "path": "schemas/organization.schema.json",
    "content": "{\n  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n  \"title\": \"Organization\",\n  \"description\": \"Schema for AI/ML organization data\",\n  \"type\": \"object\",\n  \"properties\": {\n    \"organization_id\": {\n      \"type\": \"string\",\n      \"description\": \"Unique identifier for the organization\",\n      \"examples\": [\"openai\", \"anthropic\", \"google\", \"amazon\"]\n    },\n    \"name\": {\n      \"type\": \"string\",\n      \"description\": \"Display name of the organization\",\n      \"examples\": [\"OpenAI\", \"Anthropic\", \"Google\", \"Amazon\"]\n    },\n    \"website\": {\n      \"type\": \"string\",\n      \"format\": \"uri\",\n      \"description\": \"Official website URL of the organization\",\n      \"examples\": [\"https://openai.com\", \"https://anthropic.com\"]\n    },\n    \"description\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"Brief description of the organization and its focus areas\",\n      \"examples\": [\"Cloud and AI services\", \"AI safety and research company\"]\n    },\n    \"country\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"Country where the organization is headquartered (ISO 3166-1 alpha-2 code)\",\n      \"examples\": [\"US\", \"UK\", \"CN\"]\n    },\n    \"created_at\": {\n      \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was created in the database\"\n    },\n    \"updated_at\": {\n      \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was last updated in the database\"\n    }\n  },\n  \"required\": [\n    \"organization_id\",\n    \"name\",\n    \"website\",\n    \"created_at\",\n    \"updated_at\"\n  ],\n  \"additionalProperties\": false\n}\n"
  },
  {
    "path": "schemas/provider-models.schema.json",
    "content": "{\n  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n  \"title\": \"ProviderModel\",\n  \"description\": \"Schema for provider-specific model configurations and pricing\",\n  \"type\": \"object\",\n  \"properties\": {\n    \"model_provider_id\": {\n      \"type\": \"integer\",\n      \"description\": \"Unique identifier for this provider-model configuration\",\n      \"minimum\": 1\n    },\n    \"model_id\": {\n      \"type\": \"string\",\n      \"description\": \"ID of the model\"\n    },\n    \"provider_id\": {\n      \"type\": \"string\",\n      \"description\": \"ID of the provider offering this model\"\n    },\n    \"provider_model_id_used\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"Model ID as used by the provider's API\",\n      \"examples\": [\"gpt-4-turbo\", \"claude-3-opus-20240229\"]\n    },\n    \"deprecated_at\": {\n      \"type\": [\"string\", \"null\"],\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when this model configuration was deprecated\"\n    },\n    \"input_cents_per_million_tokens\": {\n      \"type\": [\"number\", \"null\"],\n      \"description\": \"Cost in cents per million input tokens\",\n      \"minimum\": 0,\n      \"examples\": [1000, 300, 80]\n    },\n    \"output_cents_per_million_tokens\": {\n      \"type\": [\"number\", \"null\"],\n      \"description\": \"Cost in cents per million output tokens\",\n      \"minimum\": 0,\n      \"examples\": [3000, 1500, 400]\n    },\n    \"quantization\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"Quantization method applied to the model\",\n      \"examples\": [\"int8\", \"int4\", \"fp16\", \"bf16\"]\n    },\n    \"max_input_tokens\": {\n      \"type\": [\"integer\", \"null\"],\n      \"description\": \"Maximum number of input tokens supported\",\n      \"minimum\": 1,\n      \"examples\": [128000, 200000, 32000]\n    },\n    \"max_output_tokens\": {\n      \"type\": [\"integer\", 
\"null\"],\n      \"description\": \"Maximum number of output tokens supported\",\n      \"minimum\": 1,\n      \"examples\": [4096, 8192, 200000]\n    },\n    \"throughput\": {\n      \"type\": [\"number\", \"null\"],\n      \"description\": \"Tokens per second throughput\",\n      \"minimum\": 0,\n      \"examples\": [42.0, 150.5, 200.0]\n    },\n    \"latency\": {\n      \"type\": [\"number\", \"null\"],\n      \"description\": \"Time to first token in seconds\",\n      \"minimum\": 0,\n      \"examples\": [0.4, 0.2, 1.5]\n    },\n    \"feature_web_search\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether web search is available\",\n      \"default\": false\n    },\n    \"feature_function_calling\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether function/tool calling is supported\",\n      \"default\": false\n    },\n    \"feature_structured_output\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether structured output (JSON mode) is supported\",\n      \"default\": false\n    },\n    \"feature_code_execution\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether code execution is supported\",\n      \"default\": false\n    },\n    \"feature_batch_inference\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether batch inference is available\",\n      \"default\": false\n    },\n    \"feature_finetuning\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether fine-tuning is available\",\n      \"default\": false\n    },\n    \"input_modality_text\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether text input is supported\",\n      \"default\": true\n    },\n    \"input_modality_image\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether image input is supported\",\n      \"default\": false\n    },\n    \"input_modality_audio\": {\n      \"type\": 
[\"boolean\", \"null\"],\n      \"description\": \"Whether audio input is supported\",\n      \"default\": false\n    },\n    \"input_modality_video\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether video input is supported\",\n      \"default\": false\n    },\n    \"output_modality_text\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether text output is supported\",\n      \"default\": true\n    },\n    \"output_modality_image\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether image output is supported\",\n      \"default\": false\n    },\n    \"output_modality_audio\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether audio output is supported\",\n      \"default\": false\n    },\n    \"output_modality_video\": {\n      \"type\": [\"boolean\", \"null\"],\n      \"description\": \"Whether video output is supported\",\n      \"default\": false\n    },\n    \"created_at\": {\n      \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was created\"\n    },\n    \"updated_at\": {\n      \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was last updated\"\n    },\n    \"model_name\": {\n      \"type\": \"string\",\n      \"description\": \"Display name of the model (denormalized for convenience)\"\n    },\n    \"organization_id\": {\n      \"type\": \"string\",\n      \"description\": \"ID of the organization that created the model (denormalized)\"\n    }\n  },\n  \"required\": [\n    \"model_provider_id\",\n    \"model_id\",\n    \"provider_id\",\n    \"created_at\",\n    \"updated_at\",\n    \"model_name\",\n    \"organization_id\"\n  ],\n  \"additionalProperties\": false\n}\n"
  },
  {
    "path": "schemas/provider.schema.json",
    "content": "{\n  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n  \"title\": \"Provider\",\n  \"description\": \"Schema for AI model inference providers\",\n  \"type\": \"object\",\n  \"properties\": {\n    \"provider_id\": {\n      \"type\": \"string\",\n      \"description\": \"Unique identifier for the provider\",\n      \"examples\": [\"openai\", \"anthropic\", \"google\", \"aws-bedrock\", \"azure\"]\n    },\n    \"name\": {\n      \"type\": \"string\",\n      \"description\": \"Display name of the provider\",\n      \"examples\": [\n        \"OpenAI\",\n        \"Anthropic\",\n        \"Google\",\n        \"AWS Bedrock\",\n        \"Azure OpenAI\"\n      ]\n    },\n    \"website\": {\n      \"type\": \"string\",\n      \"format\": \"uri\",\n      \"description\": \"Official website or API documentation URL\",\n      \"examples\": [\"https://openai.com/api\", \"https://docs.anthropic.com\"]\n    },\n    \"created_at\": {\n      \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was created\"\n    },\n    \"updated_at\": {\n      \"type\": \"string\",\n      \"format\": \"date-time\",\n      \"description\": \"Timestamp when the record was last updated\"\n    }\n  },\n  \"required\": [\"provider_id\", \"name\", \"website\", \"created_at\", \"updated_at\"],\n  \"additionalProperties\": false\n}\n"
  },
  {
    "path": "schemas/validator.js",
    "content": "const fs = require(\"fs\");\nconst path = require(\"path\");\nconst tv4 = require(\"tv4\");\nconst glob = require(\"glob\");\n\nfunction validateSchema(schemaName, filePattern, isArray = false) {\n  console.log(`\\nValidating ${schemaName}...`);\n  const schemaPath = path.join(__dirname, `${schemaName}.schema.json`);\n\n  let schema;\n  try {\n    schema = JSON.parse(fs.readFileSync(schemaPath, \"utf8\"));\n  } catch (error) {\n    console.error(`Error reading schema file: ${schemaPath}`);\n    console.error(error);\n    return false;\n  }\n\n  const files = glob.sync(path.join(__dirname, \"..\", filePattern));\n\n  if (files.length === 0) {\n    console.warn(`⚠️ No files found matching pattern: ${filePattern}`);\n    return true;\n  }\n\n  let isValid = true;\n\n  for (const file of files) {\n    try {\n      const data = JSON.parse(fs.readFileSync(file, \"utf8\"));\n\n      // If expecting an array, validate each item\n      if (isArray) {\n        if (!Array.isArray(data)) {\n          console.error(\n            `❌ Invalid: ${file} - Expected array but got ${typeof data}`\n          );\n          isValid = false;\n          continue;\n        }\n\n        let allItemsValid = true;\n        data.forEach((item, index) => {\n          const result = tv4.validateMultiple(item, schema);\n          if (!result.valid) {\n            console.error(`❌ Invalid item [${index}] in: ${file}`);\n            result.errors.forEach((error) =>\n              console.error(`  - ${error.message} at ${error.dataPath}`)\n            );\n            allItemsValid = false;\n          }\n        });\n\n        if (allItemsValid) {\n          console.log(`✅ Valid: ${file} (${data.length} items)`);\n        } else {\n          isValid = false;\n        }\n      } else {\n        // Single object validation\n        const result = tv4.validateMultiple(data, schema);\n\n        if (result.valid) {\n          console.log(`✅ Valid: ${file}`);\n        } else {\n          
console.error(`❌ Invalid: ${file}`);\n          result.errors.forEach((error) =>\n            console.error(`  - ${error.message} at ${error.dataPath}`)\n          );\n          isValid = false;\n        }\n      }\n    } catch (error) {\n      console.error(`Error processing file: ${file}`);\n      console.error(error);\n      isValid = false;\n    }\n  }\n\n  return isValid;\n}\n\nconsole.log(\"🔍 Validating LLM Stats Data Structure...\\n\");\nconsole.log(\"=\".repeat(60));\nconsole.log(\"Phase 1: Schema Validation\");\nconsole.log(\"=\".repeat(60));\n\n// Validate all data types\nconst validations = [\n  // Core entities\n  {\n    schema: \"organization\",\n    pattern: \"data/organizations/*/organization.json\",\n  },\n  {\n    schema: \"model\",\n    pattern: \"data/organizations/*/models/*/model.json\",\n  },\n  { schema: \"license\", pattern: \"data/licenses/*.json\" },\n  { schema: \"benchmark\", pattern: \"data/benchmarks/*.json\" },\n  { schema: \"provider\", pattern: \"data/providers/*/provider.json\" },\n\n  // Arrays\n  {\n    schema: \"benchmark-results\",\n    pattern: \"data/organizations/*/models/*/benchmarks.json\",\n    isArray: true,\n  },\n  {\n    schema: \"provider-models\",\n    pattern: \"data/providers/*/models.json\",\n    isArray: true,\n  },\n];\n\nlet allValid = true;\n\nfor (const { schema, pattern, isArray } of validations) {\n  const isValid = validateSchema(schema, pattern, isArray);\n  allValid = allValid && isValid;\n}\n\nif (allValid) {\n  console.log(\"\\n✅ All schemas are valid! 
🎉\");\n\n  // Run integrity validation\n  console.log(\"\\n\" + \"=\".repeat(60));\n  console.log(\"Phase 2: Data Integrity Validation\");\n  console.log(\"=\".repeat(60));\n\n  const IntegrityValidator = require(\"./integrity-validator.js\");\n  const integrityValidator = new IntegrityValidator();\n\n  integrityValidator\n    .validate()\n    .then((integrityValid) => {\n      if (integrityValid) {\n        console.log(\"\\n🎉 All validations passed successfully!\");\n        process.exit(0);\n      } else {\n        console.error(\"\\n❌ Data integrity validation failed.\");\n        process.exit(1);\n      }\n    })\n    .catch((error) => {\n      // Report unexpected failures (e.g. unreadable data files) and fail the run\n      console.error(\"\\n❌ Data integrity validation crashed.\");\n      console.error(error);\n      process.exit(1);\n    });\n} else {\n  console.error(\"\\n❌ Schema validation failed.\");\n  process.exit(1);\n}\n"
  }
]