Repository: meta-llama/llama-models
Branch: main
Commit: 0e0b8c519242
Files: 105
Total size: 6.3 MB

Directory structure:
gitextract_941i2ph3/

├── .github/
│   ├── CODEOWNERS
│   └── workflows/
│       └── publish-to-test-pypi.yml
├── .gitignore
├── .pre-commit-config.yaml
├── .ruff.toml
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── README.md
├── SECURITY.md
├── docs/
│   └── license_header.txt
├── models/
│   ├── __init__.py
│   ├── checkpoint.py
│   ├── cli/
│   │   ├── __init__.py
│   │   ├── describe.py
│   │   ├── download.py
│   │   ├── list.py
│   │   ├── llama.py
│   │   ├── prompt_format.py
│   │   ├── remove.py
│   │   ├── safety_models.py
│   │   ├── subcommand.py
│   │   ├── table.py
│   │   ├── utils.py
│   │   └── verify_download.py
│   ├── datatypes.py
│   ├── llama2/
│   │   ├── LICENSE
│   │   ├── MODEL_CARD.md
│   │   └── USE_POLICY.md
│   ├── llama3/
│   │   ├── LICENSE
│   │   ├── MODEL_CARD.md
│   │   ├── USE_POLICY.md
│   │   ├── __init__.py
│   │   ├── args.py
│   │   ├── chat_format.py
│   │   ├── generation.py
│   │   ├── model.py
│   │   ├── multimodal/
│   │   │   ├── __init__.py
│   │   │   ├── encoder_utils.py
│   │   │   ├── image_transform.py
│   │   │   ├── model.py
│   │   │   └── utils.py
│   │   ├── quantization/
│   │   │   └── loader.py
│   │   ├── requirements.txt
│   │   ├── scripts/
│   │   │   ├── __init__.py
│   │   │   ├── chat_completion.py
│   │   │   └── completion.py
│   │   ├── tests/
│   │   │   └── api/
│   │   │       ├── test_generation.py
│   │   │       ├── test_tokenizer.py
│   │   │       └── test_tool_utils.py
│   │   ├── tokenizer.model
│   │   ├── tokenizer.py
│   │   └── tool_utils.py
│   ├── llama3_1/
│   │   ├── LICENSE
│   │   ├── MODEL_CARD.md
│   │   ├── USE_POLICY.md
│   │   ├── eval_details.md
│   │   └── prompt_format.md
│   ├── llama3_2/
│   │   ├── LICENSE
│   │   ├── MODEL_CARD.md
│   │   ├── MODEL_CARD_VISION.md
│   │   ├── USE_POLICY.md
│   │   ├── eval_details.md
│   │   ├── text_prompt_format.md
│   │   └── vision_prompt_format.md
│   ├── llama3_3/
│   │   ├── LICENSE
│   │   ├── MODEL_CARD.md
│   │   ├── USE_POLICY.md
│   │   ├── eval_details.md
│   │   └── prompt_format.md
│   ├── llama4/
│   │   ├── LICENSE
│   │   ├── MODEL_CARD.md
│   │   ├── USE_POLICY.md
│   │   ├── __init__.py
│   │   ├── args.py
│   │   ├── chat_format.py
│   │   ├── datatypes.py
│   │   ├── ffn.py
│   │   ├── generation.py
│   │   ├── model.py
│   │   ├── moe.py
│   │   ├── preprocess.py
│   │   ├── prompt_format.md
│   │   ├── quantization/
│   │   │   ├── __init__.py
│   │   │   └── loader.py
│   │   ├── scripts/
│   │   │   ├── chat_completion.py
│   │   │   ├── completion.py
│   │   │   └── quantize.py
│   │   ├── tests/
│   │   │   ├── __init__.py
│   │   │   └── api/
│   │   │       ├── __init__.py
│   │   │       └── test_chat_format.py
│   │   ├── tokenizer.model
│   │   ├── tokenizer.py
│   │   └── vision/
│   │       ├── embedding.py
│   │       └── encoder.py
│   ├── quantize_impls.py
│   ├── sku_list.py
│   ├── sku_types.py
│   ├── tokenizer_utils.py
│   └── utils/
│       ├── __init__.py
│       ├── config.py
│       └── model_utils.py
├── pyproject.toml
└── requirements.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/CODEOWNERS
================================================
# Each line is a file pattern followed by one or more owners.

# These owners will be the default owners for everything in
# the repo. Unless a later match takes precedence,
* @ashwinb @yanxi0830 @hardikjshah @dltn @raghotham @ehhuang


================================================
FILE: .github/workflows/publish-to-test-pypi.yml
================================================
name: Publish Python 🐍 distribution 📦 to TestPyPI

on:
  repository_dispatch:  # on trigger from llama-stack
    types: [build-models-package]
  
  workflow_dispatch:  # Keep manual trigger
    inputs:
      version:
        description: 'Version number (e.g. 0.0.63.dev20250111)'
        required: true
        type: string

jobs:
  build:
    name: Build distribution 📦
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4
      with:
        persist-credentials: false
    - name: Get date
      id: date
      run: echo "date=$(date +'%Y%m%d')" >> $GITHUB_OUTPUT
    - name: Update version for repository_dispatch
      if: github.event_name == 'repository_dispatch' && github.event.client_payload.source == 'llama-stack-nightly'
      run: |
        sed -i 's/version="\([^"]*\)"/version="${{ github.event.client_payload.version }}"/' setup.py
    - name: Update version for manual RC
      if: github.event_name == 'workflow_dispatch'
      run: |
        sed -i 's/version="\([^"]*\)"/version="${{ inputs.version }}"/' setup.py
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: "3.11"
    - name: Install pypa/build
      run: >-
        python3 -m
        pip install
        build
        --user
    - name: Build a binary wheel and a source tarball
      run: python3 -m build
    - name: Store the distribution packages
      uses: actions/upload-artifact@v4
      with:
        name: python-package-distributions
        path: dist/

  publish-to-testpypi:
    name: Publish Python 🐍 distribution 📦 to TestPyPI
    needs:
    - build
    runs-on: ubuntu-latest

    environment:
      name: testrelease
      url: https://test.pypi.org/p/llama-models

    permissions:
      id-token: write  # IMPORTANT: mandatory for trusted publishing

    steps:
    - name: Download all the dists
      uses: actions/download-artifact@v4
      with:
        name: python-package-distributions
        path: dist/
    - name: Publish distribution 📦 to TestPyPI
      uses: pypa/gh-action-pypi-publish@release/v1
      with:
        repository-url: https://test.pypi.org/legacy/


================================================
FILE: .gitignore
================================================
__pycache__
dist
*.egg-info
build
.DS_Store
.vscode/


================================================
FILE: .pre-commit-config.yaml
================================================
exclude: 'build/'

default_language_version:
    python: python3

repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0  # Latest stable version
    hooks:
    -   id: check-merge-conflict
    -   id: check-added-large-files
        args: ['--maxkb=1000']
    -   id: end-of-file-fixer
        exclude: '^(.*\.svg)$'

# Temporarily disabling this
#    -   id: no-commit-to-branch
#        args: ['--branch=main']

-   repo: https://github.com/Lucas-C/pre-commit-hooks
    rev: v1.5.4
    hooks:
    -   id: insert-license
        files: \.py$|\.sh$
        args:
          - --license-filepath
          - docs/license_header.txt

-   repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.9.4
    hooks:
    -   id: ruff
        args: [
            --fix,
            --exit-non-zero-on-fix
        ]
    -   id: ruff-format

-   repo: https://github.com/adamchainz/blacken-docs
    rev: 1.19.0
    hooks:
    -   id: blacken-docs
        additional_dependencies:
        - black==24.3.0

-   repo: https://github.com/astral-sh/uv-pre-commit
    rev: 0.5.26
    hooks:
    -   id: uv-export
        args: ["--frozen", "--no-hashes", "--no-emit-project"]

# -   repo: https://github.com/pre-commit/mirrors-mypy
#     rev: v1.14.0
#     hooks:
#     -   id: mypy
#         additional_dependencies:
#           - types-requests
#           - types-setuptools
#           - pydantic
#         args: [--ignore-missing-imports]

# - repo: https://github.com/jsh9/pydoclint
#   rev: d88180a8632bb1602a4d81344085cf320f288c5a
#   hooks:
#     - id: pydoclint
#       args: [--config=pyproject.toml]

# - repo: https://github.com/tcort/markdown-link-check
#   rev: v3.11.2
#   hooks:
#     - id: markdown-link-check
#       args: ['--quiet']

# -   repo: local
#     hooks:
#       - id: distro-codegen
#         name: Distribution Template Codegen
#         additional_dependencies:
#           - rich
#           - pydantic
#         entry: python -m llama_stack.scripts.distro_codegen
#         language: python
#         pass_filenames: false
#         require_serial: true
#         files: ^llama_stack/templates/.*$
#         stages: [manual]

ci:
    autofix_commit_msg: 🎨 [pre-commit.ci] Auto format from pre-commit.com hooks
    autoupdate_commit_msg: ⬆ [pre-commit.ci] pre-commit autoupdate


================================================
FILE: .ruff.toml
================================================
# Suggested config from pytorch that we can adapt
lint.select = ["B", "C", "E" , "F" , "N", "W", "B9"]

line-length = 120

# C408 ignored because we like the dict keyword argument syntax
# E501 is not flexible enough, we're using B950 instead
# N812 ignored because import torch.nn.functional as F is PyTorch convention
# N817 ignored because importing using acronyms is convention (DistributedDataParallel as DDP)
# E731 allow usage of assigning lambda expressions
# E701 let black auto-format statements on one line
# E704 let black auto-format statements on one line
lint.ignore = [
    "E203", "E305", "E402", "E501", "E721", "E741", "F405", "F821", "F841",
    "C408", "E302", "W291", "E303", "N812", "N817", "E731", "E701",
    # These are the additional ones we started ignoring after moving to ruff. We should look into each one of them later.
    "C901", "C405", "C414", "N803", "N999", "C403", "C416", "B028", "C419", "C401", "B023",
    # shebang has extra meaning in fbcode lints, so I think it's not worth trying
    # to line this up with executable bit
    "EXE001",
    # random naming hints don't need
    "N802",
    # these ignores are from flake8-bugbear; please fix!
    "B007", "B008"
]

exclude = [
    "./.git",
    "./docs/*",
    "./build",
    "./scripts",
    "./venv",
    "*.pyi",
    ".pre-commit-config.yaml",
    "*.md",
    ".flake8"
]


================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

This Code of Conduct also applies outside the project spaces when there is a
reasonable belief that an individual's behavior may have a negative impact on
the project or its community.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <opensource-conduct@meta.com>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq


================================================
FILE: CONTRIBUTING.md
================================================
# Contributing to Llama-Models
We want to make contributing to this project as easy and transparent as
possible.

## Pull Requests
We actively welcome your pull requests.

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").

## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Meta's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

Meta has a [bounty program](http://facebook.com/whitehat/info) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## Coding Style  
* 2 spaces for indentation rather than tabs
* 80 character line length
* ...

## License
By contributing to Llama, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.


================================================
FILE: LICENSE
================================================
https://github.com/meta-llama/llama-models/blob/main/README.md#llama-models-1


================================================
FILE: MANIFEST.in
================================================
include pyproject.toml
include README.md
include models/llama3/tokenizer.model
include models/llama4/tokenizer.model
include models/resources/dog.jpg
include models/resources/pasta.jpeg
include models/llama3_1/prompt_format.md
include models/llama3_2/text_prompt_format.md
include models/llama3_2/vision_prompt_format.md
include llama_models/llama3/tokenizer.model
include llama_models/llama4/tokenizer.model
include llama_models/resources/dog.jpg
include llama_models/resources/pasta.jpeg
include llama_models/llama3_1/prompt_format.md
include llama_models/llama3_2/text_prompt_format.md
include llama_models/llama3_2/vision_prompt_format.md


================================================
FILE: README.md
================================================
<p align="center">
  <img src="/Llama_Repo.jpeg" width="400"/>
</p>

<p align="center">
  🤗 <a href="https://huggingface.co/meta-Llama">Models on Hugging Face</a>&nbsp; | <a href="https://ai.meta.com/blog/">Blog</a>&nbsp; | <a href="https://llama.meta.com/">Website</a>&nbsp; | <a href="https://llama.meta.com/get-started/">Get Started</a>&nbsp; | <a href="https://github.com/meta-llama/llama-cookbook">Llama Cookbook</a>
</p>
<br>

---

# Llama Models

Llama is an accessible, open large language model (LLM) designed for developers, researchers, and businesses to build, experiment, and responsibly scale their generative AI ideas. Part of a foundational system, it serves as a bedrock for innovation in the global community. A few key aspects:
1. **Open access**: Easy access to cutting-edge large language models, fostering collaboration and advancement among developers, researchers, and organizations.
2. **Broad ecosystem**: Llama models have been downloaded hundreds of millions of times, thousands of community projects are built on Llama, and platform support is broad, from cloud providers to startups. The world is building with Llama!
3. **Trust & safety**: Llama models are part of a comprehensive approach to trust and safety: we release models and tools designed to enable community collaboration and to encourage the standardization of the development and usage of trust and safety tools for generative AI.

Our mission is to empower individuals and industry through this opportunity while fostering an environment of discovery and ethical AI advancements. The model weights are licensed for researchers and commercial entities, upholding the principles of openness.

## Llama Models

[![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-models)](https://pypi.org/project/llama-models/)
[![Discord](https://img.shields.io/discord/1257833999603335178)](https://discord.gg/TZAAYNVtrU)

|  **Model** | **Launch date** | **Model sizes** | **Context Length** | **Tokenizer** | **Acceptable use policy**  |  **License** | **Model Card** |
| :----: | :----: | :----: | :----:|:----:|:----:|:----:|:----:|
| Llama 2 | 7/18/2023 | 7B, 13B, 70B | 4K | Sentencepiece | [Use Policy](models/llama2/USE_POLICY.md) | [License](models/llama2/LICENSE) | [Model Card](models/llama2/MODEL_CARD.md) |
| Llama 3 | 4/18/2024 | 8B, 70B | 8K | TikToken-based | [Use Policy](models/llama3/USE_POLICY.md) | [License](models/llama3/LICENSE) | [Model Card](models/llama3/MODEL_CARD.md) |
| Llama 3.1 | 7/23/2024 | 8B, 70B, 405B | 128K | TikToken-based | [Use Policy](models/llama3_1/USE_POLICY.md) | [License](models/llama3_1/LICENSE) | [Model Card](models/llama3_1/MODEL_CARD.md) |
| Llama 3.2 | 9/25/2024 | 1B, 3B | 128K | TikToken-based | [Use Policy](models/llama3_2/USE_POLICY.md) | [License](models/llama3_2/LICENSE) | [Model Card](models/llama3_2/MODEL_CARD.md) |
| Llama 3.2-Vision | 9/25/2024 | 11B, 90B | 128K | TikToken-based | [Use Policy](models/llama3_2/USE_POLICY.md) | [License](models/llama3_2/LICENSE) | [Model Card](models/llama3_2/MODEL_CARD_VISION.md) |
| Llama 3.3 | 12/04/2024 | 70B | 128K | TikToken-based | [Use Policy](models/llama3_3/USE_POLICY.md) | [License](models/llama3_3/LICENSE) | [Model Card](models/llama3_3/MODEL_CARD.md) |
| Llama 4 | 4/5/2025 | Scout-17B-16E, Maverick-17B-128E | 10M, 1M | TikToken-based | [Use Policy](models/llama4/USE_POLICY.md) | [License](models/llama4/LICENSE) | [Model Card](models/llama4/MODEL_CARD.md) |

## Download

To download the model weights and tokenizer:

1. Visit the [Meta Llama website](https://llama.meta.com/llama-downloads/).
2. Read and accept the license.
3. Once your request is approved you will receive a signed URL via email.
4. Install the Llama Models CLI: `pip install llama-models`. (**<-- Start Here if you have received an email already.**)
5. Run `llama-model list` to show the latest available models and determine the model ID you wish to download. **NOTE**: If you want older versions of models, run `llama-model list --show-all` to show all the available Llama models.

6. Run: `llama-model download --source meta --model-id CHOSEN_MODEL_ID`
7. Pass the URL provided when prompted to start the download.

Remember that the links expire after 24 hours and a certain number of downloads. You can always re-request a link if you start seeing errors such as `403: Forbidden`.

### CLI Commands Reference

Once installed, the `llama-model` CLI provides the following commands:

```bash
llama-model list              # List available models
llama-model list --show-all   # List all models (including older versions)
llama-model describe -m MODEL_ID     # Show detailed information about a model
llama-model download          # Download models from Meta or Hugging Face
llama-model verify-download   # Verify integrity of downloaded models
llama-model remove -m MODEL_ID       # Remove a downloaded model
llama-model prompt-format -m MODEL_ID  # Show the prompt format for a model
```

For detailed help on any command, run `llama-model COMMAND --help`.
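
When scripting downloads, the commands above can be assembled as argument lists and handed to `subprocess.run`. The following is a minimal sketch and not part of the `llama-models` package: the helper names `list_cmd`, `download_cmd`, and `run` are hypothetical, and only flags that appear in this README are used. The only `--source` value shown in this README is `meta`, so it is the default here.

```python
import shlex
import subprocess
from typing import List


def list_cmd(show_all: bool = False) -> List[str]:
    """Build the `llama-model list` command (hypothetical helper)."""
    cmd = ["llama-model", "list"]
    if show_all:
        # Include older model versions, as described in the download steps.
        cmd.append("--show-all")
    return cmd


def download_cmd(model_id: str, source: str = "meta") -> List[str]:
    """Build the `llama-model download` command (hypothetical helper)."""
    return ["llama-model", "download", "--source", source, "--model-id", model_id]


def run(cmd: List[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the command here; call run(...) to actually execute it.
    print(shlex.join(download_cmd("CHOSEN_MODEL_ID")))
```

Passing the arguments as a list avoids shell quoting issues. The download command still prompts for the signed URL, so it is best run from an interactive terminal.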

## Running the models

In order to run the models, you will need to install dependencies after checking out the repository.

```bash
# Run this within a suitable Python environment (uv, conda, or virtualenv)
pip install .[torch]
```

Example scripts are available in the `models/{ llama3, llama4 }/scripts/` sub-directories. Note that the Llama 4 series of models requires at least 4 GPUs to run inference at full (bf16) precision.

```bash
#!/bin/bash

NGPUS=4
CHECKPOINT_DIR=~/.llama/checkpoints/Llama-4-Scout-17B-16E-Instruct
PYTHONPATH=$(git rev-parse --show-toplevel) \
  torchrun --nproc_per_node=$NGPUS \
  -m models.llama4.scripts.chat_completion $CHECKPOINT_DIR \
  --world_size $NGPUS
```

The above script should be used with an Instruct (Chat) model. For a Base model, update the `CHECKPOINT_DIR` path and use the script `models.llama4.scripts.completion`.


## Running inference with FP8 and Int4 Quantization

You can reduce the memory footprint of the models at the cost of minimal loss in accuracy by running inference with FP8 or Int4 quantization. Use the `--quantization-mode` flag to specify the quantization mode. There are two modes:
- `fp8_mixed`: Mixed precision inference with FP8 for some weights and bfloat16 for activations.
- `int4_mixed`: Mixed precision inference with Int4 for some weights and bfloat16 for activations.

Using FP8, running Llama-4-Scout-17B-16E-Instruct requires 2 GPUs with 80GB of memory. Using Int4, you need a single GPU with 80GB of memory.

```bash
MODE=fp8_mixed  # or int4_mixed
if [ "$MODE" = "fp8_mixed" ]; then
  NGPUS=2
else
  NGPUS=1
fi
CHECKPOINT_DIR=~/.llama/checkpoints/Llama-4-Scout-17B-16E-Instruct
PYTHONPATH=$(git rev-parse --show-toplevel) \
  torchrun --nproc_per_node=$NGPUS \
  -m models.llama4.scripts.chat_completion $CHECKPOINT_DIR \
  --world_size $NGPUS \
  --quantization-mode $MODE
```
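
The GPU counts quoted above can be sanity-checked with a back-of-the-envelope estimate of weight memory alone. This is a rough sketch under stated assumptions, not a figure from this repository. It assumes Llama 4 Scout has about 109B total parameters (17B is the active count per token), and it treats every weight as stored in the named format. The mixed modes quantize only some weights, so the results are optimistic lower bounds that ignore activations and the KV cache.

```python
import math

# Assumption (not stated in this README): roughly 109B total parameters
# across the 16 experts of Llama 4 Scout.
TOTAL_PARAMS = 109e9
GPU_MEMORY_GB = 80  # per-GPU memory quoted in the README

# Bytes per weight if *every* weight used the given format.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8_mixed": 1.0, "int4_mixed": 0.5}


def weight_gb(mode: str) -> float:
    """Approximate size of the weights in GB for a storage format."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[mode] / 1e9


def min_gpus(mode: str) -> int:
    """Lower bound on the number of GPUs needed to hold the weights."""
    return math.ceil(weight_gb(mode) / GPU_MEMORY_GB)


if __name__ == "__main__":
    for mode in BYTES_PER_PARAM:
        print(f"{mode}: ~{weight_gb(mode):.1f} GB of weights, at least {min_gpus(mode)} GPU(s)")
```

The README's figures (4 GPUs for bf16, 2 for FP8, 1 for Int4) are at or above these bounds. The extra GPU in the bf16 case is plausibly headroom for activations, the KV cache, and even sharding, though the README does not say so.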


For more flexibility in running inference (including using other providers), please see the [`Llama Stack`](https://github.com/meta-llama/llama-stack) toolset.


## Access to Hugging Face

We also provide downloads on [Hugging Face](https://huggingface.co/meta-llama), in both transformers and native `llama4` formats. To download the weights from Hugging Face, please follow these steps:

- Visit one of the repos, for example [meta-llama/Llama-4-Scout-17B-16E](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E).
- Read and accept the license. Once your request is approved, you'll be granted access to all Llama 3.1 models as well as previous versions. Note that requests used to take up to one hour to get processed.
- To download the original native weights to use with this repo, click on the "Files and versions" tab and download the contents of the `original` folder. You can also download them from the command line if you `pip install huggingface-hub`:

```bash
huggingface-cli download meta-llama/Llama-4-Scout-17B-16E-Instruct-Original --local-dir meta-llama/Llama-4-Scout-17B-16E-Instruct-Original
```

- To use with transformers, the following snippet will download and cache the weights:

  ```python
  # inference.py
  from transformers import AutoTokenizer, Llama4ForConditionalGeneration
  import torch

  model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

  tokenizer = AutoTokenizer.from_pretrained(model_id)

  messages = [
      {"role": "user", "content": "Who are you?"},
  ]
  inputs = tokenizer.apply_chat_template(
      messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
  )

  model = Llama4ForConditionalGeneration.from_pretrained(
      model_id, device_map="auto", torch_dtype=torch.bfloat16
  )

  outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
  outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1] :])
  print(outputs[0])
  ```
  ```bash
  torchrun --nnodes=1 --nproc_per_node=8 inference.py
  ```

## Installations

You can install this repository as a [package](https://pypi.org/project/llama-models/) by running `pip install llama-models`.

## Responsible Use

Llama models are a new technology that carries potential risks with use. Testing conducted to date has not — and could not — cover all scenarios.
To help developers address these risks, we have created the [Responsible Use Guide](https://ai.meta.com/static-resource/responsible-use-guide/).

## Issues

Please report any software “bug” or other problems with the models through one of the following means:
- Reporting issues with the model: [https://github.com/meta-llama/llama-models/issues](https://github.com/meta-llama/llama-models/issues)
- Reporting risky content generated by the model: [developers.facebook.com/llama_output_feedback](http://developers.facebook.com/llama_output_feedback)
- Reporting bugs and security concerns: [facebook.com/whitehat/info](http://facebook.com/whitehat/info)


## Questions

For common questions, the FAQ can be found [here](https://llama.meta.com/faq), which will be updated over time as new questions arise.


================================================
FILE: SECURITY.md
================================================
# Security Policy

## Reporting a Vulnerability

Please report vulnerabilities to our bug bounty program at https://bugbounty.meta.com/


================================================
FILE: docs/license_header.txt
================================================
Copyright (c) Meta Platforms, Inc. and affiliates.
All rights reserved.

This source code is licensed under the terms described in the LICENSE file in
top-level folder for each specific model found within the models/ directory at
the top-level of this source tree.


================================================
FILE: models/__init__.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.


================================================
FILE: models/checkpoint.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

import concurrent.futures
import re
from pathlib import Path
from typing import Any, Dict, List, Optional, Union

import numpy as np
import torch
from fairscale.nn.model_parallel.initialize import get_model_parallel_rank, get_model_parallel_world_size


def map_mp_rank(old_mp_size: int, new_mp_size: int, new_mp_rank: int) -> List[int]:
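    """Map a new MP rank to a list of old MP ranks given a change in MP size.

    If the new MP size is a multiple of the old one, each new rank reads a single
    old shard, which is later split. If the old MP size is a multiple of the new
    one, each new rank merges a contiguous range of old shards.

    Examples:
        >>> map_mp_rank(old_mp_size=2, new_mp_size=8, new_mp_rank=5)
        [1]
        >>> map_mp_rank(old_mp_size=8, new_mp_size=2, new_mp_rank=1)
        [4, 5, 6, 7]
    """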
    if new_mp_size % old_mp_size == 0:
        # Read old MP shard and split it into smaller ones
        return [new_mp_rank * old_mp_size // new_mp_size]
    elif old_mp_size % new_mp_size == 0:
        # Merge old MP shards into a single one
        mp_factor = old_mp_size // new_mp_size
        return list(range(new_mp_rank * mp_factor, (new_mp_rank + 1) * mp_factor))
    else:
        raise ValueError(
            f"Either old MP size or new MP size should be a multiple of the other: "
            f"{old_mp_size} % {new_mp_size} != 0 and {new_mp_size} % {old_mp_size} != 0"
        )


def maybe_reshard_state_dict(
    ckpt_paths: List[Path],
    n_kv_heads: int,
    moe_num_experts: Optional[int] = None,
    map_location: Union[str, torch.device] = "cpu",
    mmap: bool = True,
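) -> Dict[str, torch.Tensor]:
    """
    Load the checkpoint shards needed by the current model-parallel (MP) rank and,
    if the number of shards differs from the current MP world size, reshard them.

    `n_kv_heads` determines how often the KV projection weights are repeated when
    the new MP size exceeds the number of KV heads. If `moe_num_experts` is given,
    MoE weights are converted before resharding.
    """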
    if str(map_location) == "cpu":
        torch.set_default_tensor_type(torch.BFloat16Tensor)
    else:
        torch.set_default_tensor_type(torch.cuda.BFloat16Tensor)

    ckpt_paths = np.array(sorted(ckpt_paths))

    new_mp_size, new_mp_rank = get_model_parallel_world_size(), get_model_parallel_rank()
    old_mp_size = len(ckpt_paths)
    old_mp_ranks = map_mp_rank(old_mp_size, new_mp_size, new_mp_rank)

    print(f"Loading checkpoint shards:\n{str(ckpt_paths[old_mp_ranks])}")  # type: ignore
    paths = ckpt_paths[old_mp_ranks]  # type: ignore
    state_dicts = [torch.load(str(p), map_location=map_location, mmap=mmap) for p in paths]

    if new_mp_size == old_mp_size:
        return state_dicts[0]  # type: ignore

    if moe_num_experts is not None:
        state_dicts = [convert_moe_weights(d, moe_num_experts) for d in state_dicts]

    print(f"Resharding {len(state_dicts)} state dicts from MP size {old_mp_size} to MP size {new_mp_size}")
    return reshard_mp(
        state_dicts,
        size=max(new_mp_size // old_mp_size, 1),
        rank=new_mp_rank % max(new_mp_size // old_mp_size, 1),
        repeat_qk_qv=max(new_mp_size // n_kv_heads, 1),
    )


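# Regex fragments matching row-parallel weights; reshard_mp concatenates or
# slices these along the last dimension.
_WEIGHT_ROW_KEY = {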
    "feed_forward.w2",
    "feed_forward.mlp.fc2",
    "attention.wo",
    "feed_forward.mlp.fc2_weight",
    "feed_forward.w_out_shared_DF.weight",
    "attn.wo.weight",
    "mlp.c_proj.weight",
}
_MOE_WEIGHT_ROW_KEY = {"feed_forward.experts.(moe_w_in_eD_F|moe_w_swiglu_eD_F)"}

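# Regex fragments matching column-parallel weights; reshard_mp concatenates or
# slices most of these along dim 0 (dim -2 for keys containing "w_"), with
# special handling for fused and KV projection weights.
_WEIGHT_COLUMN_KEY = {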
    "output",
    "feed_forward.(w1|w3)",
    "feed_forward.mlp.(fc1|fc3)",
    "feed_forward.mlp.fc1_weight",
    "attention.(wk|wq|wv|wqkv).weight",
    "feed_forward.(w_in_shared_FD|w_swiglu_FD)",
    "attn.(wk|wq|wv).weight",
    "attn.(wk|wq|wv).bias",
    "mlp.c_fc.weight",
    "mlp.c_fc.bias",
    "conv1._linear.weight",
    "tok_embeddings.weight",
    "vision_projection.weight",
}
_MOE_WEIGHT_COLUMN_KEY = {"feed_forward.experts.moe_w_out_eF_D"}


def reshard_mp(
    state_dicts: List[Dict[str, torch.Tensor]],
    size: int,
    rank: int,
    repeat_qk_qv: int = 1,
) -> Dict[str, torch.Tensor]:
    """
    Reshard a list of state dicts into a single state dict given a change in MP size.
    If the list has more than one state dict, we concatenate the values of the same
    key across all state dicts. Otherwise, we just slice it for the current MP rank.
    """

    def concat_or_chunk(tensors: List[torch.Tensor], dim: int) -> torch.Tensor:
        if len(tensors) > 1:
            return torch.cat(tensors, dim=dim)
        return tensors[0].chunk(size, dim=dim)[rank].clone()

    def process_key(key: str) -> torch.Tensor:
        if row_regex.search(key):
            return concat_or_chunk([s[key] for s in state_dicts], dim=-1)
        elif column_regex.search(key):
            if "w13" in key or "fc1_weight" in key:
                dims = state_dicts[0][key].size()
                values = [s[key].view(2, dims[0] // 2, *dims[1:]) for s in state_dicts]
                return concat_or_chunk(values, dim=1).flatten(0, 1)
            elif "qkv" in key:
                q_dim = state_dicts[0][key.replace("qkv", "o")].size(1)
                kv_dim = (state_dicts[0][key].size(0) - q_dim) // 2
                values = [s[key].split((q_dim, kv_dim, kv_dim)) for s in state_dicts]
                return torch.cat([concat_or_chunk(x, dim=0) for x in zip(*values)])  # type: ignore
            elif "wk.weight" in key or "wv.weight" in key:
                # Support MP > #kv_head
                return concat_or_chunk([s[key].repeat(repeat_qk_qv, 1) for s in state_dicts], dim=0)
            elif key == "output.bias" or key == "fc.weight":
                return concat_or_chunk([s[key] for s in state_dicts], dim=0)
            elif "w_" in key:
                return concat_or_chunk([s[key] for s in state_dicts], dim=-2)
            else:
                return concat_or_chunk([s[key] for s in state_dicts], dim=0)
        else:
            return state_dicts[0][key].clone()

    row_keys = _WEIGHT_ROW_KEY | _MOE_WEIGHT_ROW_KEY
    column_keys = _WEIGHT_COLUMN_KEY | _MOE_WEIGHT_COLUMN_KEY

    column_regex = re.compile("|".join(column_keys))
    row_regex = re.compile("|".join(row_keys))

    output: Dict[str, torch.Tensor] = {}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Note: only processes keys in the first state dict.
        # Assumes keys are the same across all state dicts.
        mappings = {executor.submit(process_key, key): key for key in state_dicts[0]}
        for future in concurrent.futures.as_completed(mappings):
            output[mappings[future]] = future.result()
    return output


def convert_moe_weights(state_dict: Dict[str, Any], num_experts: int) -> Dict[str, Any]:
    routed_keys = _MOE_WEIGHT_ROW_KEY | _MOE_WEIGHT_COLUMN_KEY
    routed_regex = re.compile("|".join(routed_keys))
    keys = list(state_dict.keys())
    for key in keys:
        if routed_regex.search(key):
            state_dict[key] = state_dict.pop(key).unflatten(0, (num_experts, -1)).squeeze(dim=0)
    return state_dict
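

# Illustrative sketch (not part of the original module): how key patterns like the ones
# above are joined into one regex and used to pick the dimension along which a weight is
# sharded. The helper name `_example_shard_dim` and the trimmed-down pattern sets are
# hypothetical. The special cases in `process_key` (fused w13/qkv weights, repeated KV
# heads, MoE "w_" keys) are deliberately left out.
def _example_shard_dim(key):
    """Return -1 for row-parallel keys, 0 for column-parallel keys, None if replicated."""
    import re

    # As in the patterns above, "." is an unescaped regex wildcard, not a literal dot.
    row_regex = re.compile("|".join(["feed_forward.w2", "attention.wo"]))
    column_regex = re.compile("|".join(["feed_forward.(w1|w3)", "tok_embeddings.weight"]))

    if row_regex.search(key):
        return -1  # split or concatenate along the last dimension
    if column_regex.search(key):
        return 0  # split or concatenate along the first dimension
    return None  # e.g. norm weights are copied unchanged from the first shard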


================================================
FILE: models/cli/__init__.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.


================================================
FILE: models/cli/describe.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

import argparse
import json

from llama_models.cli.subcommand import Subcommand
from llama_models.cli.table import print_table
from llama_models.sku_list import resolve_model


class Describe(Subcommand):
    """Show details about a model"""

    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "describe",
            prog="llama-model describe",
            description="Show details about a llama model",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        self._add_arguments()
        self.parser.set_defaults(func=self._run_model_describe_cmd)

    def _add_arguments(self):
        self.parser.add_argument(
            "-m",
            "--model-id",
            type=str,
            required=True,
            help="See `llama-model list` or `llama-model list --show-all` for the list of available models",
        )

    def _run_model_describe_cmd(self, args: argparse.Namespace) -> None:
        from llama_models.cli.safety_models import prompt_guard_model_sku_map

        prompt_guard_model_map = prompt_guard_model_sku_map()
        if args.model_id in prompt_guard_model_map.keys():
            model = prompt_guard_model_map[args.model_id]
        else:
            model = resolve_model(args.model_id)

        if model is None:
            self.parser.error(
                f"Model {args.model_id} not found; try 'llama-model list' for a list of available models."
            )
            return

        headers = [
            "Model",
            model.descriptor(),
        ]

        rows = [
            ("Hugging Face ID", model.huggingface_repo or "<Not Available>"),
            ("Description", model.description),
            ("Context Length", f"{model.max_seq_length // 1024}K tokens"),
            ("Weights format", model.quantization_format.value),
            ("Model params.json", json.dumps(model.arch_args, indent=4)),
        ]

        print_table(
            rows,
            headers,
            separate_rows=True,
        )


================================================
FILE: models/cli/download.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

import argparse
import asyncio
import json
import os
import shutil
import sys
from dataclasses import dataclass
from datetime import UTC, datetime
from functools import partial
from pathlib import Path

import httpx
from pydantic import BaseModel, ConfigDict
from rich.console import Console
from rich.progress import (
    BarColumn,
    DownloadColumn,
    Progress,
    TextColumn,
    TimeRemainingColumn,
    TransferSpeedColumn,
)
from termcolor import cprint

from llama_models.cli.subcommand import Subcommand
from llama_models.sku_list import LlamaDownloadInfo
from llama_models.sku_types import Model


class Download(Subcommand):
    """Llama cli for downloading llama toolchain assets"""

    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "download",
            prog="llama-model download",
            description="Download a model from llama.meta.com or Hugging Face Hub",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        setup_download_parser(self.parser)


def setup_download_parser(parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "--source",
        choices=["meta", "huggingface"],
        default="meta",
    )
    parser.add_argument(
        "--model-id",
        required=False,
        help="See `llama-model list` or `llama-model list --show-all` for the list of available models. Specify multiple model IDs with commas, e.g. --model-id Llama3.2-1B,Llama3.2-3B",
    )
    parser.add_argument(
        "--hf-token",
        type=str,
        required=False,
        default=None,
        help="Hugging Face API token. Needed for gated models like llama2/3. Will also try to read environment variable `HF_TOKEN` as default.",
    )
    parser.add_argument(
        "--meta-url",
        type=str,
        required=False,
        help="For source=meta, URL obtained from llama.meta.com after accepting license terms",
    )
    parser.add_argument(
        "--max-parallel",
        type=int,
        required=False,
        default=3,
        help="Maximum number of concurrent downloads",
    )
    parser.add_argument(
        "--ignore-patterns",
        type=str,
        required=False,
        default="*.safetensors",
        help="""For source=huggingface, files matching any of the patterns are not downloaded. Defaults to ignoring
safetensors files to avoid downloading duplicate weights.
""",
    )
    parser.add_argument(
        "--manifest-file",
        type=str,
        help="For source=meta, you can download models from a manifest file containing a file => URL mapping",
        required=False,
    )
    parser.set_defaults(func=partial(run_download_cmd, parser=parser))


@dataclass
class DownloadTask:
    url: str
    output_file: str
    total_size: int = 0
    downloaded_size: int = 0
    task_id: int | None = None
    retries: int = 0
    max_retries: int = 3


class DownloadError(Exception):
    pass


class CustomTransferSpeedColumn(TransferSpeedColumn):
    def render(self, task):
        if task.finished:
            return "-"
        return super().render(task)


class ParallelDownloader:
    def __init__(
        self,
        max_concurrent_downloads: int = 3,
        buffer_size: int = 1024 * 1024,
        timeout: int = 30,
    ):
        self.max_concurrent_downloads = max_concurrent_downloads
        self.buffer_size = buffer_size
        self.timeout = timeout
        self.console = Console()
        self.progress = Progress(
            TextColumn("[bold blue]{task.description}"),
            BarColumn(bar_width=40),
            "[progress.percentage]{task.percentage:>3.1f}%",
            DownloadColumn(),
            CustomTransferSpeedColumn(),
            TimeRemainingColumn(),
            console=self.console,
            expand=True,
        )
        self.client_options = {
            "timeout": httpx.Timeout(timeout),
            "follow_redirects": True,
        }

    async def retry_with_exponential_backoff(self, task: DownloadTask, func, *args, **kwargs):
        last_exception = None
        for attempt in range(task.max_retries):
            try:
                return await func(*args, **kwargs)
            except Exception as e:
                last_exception = e
                if attempt < task.max_retries - 1:
                    wait_time = min(30, 2**attempt)  # Cap at 30 seconds
                    self.console.print(
                        f"[yellow]Attempt {attempt + 1}/{task.max_retries} failed, "
                        f"retrying in {wait_time} seconds: {str(e)}[/yellow]"
                    )
                    await asyncio.sleep(wait_time)
                    continue
        raise last_exception

    async def get_file_info(self, client: httpx.AsyncClient, task: DownloadTask) -> None:
        if task.total_size > 0:
            self.progress.update(task.task_id, total=task.total_size)
            return

        async def _get_info():
            response = await client.head(task.url, headers={"Accept-Encoding": "identity"}, **self.client_options)
            response.raise_for_status()
            return response

        try:
            response = await self.retry_with_exponential_backoff(task, _get_info)

            task.url = str(response.url)
            task.total_size = int(response.headers.get("Content-Length", 0))

            if task.total_size == 0:
                raise DownloadError(
                    f"Unable to determine file size for {task.output_file}. "
                    "The server did not report a non-zero Content-Length."
                )

            # Update the progress bar's total size once we know it
            if task.task_id is not None:
                self.progress.update(task.task_id, total=task.total_size)

        except httpx.HTTPError as e:
            self.console.print(f"[red]Error getting file info: {str(e)}[/red]")
            raise

    def verify_file_integrity(self, task: DownloadTask) -> bool:
        if not os.path.exists(task.output_file):
            return False
        return os.path.getsize(task.output_file) == task.total_size

    async def download_chunk(self, client: httpx.AsyncClient, task: DownloadTask, start: int, end: int) -> None:
        async def _download_chunk():
            headers = {"Range": f"bytes={start}-{end}"}
            async with client.stream("GET", task.url, headers=headers, **self.client_options) as response:
                response.raise_for_status()

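                # Append mode ignores seek() for writes, so open the file for in-place writing.
                # Truncating at `start` discards partial data left by a failed attempt, so a
                # retry neither duplicates bytes nor double-counts progress.
                mode = "r+b" if os.path.exists(task.output_file) else "wb"
                with open(task.output_file, mode) as file:
                    file.seek(start)
                    file.truncate()
                    task.downloaded_size = start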
                    async for chunk in response.aiter_bytes(self.buffer_size):
                        file.write(chunk)
                        task.downloaded_size += len(chunk)
                        self.progress.update(
                            task.task_id,
                            completed=task.downloaded_size,
                        )

        try:
            await self.retry_with_exponential_backoff(task, _download_chunk)
        except Exception as e:
            raise DownloadError(
                f"Failed to download chunk {start}-{end} after {task.max_retries} attempts: {str(e)}"
            ) from e

    async def prepare_download(self, task: DownloadTask) -> None:
        output_dir = os.path.dirname(task.output_file)
        os.makedirs(output_dir, exist_ok=True)

        if os.path.exists(task.output_file):
            task.downloaded_size = os.path.getsize(task.output_file)

    async def download_file(self, task: DownloadTask) -> None:
        try:
            async with httpx.AsyncClient(**self.client_options) as client:
                await self.get_file_info(client, task)

                # Check if file is already downloaded
                if os.path.exists(task.output_file):
                    if self.verify_file_integrity(task):
                        self.console.print(f"[green]Already downloaded {task.output_file}[/green]")
                        self.progress.update(task.task_id, completed=task.total_size)
                        return

                await self.prepare_download(task)

                try:
                    # Split the remaining download into chunks
                    chunk_size = 27_000_000_000  # Cloudfront max chunk size
                    chunks = []

                    current_pos = task.downloaded_size
                    while current_pos < task.total_size:
                        chunk_end = min(current_pos + chunk_size - 1, task.total_size - 1)
                        chunks.append((current_pos, chunk_end))
                        current_pos = chunk_end + 1

                    # Download chunks in sequence
                    for chunk_start, chunk_end in chunks:
                        await self.download_chunk(client, task, chunk_start, chunk_end)

                except Exception as e:
                    raise DownloadError(f"Download failed: {str(e)}") from e

        except Exception as e:
            self.progress.update(task.task_id, description=f"[red]Failed: {task.output_file}[/red]")
            raise DownloadError(f"Download failed for {task.output_file}: {str(e)}") from e

    def has_disk_space(self, tasks: list[DownloadTask]) -> bool:
        try:
            total_remaining_size = sum(task.total_size - task.downloaded_size for task in tasks)
            dir_path = os.path.dirname(os.path.abspath(tasks[0].output_file))
            free_space = shutil.disk_usage(dir_path).free

            # Add 10% buffer for safety
            required_space = int(total_remaining_size * 1.1)

            if free_space < required_space:
                self.console.print(
                    f"[red]Not enough disk space. Required: {required_space // (1024 * 1024)} MB, "
                    f"Available: {free_space // (1024 * 1024)} MB[/red]"
                )
                return False
            return True

        except Exception as e:
            raise DownloadError(f"Failed to check disk space: {str(e)}") from e

    async def download_all(self, tasks: list[DownloadTask]) -> None:
        if not tasks:
            raise ValueError("No download tasks provided")

        if not os.environ.get("LLAMA_DOWNLOAD_NO_SPACE_CHECK") and not self.has_disk_space(tasks):
            raise DownloadError("Insufficient disk space for downloads")

        failed_tasks = []

        with self.progress:
            for task in tasks:
                desc = f"Downloading {Path(task.output_file).name}"
                task.task_id = self.progress.add_task(desc, total=task.total_size, completed=task.downloaded_size)

            semaphore = asyncio.Semaphore(self.max_concurrent_downloads)

            async def download_with_semaphore(task: DownloadTask):
                async with semaphore:
                    try:
                        await self.download_file(task)
                    except Exception as e:
                        failed_tasks.append((task, str(e)))

            await asyncio.gather(*(download_with_semaphore(task) for task in tasks))

        if failed_tasks:
            self.console.print("\n[red]Some downloads failed:[/red]")
            for task, error in failed_tasks:
                self.console.print(f"[red]- {Path(task.output_file).name}: {error}[/red]")
            raise DownloadError(f"{len(failed_tasks)} downloads failed")


def _hf_download(
    model: "Model",
    hf_token: str,
    ignore_patterns: str,
    parser: argparse.ArgumentParser,
):
    from huggingface_hub import snapshot_download
    from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

    from llama_models.utils.model_utils import model_local_dir

    repo_id = model.huggingface_repo
    if repo_id is None:
        raise ValueError(f"No repo id found for model {model.descriptor()}")

    output_dir = model_local_dir(model.descriptor())
    os.makedirs(output_dir, exist_ok=True)
    try:
        true_output_dir = snapshot_download(
            repo_id,
            local_dir=output_dir,
            ignore_patterns=ignore_patterns,
            token=hf_token,
            library_name="llama-stack",
        )
    except GatedRepoError:
        parser.error(
            "It looks like you are trying to access a gated repository. Please ensure you "
            "have access to the repository and have provided the proper Hugging Face API token "
            "using the option `--hf-token` or by running `huggingface-cli login`. "
            "You can find your token by visiting https://huggingface.co/settings/tokens"
        )
    except RepositoryNotFoundError:
        parser.error(f"Repository '{repo_id}' not found on the Hugging Face Hub or incorrect Hugging Face token.")
    except Exception as e:
        parser.error(str(e))

    print(f"\nSuccessfully downloaded model to {true_output_dir}")


def _meta_download(
    model: "Model",
    model_id: str,
    meta_url: str,
    info: "LlamaDownloadInfo",
    max_concurrent_downloads: int,
):
    from llama_models.utils.model_utils import model_local_dir

    output_dir = Path(model_local_dir(model.descriptor()))
    os.makedirs(output_dir, exist_ok=True)

    # Create download tasks for each file
    tasks = []
    for f in info.files:
        output_file = str(output_dir / f)
        url = meta_url.replace("*", f"{info.folder}/{f}")
        total_size = info.pth_size if "consolidated" in f else 0
        tasks.append(DownloadTask(url=url, output_file=output_file, total_size=total_size, max_retries=3))

    # Initialize and run parallel downloader
    downloader = ParallelDownloader(max_concurrent_downloads=max_concurrent_downloads)
    asyncio.run(downloader.download_all(tasks))

    cprint(f"\nSuccessfully downloaded model to {output_dir}", color="green", file=sys.stderr)
    cprint(
        f"\nView MD5 checksum files at: {output_dir / 'checklist.chk'}",
        file=sys.stderr,
    )
    cprint(
        f"\n[Optionally] To run MD5 checksums, use the following command: llama-model verify-download --model-id {model_id}",
        color="yellow",
        file=sys.stderr,
    )


class ModelEntry(BaseModel):
    model_id: str
    files: dict[str, str]

    model_config = ConfigDict(protected_namespaces=())


class Manifest(BaseModel):
    models: list[ModelEntry]
    expires_on: datetime


def _download_from_manifest(manifest_file: str, max_concurrent_downloads: int):
    from llama_models.utils.model_utils import model_local_dir

    with open(manifest_file) as f:
        d = json.load(f)
        manifest = Manifest(**d)

    if datetime.now(UTC) > manifest.expires_on.astimezone(UTC):
        raise ValueError(f"Manifest URLs have expired on {manifest.expires_on}")

    console = Console()
    for entry in manifest.models:
        console.print(f"[blue]Downloading model {entry.model_id}...[/blue]")
        output_dir = Path(model_local_dir(entry.model_id))
        os.makedirs(output_dir, exist_ok=True)

        if any(output_dir.iterdir()):
            console.print(f"[yellow]Output directory {output_dir} is not empty.[/yellow]")

            while True:
                resp = input("Do you want to (C)ontinue download or (R)estart completely? (continue/restart): ")
                if resp.lower() in ["restart", "r"]:
                    shutil.rmtree(output_dir)
                    os.makedirs(output_dir, exist_ok=True)
                    break
                elif resp.lower() in ["continue", "c"]:
                    console.print("[blue]Continuing download...[/blue]")
                    break
                else:
                    console.print("[red]Invalid response. Please try again.[/red]")

        # Create download tasks for all files in the manifest
        tasks = [
            DownloadTask(url=url, output_file=str(output_dir / fname), max_retries=3)
            for fname, url in entry.files.items()
        ]

        # Initialize and run parallel downloader
        downloader = ParallelDownloader(max_concurrent_downloads=max_concurrent_downloads)
        asyncio.run(downloader.download_all(tasks))


def run_download_cmd(args: argparse.Namespace, parser: argparse.ArgumentParser):
    """Main download command handler"""
    try:
        if args.manifest_file:
            _download_from_manifest(args.manifest_file, args.max_parallel)
            return

        if args.model_id is None:
            parser.error("Please provide a model id")
            return

        # Handle comma-separated model IDs
        model_ids = [model_id.strip() for model_id in args.model_id.split(",")]

        from llama_models.sku_list import llama_meta_net_info, resolve_model

        from .safety_models import (
            prompt_guard_download_info_map,
            prompt_guard_model_sku_map,
        )

        # Bind the lookup tables to distinct names so the imported functions are not shadowed.
        guard_sku_map = prompt_guard_model_sku_map()
        guard_download_info_map = prompt_guard_download_info_map()

        for model_id in model_ids:
            if model_id in guard_sku_map:
                model = guard_sku_map[model_id]
                info = guard_download_info_map[model_id]
            else:
                model = resolve_model(model_id)
                if model is None:
                    parser.error(f"Model {model_id} not found")
                    continue
                info = llama_meta_net_info(model)

            if args.source == "huggingface":
                _hf_download(model, args.hf_token, args.ignore_patterns, parser)
            else:
                meta_url = args.meta_url or input(
                    f"Please provide the signed URL for model {model_id} you received via email "
                    f"after visiting https://www.llama.com/llama-downloads/ "
                    f"(e.g., https://llama3-1.llamameta.net/*?Policy...): "
                )
                if "llamameta.net" not in meta_url:
                    parser.error("Invalid Meta URL provided")
                _meta_download(model, model_id, meta_url, info, args.max_parallel)

    except Exception as e:
        parser.error(f"Download failed: {str(e)}")
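

# Illustrative sketch (not part of the original module): the byte-range planning done
# inline in `ParallelDownloader.download_file`, pulled out as a pure function so the
# arithmetic is easy to check in isolation. The name `_example_plan_ranges` is
# hypothetical; the loop body mirrors the one in `download_file`.
def _example_plan_ranges(downloaded_size, total_size, chunk_size):
    """Return inclusive (start, end) byte ranges covering [downloaded_size, total_size)."""
    ranges = []
    current_pos = downloaded_size
    while current_pos < total_size:
        # HTTP Range headers are inclusive on both ends, hence the "- 1" terms.
        chunk_end = min(current_pos + chunk_size - 1, total_size - 1)
        ranges.append((current_pos, chunk_end))
        current_pos = chunk_end + 1
    return ranges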


================================================
FILE: models/cli/list.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

import argparse
import os
import time
from pathlib import Path

from llama_models.cli.subcommand import Subcommand
from llama_models.cli.table import print_table
from llama_models.utils.config import DEFAULT_CHECKPOINT_DIR
from llama_models.sku_list import all_registered_models


def _get_model_size(model_dir):
    return sum(f.stat().st_size for f in Path(model_dir).rglob("*") if f.is_file())


def _convert_to_model_descriptor(model):
    for m in all_registered_models():
        if model == m.descriptor().replace(":", "-"):
            return str(m.descriptor())
    return str(model)


def _run_model_list_downloaded_cmd() -> None:
    headers = ["Model", "Size", "Modified Time"]

    rows = []
    for model in os.listdir(DEFAULT_CHECKPOINT_DIR):
        abs_path = os.path.join(DEFAULT_CHECKPOINT_DIR, model)
        space_usage = _get_model_size(abs_path)
        model_size = f"{space_usage / (1024**3):.2f} GB"
        modified_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(os.path.getmtime(abs_path)))
        rows.append(
            [
                _convert_to_model_descriptor(model),
                model_size,
                modified_time,
            ]
        )

    print_table(
        rows,
        headers,
        separate_rows=True,
    )


class List(Subcommand):
    """List available llama models"""

    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "list",
            prog="llama-model list",
            description="Show available llama models",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        self._add_arguments()
        self.parser.set_defaults(func=self._run_model_list_cmd)

    def _add_arguments(self):
        self.parser.add_argument(
            "--show-all",
            action="store_true",
            help="Show all models (not just defaults)",
        )
        self.parser.add_argument(
            "--downloaded",
            action="store_true",
            help="List the downloaded models",
        )
        self.parser.add_argument(
            "-s",
            "--search",
            type=str,
            required=False,
            help="Search for the input string as a substring in the model descriptor(ID)",
        )

    def _run_model_list_cmd(self, args: argparse.Namespace) -> None:
        from llama_models.cli.safety_models import prompt_guard_model_skus

        if args.downloaded:
            return _run_model_list_downloaded_cmd()

        headers = [
            "Model Descriptor(ID)",
            "Hugging Face Repo",
            "Context Length",
        ]

        rows = []
        for model in all_registered_models() + prompt_guard_model_skus():
            if not args.show_all and not model.is_featured:
                continue

            descriptor = model.descriptor()
            if not args.search or args.search.lower() in descriptor.lower():
                rows.append(
                    [
                        descriptor,
                        model.huggingface_repo,
                        f"{model.max_seq_length // 1024}K",
                    ]
                )
        if len(rows) == 0:
            print(f"Did not find any model matching `{args.search}`.")
        else:
            print_table(
                rows,
                headers,
                separate_rows=True,
            )


================================================
FILE: models/cli/llama.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

import argparse

from .describe import Describe
from .download import Download
from .list import List
from .prompt_format import PromptFormat
from .remove import Remove
from .utils import print_subcommand_description
from .verify_download import VerifyDownload


class LlamaModelsCLIParser:
    """Defines CLI parser for Llama Models CLI"""

    def __init__(self):
        self.parser = argparse.ArgumentParser(
            prog="llama-model",
            description="Llama Model Management CLI",
            add_help=True,
            formatter_class=argparse.RawTextHelpFormatter,
        )

        # Default command is to print help
        self.parser.set_defaults(func=lambda args: self.parser.print_help())

        subparsers = self.parser.add_subparsers(title="subcommands")

        # Add sub-commands
        List.create(subparsers)
        Describe.create(subparsers)
        Download.create(subparsers)
        PromptFormat.create(subparsers)
        Remove.create(subparsers)
        VerifyDownload.create(subparsers)

        print_subcommand_description(self.parser, subparsers)

    def parse_args(self) -> argparse.Namespace:
        args = self.parser.parse_args()
        if not isinstance(args, argparse.Namespace):
            raise TypeError(f"Expected argparse.Namespace, got {type(args)}")
        return args

    def run(self, args: argparse.Namespace) -> None:
        args.func(args)


def main():
    parser = LlamaModelsCLIParser()
    args = parser.parse_args()
    parser.run(args)


if __name__ == "__main__":
    main()


================================================
FILE: models/cli/prompt_format.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

import argparse
import textwrap
from io import StringIO
from pathlib import Path

from llama_models.cli.subcommand import Subcommand
from llama_models.cli.table import print_table
from llama_models.sku_types import CoreModelId, ModelFamily, is_multimodal, model_family

# Package root: the directory that contains llama3_1/, llama3_2/, etc. (two levels above this file).
ROOT_DIR = Path(__file__).parent.parent


class PromptFormat(Subcommand):
    """Llama model CLI for describing a model's prompt format (message formats)"""

    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "prompt-format",
            prog="llama-model prompt-format",
            description="Show llama model message formats",
            epilog=textwrap.dedent(
                """
                Example:
                    llama-model prompt-format <options>
                """
            ),
            formatter_class=argparse.RawTextHelpFormatter,
        )
        self._add_arguments()
        self.parser.set_defaults(func=self._run_model_template_cmd)

    def _add_arguments(self):
        self.parser.add_argument(
            "-m",
            "--model-name",
            type=str,
            help="Example: Llama3.1-8B or Llama3.2-11B-Vision, etc\n"
            "(Run `llama-model list` to see a list of valid model names)",
        )
        self.parser.add_argument(
            "-l",
            "--list",
            action="store_true",
            help="List all available models",
        )

    def _run_model_template_cmd(self, args: argparse.Namespace) -> None:
        import importlib.resources

        # Only Llama 3.1 and 3.2 are supported
        supported_model_ids = [
            m for m in CoreModelId if model_family(m) in {ModelFamily.llama3_1, ModelFamily.llama3_2}
        ]

        model_list = [m.value for m in supported_model_ids]

        if args.list:
            headers = ["Model(s)"]
            rows = []
            for m in model_list:
                rows.append(
                    [
                        m,
                    ]
                )
            print_table(
                rows,
                headers,
                separate_rows=True,
            )
            return

        try:
            model_id = CoreModelId(args.model_name)
        except ValueError:
            self.parser.error(
                f"{args.model_name} is not a valid Model. Choose one from the list of valid models. "
                f"Run `llama-model list` to see the valid model names."
            )

        if model_id not in supported_model_ids:
            self.parser.error(
                f"{model_id.value} has no prompt format available; only Llama 3.1 and 3.2 models are supported. "
                f"Run `llama-model prompt-format --list` to see the supported model names."
            )

        llama_3_1_file = ROOT_DIR / "llama3_1" / "prompt_format.md"
        llama_3_2_text_file = ROOT_DIR / "llama3_2" / "text_prompt_format.md"
        llama_3_2_vision_file = ROOT_DIR / "llama3_2" / "vision_prompt_format.md"
        # Pick the prompt format document for this model family
        if model_family(model_id) == ModelFamily.llama3_1:
            prompt_file = llama_3_1_file
        elif is_multimodal(model_id):
            prompt_file = llama_3_2_vision_file
        else:
            prompt_file = llama_3_2_text_file

        # read_text() opens and closes the file, so no handle is left open
        with importlib.resources.as_file(prompt_file) as f:
            content = f.read_text()

        render_markdown_to_pager(content)


def render_markdown_to_pager(markdown_content: str):
    from rich.console import Console
    from rich.markdown import Markdown
    from rich.style import Style
    from rich.text import Text

    class LeftAlignedHeaderMarkdown(Markdown):
        def parse_header(self, token):
            level = token.type.count("h")
            content = Text(token.content)
            header_style = Style(color="bright_blue", bold=True)
            header = Text(f"{'#' * level} ", style=header_style) + content
            self.add_text(header)

    # Render the Markdown
    md = LeftAlignedHeaderMarkdown(markdown_content)

    # Capture the rendered output
    output = StringIO()
    console = Console(file=output, force_terminal=True, width=100)  # Set a fixed width
    console.print(md)
    rendered_content = output.getvalue()
    print(rendered_content)


================================================
FILE: models/cli/remove.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

import argparse
import os
import shutil

from llama_models.cli.subcommand import Subcommand
from llama_models.utils.config import DEFAULT_CHECKPOINT_DIR
from llama_models.sku_list import resolve_model


class Remove(Subcommand):
    """Remove the downloaded llama model"""

    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "remove",
            prog="llama-model remove",
            description="Remove the downloaded llama model",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        self._add_arguments()
        self.parser.set_defaults(func=self._run_model_remove_cmd)

    def _add_arguments(self):
        self.parser.add_argument(
            "-m",
            "--model",
            required=True,
            help="Name of the downloaded llama model to remove (see `llama-model list --downloaded`)",
        )
        self.parser.add_argument(
            "-f",
            "--force",
            action="store_true",
            help="Used to forcefully remove the llama model from the storage without further confirmation",
        )

    def _run_model_remove_cmd(self, args: argparse.Namespace) -> None:
        from llama_models.cli.safety_models import prompt_guard_model_sku_map

        prompt_guard_model_map = prompt_guard_model_sku_map()

        if args.model in prompt_guard_model_map.keys():
            model = prompt_guard_model_map[args.model]
        else:
            model = resolve_model(args.model)

        model_path = os.path.join(DEFAULT_CHECKPOINT_DIR, args.model.replace(":", "-"))

        if model is None or not os.path.isdir(model_path):
            print(f"'{args.model}' is not a valid llama model or does not exist.")
            return

        if args.force:
            shutil.rmtree(model_path)
            print(f"{args.model} removed.")
        else:
            if input(f"Are you sure you want to remove {args.model}? (y/n): ").strip().lower() == "y":
                shutil.rmtree(model_path)
                print(f"{args.model} removed.")
            else:
                print("Removal aborted.")


================================================
FILE: models/cli/safety_models.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

from typing import Any

from pydantic import BaseModel, ConfigDict, Field

from llama_models.sku_list import LlamaDownloadInfo
from llama_models.sku_types import CheckpointQuantizationFormat


class PromptGuardModel(BaseModel):
    """Make a 'fake' Model-like object for Prompt Guard. Eventually this will be removed."""

    model_id: str
    huggingface_repo: str
    description: str = "Prompt Guard. NOTE: this model will not be provided via `llama` CLI soon."
    is_featured: bool = False
    max_seq_length: int = 512
    is_instruct_model: bool = False
    quantization_format: CheckpointQuantizationFormat = CheckpointQuantizationFormat.bf16
    arch_args: dict[str, Any] = Field(default_factory=dict)

    def descriptor(self) -> str:
        return self.model_id

    model_config = ConfigDict(protected_namespaces=())


def prompt_guard_model_skus():
    return [
        PromptGuardModel(model_id="Prompt-Guard-86M", huggingface_repo="meta-llama/Prompt-Guard-86M"),
        PromptGuardModel(
            model_id="Llama-Prompt-Guard-2-86M",
            huggingface_repo="meta-llama/Llama-Prompt-Guard-2-86M",
        ),
        PromptGuardModel(
            model_id="Llama-Prompt-Guard-2-22M",
            huggingface_repo="meta-llama/Llama-Prompt-Guard-2-22M",
        ),
    ]


def prompt_guard_model_sku_map() -> dict[str, Any]:
    return {model.model_id: model for model in prompt_guard_model_skus()}


def prompt_guard_download_info_map() -> dict[str, LlamaDownloadInfo]:
    return {
        model.model_id: LlamaDownloadInfo(
            folder="Prompt-Guard" if model.model_id == "Prompt-Guard-86M" else model.model_id,
            files=[
                "model.safetensors",
                "special_tokens_map.json",
                "tokenizer.json",
                "tokenizer_config.json",
            ],
            pth_size=1,
        )
        for model in prompt_guard_model_skus()
    }


================================================
FILE: models/cli/subcommand.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.


class Subcommand:
    """All llama cli subcommands must inherit from this class"""

    def __init__(self, *args, **kwargs):
        pass

    @classmethod
    def create(cls, *args, **kwargs):
        return cls(*args, **kwargs)

    def _add_arguments(self):
        pass


================================================
FILE: models/cli/table.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

from collections.abc import Iterable

from rich.console import Console
from rich.table import Table


def print_table(rows, headers=None, separate_rows: bool = False, sort_by: Iterable[int] = tuple()):
    # Convert rows and handle None values
    rows = [[x or "" for x in row] for row in rows]

    # Sort rows if sort_by is specified
    if sort_by:
        rows.sort(key=lambda x: tuple(x[i] for i in sort_by))

    # Create Rich table
    table = Table(show_lines=separate_rows)

    # Add headers if provided
    if headers:
        for header in headers:
            table.add_column(header, style="bold white")
    else:
        # Add unnamed columns based on first row
        for _ in range(len(rows[0]) if rows else 0):
            table.add_column()

    # Add rows
    for row in rows:
        table.add_row(*row)

    # Print table
    console = Console()
    console.print(table)


================================================
FILE: models/cli/utils.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.


def print_subcommand_description(parser, subparsers):
    """Print descriptions of subcommands."""
    description_text = ""
    for name, subcommand in subparsers.choices.items():
        description = subcommand.description
        description_text += f"  {name:<21} {description}\n"
    parser.epilog = description_text


================================================
FILE: models/cli/verify_download.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

import argparse
import hashlib
from dataclasses import dataclass
from functools import partial
from pathlib import Path

from rich.console import Console
from rich.progress import Progress, SpinnerColumn, TextColumn

from llama_models.cli.subcommand import Subcommand


@dataclass
class VerificationResult:
    filename: str
    expected_hash: str
    actual_hash: str | None
    exists: bool
    matches: bool


class VerifyDownload(Subcommand):
    """Llama cli for verifying downloaded model files"""

    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "verify-download",
            prog="llama verify-download",
            description="Verify integrity of downloaded model files",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        setup_verify_download_parser(self.parser)


def setup_verify_download_parser(parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "--model-id",
        required=True,
        help="Model ID to verify (only for models downloaded from Meta)",
    )
    parser.set_defaults(func=partial(run_verify_cmd, parser=parser))


def calculate_sha256(filepath: Path, chunk_size: int = 8192) -> str:
    sha256_hash = hashlib.sha256()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256_hash.update(chunk)
    return sha256_hash.hexdigest()


def load_checksums(checklist_path: Path) -> dict[str, str]:
    checksums = {}
    with open(checklist_path) as f:
        for line in f:
            if line.strip():
                sha256sum, filepath = line.strip().split("  ", 1)
                # Remove a leading "./" prefix only; lstrip("./") would strip every leading "." and "/" character
                filepath = filepath.removeprefix("./")
                checksums[filepath] = sha256sum
    return checksums


def verify_files(model_dir: Path, checksums: dict[str, str], console: Console) -> list[VerificationResult]:
    results = []

    with Progress(
        SpinnerColumn(),
        TextColumn("[progress.description]{task.description}"),
        console=console,
    ) as progress:
        for filepath, expected_hash in checksums.items():
            full_path = model_dir / filepath
            task_id = progress.add_task(f"Verifying {filepath}...", total=None)

            exists = full_path.exists()
            actual_hash = None
            matches = False

            if exists:
                actual_hash = calculate_sha256(full_path)
                matches = actual_hash == expected_hash

            results.append(
                VerificationResult(
                    filename=filepath,
                    expected_hash=expected_hash,
                    actual_hash=actual_hash,
                    exists=exists,
                    matches=matches,
                )
            )

            progress.remove_task(task_id)

    return results


def run_verify_cmd(args: argparse.Namespace, parser: argparse.ArgumentParser):
    from llama_models.utils.model_utils import model_local_dir

    console = Console()
    model_dir = Path(model_local_dir(args.model_id))
    checklist_path = model_dir / "checklist.chk"

    if not model_dir.exists():
        parser.error(f"Model directory not found: {model_dir}")

    if not checklist_path.exists():
        parser.error(f"Checklist file not found: {checklist_path}")

    checksums = load_checksums(checklist_path)
    results = verify_files(model_dir, checksums, console)

    # Print results
    console.print("\nVerification Results:")

    all_good = True
    for result in results:
        if not result.exists:
            console.print(f"[red]❌ {result.filename}: File not found[/red]")
            all_good = False
        elif not result.matches:
            console.print(
                f"[red]❌ {result.filename}: Hash mismatch[/red]\n"
                f"   Expected: {result.expected_hash}\n"
                f"   Got:      {result.actual_hash}"
            )
            all_good = False
        else:
            console.print(f"[green]✓ {result.filename}: Verified[/green]")

    if all_good:
        console.print("\n[green]All files verified successfully![/green]")


================================================
FILE: models/datatypes.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

import base64
from enum import Enum
from io import BytesIO
from typing import Any, Dict, List, Literal, Optional, Union

from pydantic import BaseModel, ConfigDict, Field, field_serializer, field_validator
from typing_extensions import Annotated

# The goal is for this set of types to be relevant for all Llama models.
# That isn't the current state yet -- e.g., BuiltinTool is somewhat specific to
# the llama3 series of models.


class Role(Enum):
    system = "system"
    user = "user"
    assistant = "assistant"
    tool = "tool"


class BuiltinTool(Enum):
    brave_search = "brave_search"
    wolfram_alpha = "wolfram_alpha"
    photogen = "photogen"
    code_interpreter = "code_interpreter"


Primitive = Union[str, int, float, bool, None]
RecursiveType = Union[Primitive, List[Primitive], Dict[str, Primitive]]


class ToolCall(BaseModel):
    call_id: str
    tool_name: Union[BuiltinTool, str]
    # Plan is to deprecate the Dict in favor of a JSON string
    # that is parsed on the client side instead of trying to manage
    # the recursive type here.
    # Making this a union so that client side can start prepping for this change.
    # Eventually, we will remove both the Dict and arguments_json field,
    # and arguments will just be a str
    arguments: Union[str, Dict[str, RecursiveType]]
    arguments_json: Optional[str] = None

    @field_validator("tool_name", mode="before")
    @classmethod
    def validate_field(cls, v):
        if isinstance(v, str):
            try:
                return BuiltinTool(v)
            except ValueError:
                return v
        return v


class ToolPromptFormat(Enum):
    """Prompt format for calling custom / zero shot tools.

    :cvar json: JSON format for calling tools. It takes the form:
        {
            "type": "function",
            "function" : {
                "name": "function_name",
                "description": "function_description",
                "parameters": {...}
            }
        }
    :cvar function_tag: Function tag format, pseudo-XML. This looks like:
        <function=function_name>(parameters)</function>

    :cvar python_list: Python list. The output is a valid Python expression that can be
        evaluated to a list. Each element in the list is a function call. Example:
        ["function_name(param1, param2)", "function_name(param1, param2)"]
    """

    json = "json"
    function_tag = "function_tag"
    python_list = "python_list"


class StopReason(Enum):
    end_of_turn = "end_of_turn"
    end_of_message = "end_of_message"
    out_of_tokens = "out_of_tokens"


class ToolParamDefinition(BaseModel):
    param_type: str
    description: Optional[str] = None
    required: Optional[bool] = True
    default: Optional[Any] = None


class ToolDefinition(BaseModel):
    tool_name: Union[BuiltinTool, str]
    description: Optional[str] = None
    parameters: Optional[Dict[str, ToolParamDefinition]] = None

    @field_validator("tool_name", mode="before")
    @classmethod
    def validate_field(cls, v):
        if isinstance(v, str):
            try:
                return BuiltinTool(v)
            except ValueError:
                return v
        return v


class RawMediaItem(BaseModel):
    type: Literal["image"] = "image"
    data: bytes | BytesIO

    model_config = ConfigDict(arbitrary_types_allowed=True)

    @field_serializer("data")
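    def serialize_data(self, data: Optional[Union[bytes, BytesIO]], _info):
        if data is None:
            return None
        # BytesIO is not bytes-like, so take its underlying buffer before encoding
        if isinstance(data, BytesIO):
            data = data.getvalue()
        return base64.b64encode(data).decode("utf-8")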

    @field_validator("data", mode="before")
    @classmethod
    def validate_data(cls, v):
        if isinstance(v, str):
            return base64.b64decode(v)
        return v


class RawTextItem(BaseModel):
    type: Literal["text"] = "text"
    text: str


RawContentItem = Annotated[Union[RawTextItem, RawMediaItem], Field(discriminator="type")]

RawContent = str | RawContentItem | List[RawContentItem]


class RawMessage(BaseModel):
    role: Literal["user"] | Literal["system"] | Literal["tool"] | Literal["assistant"]
    content: RawContent

    # This is for RAG but likely should be absorbed into content
    context: Optional[RawContent] = None

    # These are for the output message coming from the assistant
    stop_reason: Optional[StopReason] = None
    tool_calls: List[ToolCall] = Field(default_factory=list)


class GenerationResult(BaseModel):
    token: int
    text: str
    logprobs: Optional[List[float]] = None

    source: Literal["input"] | Literal["output"]

    # index within the batch
    batch_idx: int
    # whether generation for this item is already finished. note that tokens can
    # get returned even afterwards since other items in the batch can still be generating tokens
    finished: bool
    # because a batch is parallel processed, useful decoding for one item can correspond to processing
    # pad tokens or tokens beyond EOS for other items. we could have decided to return None for this case
    # but it's more convenient to return a list of GenerationResult and filter out the ignored tokens
    ignore_token: bool


class QuantizationMode(str, Enum):
    none = "none"
    fp8_mixed = "fp8_mixed"
    int4_mixed = "int4_mixed"


================================================
FILE: models/llama2/LICENSE
================================================
LLAMA 2 COMMUNITY LICENSE AGREEMENT
Llama 2 Version Release Date: July 18, 2023

"Agreement" means the terms and conditions for use, reproduction, distribution and
modification of the Llama Materials set forth herein.

"Documentation" means the specifications, manuals and documentation
accompanying Llama 2 distributed by Meta at ai.meta.com/resources/models-and-
libraries/llama-downloads/.

"Licensee" or "you" means you, or your employer or any other person or entity (if
you are entering into this Agreement on such person or entity's behalf), of the age
required under applicable laws, rules or regulations to provide legal consent and that
has legal authority to bind your employer or such other person or entity if you are
entering in this Agreement on their behalf.

"Llama 2" means the foundational large language models and software and
algorithms, including machine-learning model code, trained model weights,
inference-enabling code, training-enabling code, fine-tuning enabling code and other
elements of the foregoing distributed by Meta at ai.meta.com/resources/models-and-
libraries/llama-downloads/.

"Llama Materials" means, collectively, Meta's proprietary Llama 2 and
Documentation (and any portion thereof) made available under this Agreement.

"Meta" or "we" means Meta Platforms Ireland Limited (if you are located in or, if you
are an entity, your principal place of business is in the EEA or Switzerland) and Meta
Platforms, Inc. (if you are located outside of the EEA or Switzerland).

By clicking "I Accept" below or by using or distributing any portion or element of the
Llama Materials, you agree to be bound by this Agreement.

1. License Rights and Redistribution.

      a. Grant of Rights. You are granted a non-exclusive, worldwide, non-
transferable and royalty-free limited license under Meta's intellectual property or
other rights owned by Meta embodied in the Llama Materials to use, reproduce,
distribute, copy, create derivative works of, and make modifications to the Llama
Materials.

      b. Redistribution and Use.

            i. If you distribute or make the Llama Materials, or any derivative works
thereof, available to a third party, you shall provide a copy of this Agreement to such
third party.
            ii.  If you receive Llama Materials, or any derivative works thereof, from
a Licensee as part of an integrated end user product, then Section 2 of this
Agreement will not apply to you.

            iii. You must retain in all copies of the Llama Materials that you
distribute the following attribution notice within a "Notice" text file distributed as a
part of such copies: "Llama 2 is licensed under the LLAMA 2 Community License,
Copyright (c) Meta Platforms, Inc. All Rights Reserved."

            iv. Your use of the Llama Materials must comply with applicable laws
and regulations (including trade compliance laws and regulations) and adhere to the
Acceptable Use Policy for the Llama Materials (available at
https://ai.meta.com/llama/use-policy), which is hereby incorporated by reference into
this Agreement.

            v. You will not use the Llama Materials or any output or results of the
Llama Materials to improve any other large language model (excluding Llama 2 or
derivative works thereof).

2. Additional Commercial Terms. If, on the Llama 2 version release date, the
monthly active users of the products or services made available by or for Licensee,
or Licensee's affiliates, is greater than 700 million monthly active users in the
preceding calendar month, you must request a license from Meta, which Meta may
grant to you in its sole discretion, and you are not authorized to exercise any of the
rights under this Agreement unless or until Meta otherwise expressly grants you
such rights.

3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE
LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE
PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY
WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR
FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE
FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING
THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR
USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.

4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE
LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT,
NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS
AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL,
CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN
IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF
ANY OF THE FOREGOING.

5. Intellectual Property.

      a. No trademark licenses are granted under this Agreement, and in
connection with the Llama Materials, neither Meta nor Licensee may use any name
or mark owned by or associated with the other or any of its affiliates, except as
required for reasonable and customary use in describing and redistributing the
Llama Materials.

      b. Subject to Meta's ownership of Llama Materials and derivatives made by or
for Meta, with respect to any derivative works and modifications of the Llama
Materials that are made by you, as between you and Meta, you are and will be the
owner of such derivative works and modifications.

      c. If you institute litigation or other proceedings against Meta or any entity
(including a cross-claim or counterclaim in a lawsuit) alleging that the Llama
Materials or Llama 2 outputs or results, or any portion of any of the foregoing,
constitutes an infringement of intellectual property or other rights owned or licensable
by you, then any licenses granted to you under this Agreement shall terminate as of
the date such litigation or claim is filed or instituted. You will indemnify and hold
harmless Meta from and against any claim by any third party arising out of or related
to your use or distribution of the Llama Materials.

6. Term and Termination. The term of this Agreement will commence upon your
acceptance of this Agreement or access to the Llama Materials and will continue in
full force and effect until terminated in accordance with the terms and conditions
herein. Meta may terminate this Agreement if you are in breach of any term or
condition of this Agreement. Upon termination of this Agreement, you shall delete
and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the
termination of this Agreement.

7. Governing Law and Jurisdiction. This Agreement will be governed and
construed under the laws of the State of California without regard to choice of law
principles, and the UN Convention on Contracts for the International Sale of Goods
does not apply to this Agreement. The courts of California shall have exclusive
jurisdiction of any dispute arising out of this Agreement.


================================================
FILE: models/llama2/MODEL_CARD.md
================================================
# **Model Details**

Meta developed and released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks we tested, and in our human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM.

**Model Developers** Meta

**Variations** Llama 2 comes in a range of parameter sizes — 7B, 13B, and 70B — as well as pretrained and fine-tuned variations.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture** Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

||Training Data|Params|Context Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Llama 2|*A new mix of publicly available online data*|7B|4k|&#10007;|2.0T|3.0 x 10<sup>-4</sup>|
|Llama 2|*A new mix of publicly available online data*|13B|4k|&#10007;|2.0T|3.0 x 10<sup>-4</sup>|
|Llama 2|*A new mix of publicly available online data*|70B|4k|&#10004;|2.0T|1.5 x 10<sup>-4</sup>|

**Llama 2 family of models.** Token counts refer to pretraining data only. All models are trained with a global batch-size of 4M tokens. The 70B version uses Grouped-Query Attention (GQA) for improved inference scalability.
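
As a rough sanity check on these figures, the token count and global batch size imply an approximate number of optimizer steps. The sketch below assumes that "4M" means exactly 4 x 10<sup>6</sup> tokens per step and that the batch size stayed constant throughout pretraining. Neither assumption is stated in this card, and the names used are illustrative.

```python
# Back-of-the-envelope estimate of the number of pretraining steps.
# Assumptions (not stated in the model card): "4M" = 4e6 tokens per step,
# and the global batch size was constant for the whole run.
PRETRAINING_TOKENS = 2.0e12  # 2.0T tokens, from the table
GLOBAL_BATCH_TOKENS = 4.0e6  # 4M tokens per global batch


def approx_training_steps(total_tokens: float, batch_tokens: float) -> int:
    """Return the approximate number of optimizer steps."""
    return round(total_tokens / batch_tokens)


steps = approx_training_steps(PRETRAINING_TOKENS, GLOBAL_BATCH_TOKENS)
print(f"~{steps:,} steps")
```

If "4M" instead denotes 4 x 2<sup>20</sup> tokens, the estimate drops by about 5%, so the result should be read as an order-of-magnitude figure only.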

**Model Dates** Llama 2 was trained between January 2023 and July 2023.

**Status** This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.

**License** A custom commercial license is available at: [https://ai.meta.com/resources/models-and-libraries/llama-downloads/](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)

**Research Paper** More information can be found in the paper "Llama 2: Open Foundation and Fine-Tuned Chat Models", available at https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/.

**Feedback:** Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md).

# **Intended Use**
**Intended Use Cases** Llama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.

**Out-of-scope Uses** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 2 Community License. Use in languages other than English**.

**Note: Developers may fine-tune Llama 2 models for languages beyond English provided they comply with the Llama 2 Community License and the Acceptable Use Policy.

# **Hardware and Software**
**Training Factors** We used custom training libraries, Meta's Research Super Cluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute.

**Carbon Footprint** Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W). Estimated total emissions were 539 tCO2eq, 100% of which were offset by Meta’s sustainability program.

||Time (GPU hours)|Power Consumption (W)|Carbon Emitted (tCO<sub>2</sub>eq)|
|---|---|---|---|
|Llama 2 7B|184320|400|31.22|
|Llama 2 13B|368640|400|62.44|
|Llama 2 70B|1720320|400|291.42|
|Total|3311616||539.00|

**CO<sub>2</sub> emissions during pretraining.** Time: total GPU time required for training each model. Power Consumption: peak power capacity per GPU device for the GPUs used adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others.
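
The per-model rows of the table can be cross-checked with simple arithmetic. The sketch below converts GPU hours and per-GPU power into energy, then derives the carbon intensity implied by each row. That intensity is an inferred quantity rather than one reported in this card, and the helper name is illustrative.

```python
# Cross-check of the per-model rows in the carbon footprint table.
# energy (kWh) = GPU hours * power (W) / 1000
# implied intensity (kgCO2eq/kWh) = emissions (tCO2eq) * 1000 / energy (kWh)
ROWS = {
    # model: (GPU hours, power in W, emissions in tCO2eq), copied from the table
    "Llama 2 7B": (184_320, 400, 31.22),
    "Llama 2 13B": (368_640, 400, 62.44),
    "Llama 2 70B": (1_720_320, 400, 291.42),
}


def implied_intensity(gpu_hours: float, power_w: float, emissions_t: float) -> float:
    """Return the carbon intensity (kgCO2eq per kWh) implied by one table row."""
    energy_kwh = gpu_hours * power_w / 1000
    return emissions_t * 1000 / energy_kwh


intensities = {name: implied_intensity(*row) for name, row in ROWS.items()}
for name, value in intensities.items():
    print(f"{name}: {value:.3f} kgCO2eq/kWh")
```

All three rows work out to roughly 0.42 kgCO2eq/kWh, which suggests that a single conversion factor was applied to each model's energy use.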

# **Training Data**
**Overview** Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.

**Data Freshness** The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023.

# **Evaluation Results**

In this section, we report the results for the Llama 1 and Llama 2 models on standard academic benchmarks.
For all the evaluations, we use our internal evaluations library.

|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math|MMLU|BBH|AGI Eval|
|---|---|---|---|---|---|---|---|---|---|
|Llama 1|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|23.9|
|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|33.9|
|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|41.7|
|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|47.6|
|Llama 2|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|29.3|
|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|39.1|
|Llama 2|70B|**37.5**|**71.9**|**63.6**|**69.4**|**35.2**|**68.9**|**51.2**|**54.2**|

**Overall performance on grouped academic benchmarks.** *Code:* We report the average pass@1 scores of our models on HumanEval and MBPP. *Commonsense Reasoning:* We report the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA. We report 7-shot results for CommonsenseQA and 0-shot results for all other benchmarks. *World Knowledge:* We evaluate the 5-shot performance on NaturalQuestions and TriviaQA and report the average. *Reading Comprehension:* We report the 0-shot average on SQuAD, QuAC, and BoolQ. *Math:* We report the average of the GSM8K (8-shot) and MATH (4-shot) benchmarks at top 1.

|Model|Size|TruthfulQA|ToxiGen|
|---|---|---|---|
|Llama 1|7B|27.42|23.00|
|Llama 1|13B|41.74|23.08|
|Llama 1|33B|44.19|22.57|
|Llama 1|65B|48.71|21.77|
|Llama 2|7B|33.29|**21.25**|
|Llama 2|13B|41.86|26.10|
|Llama 2|70B|**50.18**|24.60|

**Evaluation of pretrained LLMs on automatic safety benchmarks.** For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we present the percentage of toxic generations (the smaller the better).


|Model|Size|TruthfulQA|ToxiGen|
|---|---|---|---|
|Llama-2-Chat|7B|57.04|**0.00**|
|Llama-2-Chat|13B|62.18|**0.00**|
|Llama-2-Chat|70B|**64.14**|0.01|

**Evaluation of fine-tuned LLMs on different safety datasets.** Same metric definitions as above.

# **Ethical Considerations and Limitations**
Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 2’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their specific applications of the model.

Please see the Responsible Use Guide available at [https://ai.meta.com/llama/responsible-use-guide/](https://ai.meta.com/llama/responsible-use-guide/)


================================================
FILE: models/llama2/USE_POLICY.md
================================================
# Llama 2 Acceptable Use Policy

Meta is committed to promoting safe and fair use of its tools and features, including Llama 2. If you access or use Llama 2, you agree to this Acceptable Use Policy (“Policy”). The most recent copy of this policy can be found at [ai.meta.com/llama/use-policy](http://ai.meta.com/llama/use-policy).

## Prohibited Uses
We want everyone to use Llama 2 safely and responsibly. You agree you will not use, or allow others to use, Llama 2 to:

1. Violate the law or others’ rights, including to:
    1. Engage in, promote, generate, contribute to, encourage, plan, incite, or further illegal or unlawful activity or content, such as:
        1. Violence or terrorism
        2. Exploitation or harm to children, including the solicitation, creation, acquisition, or dissemination of child exploitative content or failure to report Child Sexual Abuse Material
        3. Human trafficking, exploitation, and sexual violence
        4. The illegal distribution of information or materials to minors, including obscene materials, or failure to employ legally required age-gating in connection with such information or materials.
        5. Sexual solicitation
        6. Any other criminal activity
    2. Engage in, promote, incite, or facilitate the harassment, abuse, threatening, or bullying of individuals or groups of individuals
    3. Engage in, promote, incite, or facilitate discrimination or other unlawful or harmful conduct in the provision of employment, employment benefits, credit, housing, other economic benefits, or other essential goods and services
    4. Engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or related professional practices
    5. Collect, process, disclose, generate, or infer health, demographic, or other sensitive personal or private information about individuals without rights and consents required by applicable laws
    6. Engage in or facilitate any action or generate any content that infringes, misappropriates, or otherwise violates any third-party rights, including the outputs or results of any products or services using the Llama 2 Materials
    7. Create, generate, or facilitate the creation of malicious code, malware, computer viruses or do anything else that could disable, overburden, interfere with or impair the proper working, integrity, operation or appearance of a website or computer system



2. Engage in, promote, incite, facilitate, or assist in the planning or development of activities that present a risk of death or bodily harm to individuals, including use of Llama 2 related to the following:
    1. Military, warfare, nuclear industries or applications, espionage, use for materials or activities that are subject to the International Traffic in Arms Regulations (ITAR) maintained by the United States Department of State
    2. Guns and illegal weapons (including weapon development)
    3. Illegal drugs and regulated/controlled substances
    4. Operation of critical infrastructure, transportation technologies, or heavy machinery
    5. Self-harm or harm to others, including suicide, cutting, and eating disorders
    6. Any content intended to incite or promote violence, abuse, or any infliction of bodily harm to an individual



3. Intentionally deceive or mislead others, including use of Llama 2 related to the following:
    1. Generating, promoting, or furthering fraud or the creation or promotion of disinformation
    2. Generating, promoting, or furthering defamatory content, including the creation of defamatory statements, images, or other content
    3. Generating, promoting, or further distributing spam
    4. Impersonating another individual without consent, authorization, or legal right
    5. Representing that the use of Llama 2 or outputs are human-generated
    6. Generating or facilitating false online engagement, including fake reviews and other means of fake online engagement
4. Fail to appropriately disclose to end users any known dangers of your AI system

Please report any violation of this Policy, software “bug,” or other problems that could lead to a violation of this Policy through one of the following means:

* Reporting issues with the model: [github.com/facebookresearch/llama](http://github.com/facebookresearch/llama)
* Reporting risky content generated by the model: [developers.facebook.com/llama_output_feedback](http://developers.facebook.com/llama_output_feedback)
* Reporting bugs and security concerns: [facebook.com/whitehat/info](http://facebook.com/whitehat/info)
* Reporting violations of the Acceptable Use Policy or unlicensed uses of Llama: [LlamaUseReport@meta.com](mailto:LlamaUseReport@meta.com)


================================================
FILE: models/llama3/LICENSE
================================================
META LLAMA 3 COMMUNITY LICENSE AGREEMENT

Meta Llama 3 Version Release Date: April 18, 2024
“Agreement” means the terms and conditions for use, reproduction, distribution and modification of the Llama Materials set forth herein.

“Documentation” means the specifications, manuals and documentation accompanying Meta Llama 3 distributed by Meta at https://llama.meta.com/get-started/.

“Licensee” or “you” means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity’s behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.

“Meta Llama 3” means the foundational large language models and software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Meta at https://llama.meta.com/llama-downloads.

“Llama Materials” means, collectively, Meta’s proprietary Meta Llama 3 and Documentation (and any portion thereof) made available under this Agreement.

“Meta” or “we” means Meta Platforms Ireland Limited (if you are located in or, if you are an entity, your principal place of business is in the EEA or Switzerland) and Meta Platforms, Inc. (if you are located outside of the EEA or Switzerland).

By clicking “I Accept” below or by using or distributing any portion or element of the Llama Materials, you agree to be bound by this Agreement.

1. License Rights and Redistribution.

	a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta’s intellectual property or other rights owned by Meta embodied in the Llama Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Llama Materials.
	b. Redistribution and Use.
		i. If you distribute or make available the Llama Materials (or any derivative works thereof), or a product or service that uses any of them, including another AI model, you shall (A) provide a copy of this Agreement with any such Llama Materials; and (B) prominently display “Built with Meta Llama 3” on a related website, user interface, blogpost, about page, or product documentation. If you use the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama 3” at the beginning of any such AI model name.
		ii. If you receive Llama Materials, or any derivative works thereof, from a Licensee as part of an integrated end user product, then Section 2 of this Agreement will not apply to you.
		iii. You must retain in all copies of the Llama Materials that you distribute the following attribution notice within a “Notice” text file distributed as a part of such copies: “Meta Llama 3 is licensed under the Meta Llama 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.”
		iv. Your use of the Llama Materials must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for the Llama Materials (available at https://llama.meta.com/llama3/use-policy), which is hereby incorporated by reference into this Agreement.
		v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Meta Llama 3 or derivative works thereof).

2. Additional Commercial Terms. If, on the Meta Llama 3 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, AND META DISCLAIMS ALL WARRANTIES OF ANY KIND, BOTH EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.

4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.

5. Intellectual Property.
	a. No trademark licenses are granted under this Agreement, and in connection with the Llama Materials, neither Meta nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, except as required for reasonable and customary use in describing and redistributing the Llama Materials or as set forth in this Section 5(a). Meta hereby grants you a license to use “Llama 3” (the “Mark”) solely as required to comply with the last sentence of Section 1.b.i. You will comply with Meta’s brand guidelines (currently accessible at https://about.meta.com/brand/resources/meta/company-brand/). All goodwill arising out of your use of the Mark will inure to the benefit of Meta.
	b. Subject to Meta’s ownership of Llama Materials and derivatives made by or for Meta, with respect to any derivative works and modifications of the Llama Materials that are made by you, as between you and Meta, you are and will be the owner of such derivative works and modifications.
	c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Meta Llama 3 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials.

6. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Llama Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Meta may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the termination of this Agreement.

7. Governing Law and Jurisdiction. This Agreement will be governed and construed under the laws of the State of California without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. The courts of California shall have exclusive jurisdiction of any dispute arising out of this Agreement.


Meta Llama 3 Acceptable Use Policy
Meta is committed to promoting safe and fair use of its tools and features, including Meta Llama 3. If you access or use Meta Llama 3, you agree to this Acceptable Use Policy (“Policy”). The most recent copy of this policy can be found at https://llama.meta.com/llama3/use-policy
Prohibited Uses
We want everyone to use Meta Llama 3 safely and responsibly. You agree you will not use, or allow others to use, Meta Llama 3 to:
1. Violate the law or others’ rights, including to:
	a. Engage in, promote, generate, contribute to, encourage, plan, incite, or further illegal or unlawful activity or content, such as:
      		i. Violence or terrorism
      		ii. Exploitation or harm to children, including the solicitation, creation, acquisition, or dissemination of child exploitative content or failure to report Child Sexual Abuse Material
      		iii. Human trafficking, exploitation, and sexual violence
      		iv. The illegal distribution of information or materials to minors, including obscene materials, or failure to employ legally required age-gating in connection with such information or materials.
      		v. Sexual solicitation
      		vi. Any other criminal activity
   	b. Engage in, promote, incite, or facilitate the harassment, abuse, threatening, or bullying of individuals or groups of individuals
   	c. Engage in, promote, incite, or facilitate discrimination or other unlawful or harmful conduct in the provision of employment, employment benefits, credit, housing, other economic benefits, or other essential goods and services
   	d. Engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or related professional practices
   	e. Collect, process, disclose, generate, or infer health, demographic, or other sensitive personal or private information about individuals without rights and consents required by applicable laws
   	f. Engage in or facilitate any action or generate any content that infringes, misappropriates, or otherwise violates any third-party rights, including the outputs or results of any products or services using the Llama Materials
   	g. Create, generate, or facilitate the creation of malicious code, malware, computer viruses or do anything else that could disable, overburden, interfere with or impair the proper working, integrity, operation or appearance of a website or computer system

2. Engage in, promote, incite, facilitate, or assist in the planning or development of activities that present a risk of death or bodily harm to individuals, including use of Meta Llama 3 related to the following:
   	a. Military, warfare, nuclear industries or applications, espionage, use for materials or activities that are subject to the International Traffic in Arms Regulations (ITAR) maintained by the United States Department of State
   	b. Guns and illegal weapons (including weapon development)
   	c. Illegal drugs and regulated/controlled substances
   	d. Operation of critical infrastructure, transportation technologies, or heavy machinery
   	e. Self-harm or harm to others, including suicide, cutting, and eating disorders
   	f. Any content intended to incite or promote violence, abuse, or any infliction of bodily harm to an individual

3. Intentionally deceive or mislead others, including use of Meta Llama 3 related to the following:
   	a. Generating, promoting, or furthering fraud or the creation or promotion of disinformation
   	b. Generating, promoting, or furthering defamatory content, including the creation of defamatory statements, images, or other content
   	c. Generating, promoting, or further distributing spam
   	d. Impersonating another individual without consent, authorization, or legal right
   	e. Representing that the use of Meta Llama 3 or outputs are human-generated
   	f. Generating or facilitating false online engagement, including fake reviews and other means of fake online engagement
4. Fail to appropriately disclose to end users any known dangers of your AI system

Please report any violation of this Policy, software “bug,” or other problems that could lead to a violation of this Policy through one of the following means:
   	* Reporting issues with the model: https://github.com/meta-llama/llama3
   	* Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
   	* Reporting bugs and security concerns: facebook.com/whitehat/info
   	* Reporting violations of the Acceptable Use Policy or unlicensed uses of Meta Llama 3: LlamaUseReport@meta.com


================================================
FILE: models/llama3/MODEL_CARD.md
================================================
## Model Details

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B and 70B sizes. The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks. Further, in developing these models, we took great care to optimize helpfulness and safety.

**Model developers** Meta

**Llama 3 family of models** Llama 3 comes in two sizes — 8B and 70B parameters — in pre-trained and instruction tuned variants.

**Input** Models input text only.

**Output** Models generate text and code only.

**Model Architecture** Llama 3 is an auto-regressive language model that uses an optimized transformer architecture. Llama 3 uses a tokenizer with a vocabulary of 128K tokens, and was trained on sequences of 8,192 tokens. Grouped-Query Attention (GQA) is used for all models to improve inference efficiency. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.
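One way to see why GQA helps inference is to estimate the key/value cache a single sequence needs. The sketch below is a rough illustration, not a description of the released implementation. The layer count, head counts, head dimension, and 16-bit cache width are assumptions that this card does not state, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of the key and value caches for one sequence."""
    # Factor of 2: one tensor for keys and one for values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed shape (not stated in this card): 32 layers, 32 query heads,
# 8 key/value heads, head dimension 128, 16-bit cache entries.
SEQ_LEN = 8192  # training sequence length from the paragraph above

mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=SEQ_LEN)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

print(f"one KV head per query head: {mha / 2**30:.0f} GiB")
print(f"grouped-query (8 KV heads): {gqa / 2**30:.0f} GiB")
```

With these assumed dimensions, sharing each key/value head across four query heads cuts the per-sequence cache from 4 GiB to 1 GiB at the full 8,192-token length.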


<table>
  <tr>
   <td>
   </td>
   <td><strong>Training Data</strong>
   </td>
   <td><strong>Params</strong>
   </td>
   <td><strong>Context length</strong>
   </td>
   <td><strong>GQA</strong>
   </td>
   <td><strong>Token count</strong>
   </td>
   <td><strong>Knowledge cutoff</strong>
   </td>
  </tr>
  <tr>
   <td rowspan="2" >Llama 3
   </td>
   <td rowspan="2" >A new mix of publicly available online data.
   </td>
   <td>8B
   </td>
   <td>8k
   </td>
   <td>Yes
   </td>
   <td rowspan="2" >15T+
   </td>
   <td>March, 2023
   </td>
  </tr>
  <tr>
   <td>70B
   </td>
   <td>8k
   </td>
   <td>Yes
   </td>
   <td>December, 2023
   </td>
  </tr>
</table>


Note: Token counts refer to pretraining data only.

**Model Release Date** April 18, 2024.

**Status** This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.

**License** A custom commercial license is available at: [https://llama.meta.com/llama3/license](https://llama.meta.com/llama3/license)

**Feedback:** Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use Llama 3 in applications, please go [here](https://github.com/meta-llama/llama-recipes).


## Intended Use

**Intended Use Cases** Llama 3 is intended for commercial and research use in English. Instruction tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.

**Out-of-scope** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the [Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/) and [Llama 3 Community License](https://llama.meta.com/llama3/license/). Use in languages other than English**.

**Note: Developers may fine-tune Llama 3 models for languages beyond English provided they comply with the [Llama 3 Community License](https://llama.meta.com/llama3/license/) and the [Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).


## Hardware and Software

**Training Factors** We used custom training libraries, Meta's Research SuperCluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute.

**Carbon Footprint** Pretraining utilized a cumulative 7.7M GPU hours of computation on hardware of type H100-80GB (TDP of 700W). Estimated total emissions were 2290 tCO2eq, 100% of which were offset by Meta’s sustainability program.


<table>
  <tr>
   <td>
   </td>
   <td><strong>Time (GPU hours)</strong>
   </td>
   <td><strong>Power Consumption (W)</strong>
   </td>
   <td><strong>Carbon Emitted (tCO2eq)</strong>
   </td>
  </tr>
  <tr>
   <td>Llama 3 8B
   </td>
   <td>1.3M
   </td>
   <td>700
   </td>
   <td>390
   </td>
  </tr>
  <tr>
   <td>Llama 3 70B
   </td>
   <td>6.4M
   </td>
   <td>700
   </td>
   <td>1900
   </td>
  </tr>
  <tr>
   <td>Total
   </td>
   <td>7.7M
   </td>
   <td>
   </td>
   <td>2290
   </td>
  </tr>
</table>



**CO2 emissions during pre-training**. Time: total GPU time required for training each model. Power Consumption: peak power capacity per GPU device for the GPUs used adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others.
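As a quick sanity check on the table above, the following sketch sums the per-model rows and compares them with the reported totals. It also derives a total energy figure under the assumption that every GPU hour draws the stated peak power, which is a simplification introduced here and not part of the card.

```python
# Rows copied from the table above: (GPU hours, power in W, tCO2eq).
ROWS = {
    "Llama 3 8B": (1_300_000, 700, 390),
    "Llama 3 70B": (6_400_000, 700, 1900),
}
REPORTED_TOTAL = (7_700_000, 2290)  # (GPU hours, tCO2eq) from the Total row

total_hours = sum(hours for hours, _, _ in ROWS.values())
total_tco2eq = sum(tons for _, _, tons in ROWS.values())
# Energy in MWh, assuming every GPU hour draws the stated peak power.
total_mwh = sum(hours * watts for hours, watts, _ in ROWS.values()) / 1_000_000

print("rows match reported totals:",
      (total_hours, total_tco2eq) == REPORTED_TOTAL)
print(f"implied energy: {total_mwh:.0f} MWh")
```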


## Training Data

**Overview** Llama 3 was pretrained on over 15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 10M human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.

**Data Freshness** The pretraining data has a cutoff of March 2023 for the 8B and December 2023 for the 70B models respectively.


## Benchmarks

In this section, we report the results for Llama 3 models on standard automatic benchmarks. For all the evaluations, we use our internal evaluations library. For details on the methodology see [here](https://github.com/meta-llama/llama3/blob/main/eval_details.md).


### Base pretrained models


<table>
  <tr>
   <td><strong>Category</strong>
   </td>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>Llama 3 8B</strong>
   </td>
   <td><strong>Llama 2 7B</strong>
   </td>
   <td><strong>Llama 2 13B</strong>
   </td>
   <td><strong>Llama 3 70B</strong>
   </td>
   <td><strong>Llama 2 70B</strong>
   </td>
  </tr>
  <tr>
   <td rowspan="6" >General
   </td>
   <td>MMLU (5-shot)
   </td>
   <td>66.6
   </td>
   <td>45.7
   </td>
   <td>53.8
   </td>
   <td>79.5
   </td>
   <td>69.7
   </td>
  </tr>
  <tr>
   <td>AGIEval English (3-5 shot)
   </td>
   <td>45.9
   </td>
   <td>28.8
   </td>
   <td>38.7
   </td>
   <td>63.0
   </td>
   <td>54.8
   </td>
  </tr>
  <tr>
   <td>CommonSenseQA (7-shot)
   </td>
   <td>72.6
   </td>
   <td>57.6
   </td>
   <td>67.6
   </td>
   <td>83.8
   </td>
   <td>78.7
   </td>
  </tr>
  <tr>
   <td>Winogrande (5-shot)
   </td>
   <td>76.1
   </td>
   <td>73.3
   </td>
   <td>75.4
   </td>
   <td>83.1
   </td>
   <td>81.8
   </td>
  </tr>
  <tr>
   <td>BIG-Bench Hard (3-shot, CoT)
   </td>
   <td>61.1
   </td>
   <td>38.1
   </td>
   <td>47.0
   </td>
   <td>81.3
   </td>
   <td>65.7
   </td>
  </tr>
  <tr>
   <td>ARC-Challenge (25-shot)
   </td>
   <td>78.6
   </td>
   <td>53.7
   </td>
   <td>67.6
   </td>
   <td>93.0
   </td>
   <td>85.3
   </td>
  </tr>
  <tr>
   <td>Knowledge reasoning
   </td>
   <td>TriviaQA-Wiki (5-shot)
   </td>
   <td>78.5
   </td>
   <td>72.1
   </td>
   <td>79.6
   </td>
   <td>89.7
   </td>
   <td>87.5
   </td>
  </tr>
  <tr>
   <td rowspan="4" >Reading comprehension
   </td>
   <td>SQuAD (1-shot)
   </td>
   <td>76.4
   </td>
   <td>72.2
   </td>
   <td>72.1
   </td>
   <td>85.6
   </td>
   <td>82.6
   </td>
  </tr>
  <tr>
   <td>QuAC (1-shot, F1)
   </td>
   <td>44.4
   </td>
   <td>39.6
   </td>
   <td>44.9
   </td>
   <td>51.1
   </td>
   <td>49.4
   </td>
  </tr>
  <tr>
   <td>BoolQ (0-shot)
   </td>
   <td>75.7
   </td>
   <td>65.5
   </td>
   <td>66.9
   </td>
   <td>79.0
   </td>
   <td>73.1
   </td>
  </tr>
  <tr>
   <td>DROP (3-shot, F1)
   </td>
   <td>58.4
   </td>
   <td>37.9
   </td>
   <td>49.8
   </td>
   <td>79.7
   </td>
   <td>70.2
   </td>
  </tr>
</table>



### Instruction tuned models


<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>Llama 3 8B</strong>
   </td>
   <td><strong>Llama 2 7B</strong>
   </td>
   <td><strong>Llama 2 13B</strong>
   </td>
   <td><strong>Llama 3 70B</strong>
   </td>
   <td><strong>Llama 2 70B</strong>
   </td>
  </tr>
  <tr>
   <td>MMLU (5-shot)
   </td>
   <td>68.4
   </td>
   <td>34.1
   </td>
   <td>47.8
   </td>
   <td>82.0
   </td>
   <td>52.9
   </td>
  </tr>
  <tr>
   <td>GPQA (0-shot)
   </td>
   <td>34.2
   </td>
   <td>21.7
   </td>
   <td>22.3
   </td>
   <td>39.5
   </td>
   <td>21.0
   </td>
  </tr>
  <tr>
   <td>HumanEval (0-shot)
   </td>
   <td>62.2
   </td>
   <td>7.9
   </td>
   <td>14.0
   </td>
   <td>81.7
   </td>
   <td>25.6
   </td>
  </tr>
  <tr>
   <td>GSM-8K (8-shot, CoT)
   </td>
   <td>79.6
   </td>
   <td>25.7
   </td>
   <td>41.2
   </td>
   <td>93.0
   </td>
   <td>57.5
   </td>
  </tr>
  <tr>
   <td>MATH (4-shot, CoT)
   </td>
   <td>30.0
   </td>
   <td>3.8
   </td>
   <td>6.7
   </td>
   <td>50.4
   </td>
   <td>11.6
   </td>
  </tr>
</table>



### Responsibility & Safety

We believe that an open approach to AI leads to better, safer products, faster innovation, and a bigger overall market. We are committed to Responsible AI development and took a series of steps to limit misuse and harm and support the open source community.

Foundation models are widely capable technologies that are built to be used for a diverse range of applications. They are not designed to meet every developer preference on safety levels for all use cases, out-of-the-box, as those by their nature will differ across different applications.

Rather, responsible LLM-application deployment is achieved by implementing a series of safety best practices throughout the development of such applications, from model pre-training and fine-tuning to the deployment of systems composed of safeguards that tailor safety to the specific use case and audience.


As part of the Llama 3 release, we updated our [Responsible Use Guide](https://llama.meta.com/responsible-use-guide/) to outline the steps and best practices for developers to implement model and system level safety for their application. We also provide a set of resources including [Meta Llama Guard 2](https://llama.meta.com/purple-llama/) and [Code Shield](https://llama.meta.com/purple-llama/) safeguards. These tools have proven to drastically reduce residual risks of LLM Systems, while maintaining a high level of helpfulness. We encourage developers to tune and deploy these safeguards according to their needs and we provide a [reference implementation](https://github.com/meta-llama/llama-recipes/tree/main/recipes/responsible_ai) to get you started.


#### Llama 3-Instruct

As outlined in the Responsible Use Guide, some trade-off between model helpfulness and model alignment is likely unavoidable. Developers should exercise discretion about how to weigh the benefits of alignment and helpfulness for their specific use case and audience. Developers should be mindful of residual risks when using Llama models and leverage additional safety tools as needed to reach the right safety bar for their use case.

<span style="text-decoration:underline;">Safety</span>

For our instruction tuned model, we conducted extensive red teaming exercises, performed adversarial evaluations and implemented safety mitigation techniques to lower residual risks. As with any Large Language Model, residual risks will likely remain and we recommend that developers assess these risks in the context of their use case. In parallel, we are working with the community to make AI safety benchmark standards transparent, rigorous and interpretable.

<span style="text-decoration:underline;">Refusals</span>

In addition to residual risks, we put a great emphasis on model refusals to benign prompts. Over-refusing can not only degrade the user experience but also be harmful in certain contexts. We’ve heard the feedback from the developer community and improved our fine-tuning to ensure that Llama 3 is significantly less likely to falsely refuse to answer prompts than Llama 2.

We built internal benchmarks and developed mitigations to limit false refusals, making Llama 3 our most helpful model to date.


#### Responsible release

In addition to responsible use considerations outlined above, we followed a rigorous process that requires us to take extra measures against misuse and critical risks before we make our release decision.

Misuse

If you access or use Llama 3, you agree to the Acceptable Use Policy. The most recent copy of this policy can be found at [https://llama.meta.com/llama3/use-policy/](https://llama.meta.com/llama3/use-policy/).


#### Critical risks

<span style="text-decoration:underline;">CBRNE</span> (Chemical, Biological, Radiological, Nuclear, and high yield Explosives)

We have conducted a twofold assessment of the safety of the model in this area:



* Iterative testing during model training to assess the safety of responses related to CBRNE threats and other adversarial risks.
* Involving external CBRNE experts to conduct an uplift test assessing the ability of the model to accurately provide expert knowledge and reduce barriers to potential CBRNE misuse, by reference to what can be achieved using web search (without the model).


### <span style="text-decoration:underline;">Cyber Security </span>

We have evaluated Llama 3 with CyberSecEval, Meta’s cybersecurity safety eval suite, measuring Llama 3’s propensity to suggest insecure code when used as a coding assistant, and Llama 3’s propensity to comply with requests to help carry out cyber attacks, where attacks are defined by the industry standard MITRE ATT&CK cyber attack ontology. On our insecure coding and cyber attacker helpfulness tests, Llama 3 behaved in the same range or safer than models of [equivalent coding capability](https://huggingface.co/spaces/facebook/CyberSecEval).


### <span style="text-decoration:underline;">Child Safety</span>

Child Safety risk assessments were conducted by a team of experts to assess the model’s capability to produce outputs that could result in Child Safety risks and to inform any necessary and appropriate risk mitigations via fine-tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective-based methodologies to assess the model risks along multiple attack vectors. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market-specific nuances or experiences.


### Community

Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners including cloud service providers. We encourage community contributions to our [GitHub repository](https://github.com/meta-llama/PurpleLlama).

Finally, we put in place a set of resources including an [output reporting mechanism](https://developers.facebook.com/llama_output_feedback) and [bug bounty program](https://www.facebook.com/whitehat) to continuously improve the Llama technology with the help of the community.


## Ethical Considerations and Limitations

The core values of Llama 3 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 3 addresses users and their needs as they are, without inserting unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress.

But Llama 3 is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 3 models, developers should perform safety testing and tuning tailored to their specific applications of the model. As outlined in the Responsible Use Guide, we recommend incorporating [Purple Llama](https://github.com/facebookresearch/PurpleLlama) solutions into your workflows and specifically [Llama Guard](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/) which provides a base model to filter input and output prompts to layer system-level safety on top of model-level safety.
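
The layering described above, with a safeguard screening both the prompt and the completion, can be sketched roughly as follows. This is a minimal illustration and not Llama Guard's actual interface: `guarded_generate`, `GuardedResponse`, `toy_is_safe` and `toy_generate` are hypothetical names, and the keyword check merely stands in for a real safety classifier.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical refusal text returned whenever a check fails.
REFUSAL = "Sorry, I can't help with that."


@dataclass
class GuardedResponse:
    text: str
    blocked_at: Optional[str]  # "input", "output", or None if nothing was blocked


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
    refusal: str = REFUSAL,
) -> GuardedResponse:
    """Wrap a text generator with an input check and an output check."""
    # System-level check on the user prompt, before the model is called.
    if not is_safe(prompt):
        return GuardedResponse(text=refusal, blocked_at="input")

    completion = generate(prompt)

    # System-level check on the model output, before it reaches the user.
    if not is_safe(completion):
        return GuardedResponse(text=refusal, blocked_at="output")

    return GuardedResponse(text=completion, blocked_at=None)


# Toy stand-ins (assumptions for illustration only): a keyword filter instead
# of a safety classifier, and an echo function instead of a language model.
def toy_is_safe(text: str) -> bool:
    return "forbidden" not in text.lower()


def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"


result = guarded_generate("hello", toy_generate, toy_is_safe)
print(result.blocked_at, result.text)
```

In a real deployment, `is_safe` would typically call a dedicated safeguard model with the conversation context, and the input and output checks may use different policies; the sketch only shows where the two checks sit relative to generation.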

Please see the Responsible Use Guide available at [http://llama.meta.com/responsible-use-guide](http://llama.meta.com/responsible-use-guide)


## Citation instructions

```
@article{llama3modelcard,
  title={Llama 3 Model Card},
  author={AI@Meta},
  year={2024},
  url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}
```

## Contributors

Aaditya Singh; Aaron Grattafiori; Abhimanyu Dubey; Abhinav Jauhri; Abhinav Pandey; Abhishek Kadian; Adam Kelsey; Adi Gangidi; Ahmad Al-Dahle; Amit Sangani; Ahuva Goldstand; Aiesha Letman; Ajay Menon; Akhil Mathur; Alan Schelten; Alex Vaughan; Amy Yang; Andrei Lupu; Andres Alvarado; Andrew Gallagher; Andrew Gu; Andrew Ho; Andrew Poulton; Andrew Ryan; Angela Fan; Ankit Ramchandani; Anthony Hartshorn; Archi Mitra; Archie Sravankumar; Artem Korenev; Arun Rao; Ashley Gabriel; Ashwin Bharambe; Assaf Eisenman; Aston Zhang; Ash JJhaveri; Aurelien Rodriguez; Austen Gregerson; Ava Spataru; Baptiste Roziere; Ben Maurer; Benjamin Leonhardi; Bernie Huang; Bhargavi Paranjape; Bing Liu; Binh Tang; Bobbie Chern; Brani Stojkovic; Brian Fuller; Catalina Mejia Arenas; Chao Zhou; Charlotte Caucheteux; Chaya Nayak; Ching-Hsiang Chu; Chloe Bi; Chris Cai; Chris Cox; Chris Marra; Chris McConnell; Christian Keller; Christoph Feichtenhofer; Christophe Touret; Chunyang Wu; Corinne Wong; Cristian Canton Ferrer; Damien Allonsius; Daniel Kreymer; Daniel Haziza; Daniel Li; Danielle Pintz; Danny Livshits; Danny Wyatt; David Adkins; David Esiobu; David Xu; Davide Testuggine; Delia David; Devi Parikh; Dhruv Choudhary; Dhruv Mahajan; Diana Liskovich; Diego Garcia-Olano; Diego Perino; Dieuwke Hupkes; Dingkang Wang; Dustin Holland; Egor Lakomkin; Elina Lobanova; Xiaoqing Ellen Tan; Emily Dinan; Eric Smith; Erik Brinkman; Esteban Arcaute; Filip Radenovic; Firat Ozgenel; Francesco Caggioni; Frank Seide; Frank Zhang; Gabriel Synnaeve; Gabriella Schwarz; Gabrielle Lee; Gada Badeer; Georgia Anderson; Graeme Nail; Gregoire Mialon; Guan Pang; Guillem Cucurell; Hailey Nguyen; Hamid Shojanazeri; Hannah Korevaar; Hannah Wang; Haroun Habeeb; Harrison Rudolph; Henry Aspegren; Hu Xu; Hugo Touvron; Iga Kozlowska; Igor Molybog; Igor Tufanov; Iliyan Zarov; Imanol Arrieta Ibarra; Irina-Elena Veliche; Isabel Kloumann; Ishan Misra; Ivan Evtimov; Jade Copet; Jake Weissman; Jan Geffert; Jana Vranes; Japhet Asher; Jason 
Park; Jay Mahadeokar; Jean-Baptiste Gaya; Jeet Shah; Jelmer van der Linde; Jennifer Chan; Jenny Hong; Jenya Lee; Jeremy Fu; Jeremy Teboul; Jianfeng Chi; Jianyu Huang; Jie Wang; Jiecao Yu; Joanna Bitton; Joe Spisak; Joelle Pineau; Jon Carvill; Jongsoo Park; Joseph Rocca; Joshua Johnstun; Junteng Jia; Kalyan Vasuden Alwala; Kam Hou U; Kate Plawiak; Kartikeya Upasani; Kaushik Veeraraghavan; Ke Li; Kenneth Heafield; Kevin Stone; Khalid El-Arini; Krithika Iyer; Kshitiz Malik; Kuenley Chiu; Kunal Bhalla; Kyle Huang; Lakshya Garg; Lauren Rantala-Yeary; Laurens van der Maaten; Lawrence Chen; Leandro Silva; Lee Bell; Lei Zhang; Liang Tan; Louis Martin; Lovish Madaan; Luca Wehrstedt; Lukas Blecher; Luke de Oliveira; Madeline Muzzi; Madian Khabsa; Manav Avlani; Mannat Singh; Manohar Paluri; Mark Zuckerberg; Marcin Kardas; Martynas Mankus; Mathew Oldham; Mathieu Rita; Matthew Lennie; Maya Pavlova; Meghan Keneally; Melanie Kambadur; Mihir Patel; Mikayel Samvelyan; Mike Clark; Mike Lewis; Min Si; Mitesh Kumar Singh; Mo Metanat; Mona Hassan; Naman Goyal; Narjes Torabi; Nicolas Usunier; Nikolay Bashlykov; Nikolay Bogoychev; Niladri Chatterji; Ning Dong; Oliver Aobo Yang; Olivier Duchenne; Onur Celebi; Parth Parekh; Patrick Alrassy; Paul Saab; Pavan Balaji; Pedro Rittner; Pengchuan Zhang; Pengwei Li; Petar Vasic; Peter Weng; Polina Zvyagina; Prajjwal Bhargava; Pratik Dubal; Praveen Krishnan; Punit Singh Koura; Puxin Xu; Qing He; Rachel Rodriguez; Ragavan Srinivasan; Rahul Mitra; Ramon Calderer; Raymond Li; Robert Stojnic; Roberta Raileanu; Robin Battey; Rocky Wang; Rohit Girdhar; Rohit Patel; Romain Sauvestre; Ronnie Polidoro; Roshan Sumbaly; Ross Taylor; Ruan Silva; Rui Hou; Rui Wang; Russ Howes; Ruty Rinott; Saghar Hosseini; Sai Jayesh Bondu; Samyak Datta; Sanjay Singh; Sara Chugh; Sargun Dhillon; Satadru Pan; Sean Bell; Sergey Edunov; Shaoliang Nie; Sharan Narang; Sharath Raparthy; Shaun Lindsay; Sheng Feng; Sheng Shen; Shenghao Lin; Shiva Shankar; Shruti Bhosale; Shun Zhang; 
Simon Vandenhende; Sinong Wang; Seohyun Sonia Kim; Soumya Batra; Sten Sootla; Steve Kehoe; Suchin Gururangan; Sumit Gupta; Sunny Virk; Sydney Borodinsky; Tamar Glaser; Tamar Herman; Tamara Best; Tara Fowler; Thomas Georgiou; Thomas Scialom; Tianhe Li; Todor Mihaylov; Tong Xiao; Ujjwal Karn; Vedanuj Goswami; Vibhor Gupta; Vignesh Ramanathan; Viktor Kerkez; Vinay Satish Kumar; Vincent Gonguet; Vish Vogeti; Vlad Poenaru; Vlad Tiberiu Mihailescu; Vladan Petrovic; Vladimir Ivanov; Wei Li; Weiwei Chu; Wenhan Xiong; Wenyin Fu; Wes Bouaziz; Whitney Meers; Will Constable; Xavier Martinet; Xiaojian Wu; Xinbo Gao; Xinfeng Xie; Xuchao Jia; Yaelle Goldschlag; Yann LeCun; Yashesh Gaur; Yasmine Babaei; Ye Qi; Yenda Li; Yi Wen; Yiwen Song; Youngjin Nam; Yuchen Hao; Yuchen Zhang; Yun Wang; Yuning Mao; Yuzi He; Zacharie Delpierre Coudert; Zachary DeVito; Zahra Hankir; Zhaoduo Wen; Zheng Yan; Zhengxing Chen; Zhenyu Yang; Zoe Papakipos


================================================
FILE: models/llama3/USE_POLICY.md
================================================
# Meta Llama 3 Acceptable Use Policy

Meta is committed to promoting safe and fair use of its tools and features, including Llama 3. If you access or use Llama 3, you agree to this Acceptable Use Policy (“Policy”). The most recent copy of this policy can be found at [ai.meta.com/llama/use-policy](http://ai.meta.com/llama/use-policy).

## Prohibited Uses
We want everyone to use Llama 3 safely and responsibly. You agree you will not use, or allow others to use, Llama 3 to:

1. Violate the law or others’ rights, including to:
    1. Engage in, promote, generate, contribute to, encourage, plan, incite, or further illegal or unlawful activity or content, such as:
        1. Violence or terrorism
        2. Exploitation or harm to children, including the solicitation, creation, acquisition, or dissemination of child exploitative content or failure to report Child Sexual Abuse Material
        3. Human trafficking, exploitation, and sexual violence
        4. The illegal distribution of information or materials to minors, including obscene materials, or failure to employ legally required age-gating in connection with such information or materials.
        5. Sexual solicitation
        6. Any other criminal activity
    2. Engage in, promote, incite, or facilitate the harassment, abuse, threatening, or bullying of individuals or groups of individuals
    3. Engage in, promote, incite, or facilitate discrimination or other unlawful or harmful conduct in the provision of employment, employment benefits, credit, housing, other economic benefits, or other essential goods and services
    4. Engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or related professional practices
    5. Collect, process, disclose, generate, or infer health, demographic, or other sensitive personal or private information about individuals without rights and consents required by applicable laws
    6. Engage in or facilitate any action or generate any content that infringes, misappropriates, or otherwise violates any third-party rights, including the outputs or results of any products or services using the Llama 3 Materials
    7. Create, generate, or facilitate the creation of malicious code, malware, computer viruses or do anything else that could disable, overburden, interfere with or impair the proper working, integrity, operation or appearance of a website or computer system



2. Engage in, promote, incite, facilitate, or assist in the planning or development of activities that present a risk of death or bodily harm to individuals, including use of Llama 3 related to the following:
    1. Military, warfare, nuclear industries or applications, espionage, use for materials or activities that are subject to the International Traffic Arms Regulations (ITAR) maintained by the United States Department of State
    2. Guns and illegal weapons (including weapon development)
    3. Illegal drugs and regulated/controlled substances
    4. Operation of critical infrastructure, transportation technologies, or heavy machinery
    5. Self-harm or harm to others, including suicide, cutting, and eating disorders
    6. Any content intended to incite or promote violence, abuse, or any infliction of bodily harm to an individual



3. Intentionally deceive or mislead others, including use of Llama 3 related to the following:
    1. Generating, promoting, or furthering fraud or the creation or promotion of disinformation
    2. Generating, promoting, or furthering defamatory content, including the creation of defamatory statements, images, or other content
    3. Generating, promoting, or further distributing spam
    4. Impersonating another individual without consent, authorization, or legal right
    5. Representing that the use of Llama 3 or outputs are human-generated
    6. Generating or facilitating false online engagement, including fake reviews and other means of fake online engagement
4. Fail to appropriately disclose to end users any known dangers of your AI system

Please report any violation of this Policy, software “bug,” or other problems that could lead to a violation of this Policy through one of the following means:

* Reporting issues with the model: [github.com/facebookresearch/llama](http://github.com/facebookresearch/llama)
* Reporting risky content generated by the model: [developers.facebook.com/llama_output_feedback](http://developers.facebook.com/llama_output_feedback)
* Reporting bugs and security concerns: [facebook.com/whitehat/info](http://facebook.com/whitehat/info)
* Reporting violations of the Acceptable Use Policy or unlicensed uses of Llama: [LlamaUseReport@meta.com](mailto:LlamaUseReport@meta.com)


================================================
FILE: models/llama3/__init__.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.


================================================
FILE: models/llama3/args.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class QuantizationScheme(Enum):
    int4_weight_int8_dynamic_activation = "int4_weight_int8_dynamic_activation"


@dataclass
class QuantizationArgs:
    scheme: Optional[QuantizationScheme] = None
    group_size: Optional[int] = None
    spinquant: bool = False

    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            if k == "scheme":
                setattr(self, k, QuantizationScheme(v))
            else:
                if hasattr(self, k):
                    setattr(self, k, v)


@dataclass
class LoRAArgs:
    rank: int
    scale: float


@dataclass
class ModelArgs:
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: Optional[int] = None
    vocab_size: int = -1
    multiple_of: int = 256  # make SwiGLU hidden layer size multiple of large power of 2
    ffn_dim_multiplier: Optional[float] = None
    norm_eps: float = 1e-5
    rope_theta: float = 500000
    use_scaled_rope: bool = False

    max_batch_size: int = 32
    max_seq_len: int = 2048

    # vision model params
    vision_chunk_size: int = -1  # image resolution for image models
    vision_max_num_chunks: int = 4
    vision_num_cross_attention_layers: int = -1

    quantization_args: Optional[QuantizationArgs] = None
    lora_args: Optional[LoRAArgs] = None

    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            if k == "lora_args":
                setattr(self, k, LoRAArgs(**v))
            elif k == "quantization_args":
                setattr(self, k, QuantizationArgs(**v))
            elif k == "vision_model" and "cross_attention_adapter" in v:
                self.vision_num_cross_attention_layers = v["cross_attention_adapter"]["num_layers"]
            else:
                if hasattr(self, k):
                    setattr(self, k, v)

        if self.n_kv_heads is None:
            self.n_kv_heads = self.n_heads
        assert self.n_kv_heads <= self.n_heads
        assert self.n_heads % self.n_kv_heads == 0
        assert self.dim % self.n_heads == 0


================================================
FILE: models/llama3/chat_format.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

import io
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

from PIL import Image as PIL_Image

from ..datatypes import (
    BuiltinTool,
    RawContent,
    RawMediaItem,
    RawMessage,
    RawTextItem,
    Role,
    StopReason,
    ToolCall,
    ToolPromptFormat,
)
from .tokenizer import Tokenizer
from .tool_utils import ToolUtils


@dataclass
class VisionInput:
    mask: List[List[int]]
    images: List[PIL_Image.Image]


@dataclass
class LLMInput:
    tokens: List[int]
    vision: Optional[VisionInput] = None


def role_str(role: Role) -> str:
    role_strs = {
        Role.user: "user",
        Role.system: "system",
        Role.tool: "ipython",  # special
        Role.assistant: "assistant",
    }
    return role_strs[role]


class ChatFormat:
    possible_headers: Dict[Role, str]

    def __init__(self, tokenizer: Tokenizer):
        self.tokenizer = tokenizer

        self.possible_headers = {role: f"<|start_header_id|>{role_str(role)}<|end_header_id|>\n\n" for role in Role}
        self.vision_token = self.tokenizer.special_tokens["<|image|>"]

    def _encode_header(self, role: str) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.special_tokens["<|start_header_id|>"])
        tokens.extend(self.tokenizer.encode("ipython" if role == "tool" else role, bos=False, eos=False))
        tokens.append(self.tokenizer.special_tokens["<|end_header_id|>"])
        tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
        return tokens

    def encode_content(self, content: RawContent) -> LLMInput:
        tokens, images = self._encode_content(content, bos=True)
        return self._model_input_from_tokens_images(tokens, images)

    def _encode_content(self, content: RawContent, bos: bool = False) -> Tuple[List[int], List[PIL_Image.Image]]:
        tokens = []
        images = []

        added_bos = False

        def _process(c):
            nonlocal added_bos, bos

            if isinstance(c, (str, RawTextItem)):
                if isinstance(c, RawTextItem):
                    c = c.text
                tokens.extend(self.tokenizer.encode(c, bos=False if added_bos else bos, eos=False))
                added_bos = True

            elif isinstance(c, RawMediaItem):
                bos = False if added_bos else bos
                if bos:
                    tokens.append(self.tokenizer.special_tokens["<|begin_of_text|>"])
                    added_bos = True
                tokens.append(self.vision_token)

                bytes_io = io.BytesIO(c.data) if isinstance(c.data, bytes) else c.data
                image = PIL_Image.open(bytes_io)
                image = image.convert("RGB")
                images.append(image)

        if isinstance(content, list):
            for c in content:
                _process(c)
        else:
            _process(content)

        return tokens, images

    def encode_message(
        self, message: RawMessage, tool_prompt_format: ToolPromptFormat
    ) -> Tuple[List[int], List[PIL_Image.Image]]:
        tokens = self._encode_header(message.role)
        images = []

        def _process_content(c):
            toks, imgs = self._encode_content(c)
            tokens.extend(toks)
            images.extend(imgs)

        if (
            message.role == "assistant"
            and len(message.tool_calls) > 0
            and message.tool_calls[0].tool_name == BuiltinTool.code_interpreter
        ):
            tokens.append(self.tokenizer.special_tokens["<|python_tag|>"])

        _process_content(message.content)

        if message.role == "user" and message.context is not None:
            # This is RAG context; why is it here in the chat format? I don't think
            # this is needed and can be moved upwards
            _process_content("\n\n")
            _process_content(message.context)

        if message.role == "assistant":
            for t in message.tool_calls:
                content = ToolUtils.encode_tool_call(t, tool_prompt_format)
                _process_content(content)

        eom = False
        if message.role == "assistant":
            eom = message.stop_reason == StopReason.end_of_message

        tokens.append(self.tokenizer.special_tokens["<|eom_id|>" if eom else "<|eot_id|>"])
        return tokens, images

    def encode_dialog_prompt(
        self,
        messages: List[RawMessage],
        tool_prompt_format: Optional[ToolPromptFormat] = None,
    ) -> LLMInput:
        tool_prompt_format = tool_prompt_format or ToolPromptFormat.json
        tokens = []
        images = []
        tokens.append(self.tokenizer.special_tokens["<|begin_of_text|>"])
        for message in messages:
            toks, imgs = self.encode_message(message, tool_prompt_format)
            tokens.extend(toks)
            images.extend(imgs)

        # Add the start of an assistant message for the model to complete.
        tokens.extend(self._encode_header("assistant"))

        return self._model_input_from_tokens_images(tokens, images)

    # TODO(this should be generic, not only for assistant messages)
    def decode_assistant_message(self, tokens: List[int], stop_reason: StopReason) -> RawMessage:
        content = self.tokenizer.decode(tokens)

        return self.decode_assistant_message_from_content(content, stop_reason)

    def decode_assistant_message_from_content(self, content: str, stop_reason: StopReason) -> RawMessage:
        content = content.strip(" ")
        header_str = self.possible_headers[Role.assistant]
        if content.startswith(header_str):
            content = content[len(header_str) :]

        ipython = content.startswith("<|python_tag|>")
        if ipython:
            content = content[len("<|python_tag|>") :]

        if content.endswith("<|eot_id|>"):
            content = content[: -len("<|eot_id|>")]
            stop_reason = StopReason.end_of_turn
        elif content.endswith("<|eom_id|>"):
            content = content[: -len("<|eom_id|>")]
            stop_reason = StopReason.end_of_message

        tool_name = None
        tool_arguments = {}

        custom_tool_info = ToolUtils.maybe_extract_custom_tool_call(content)
        if custom_tool_info is not None:
            tool_name, tool_arguments = custom_tool_info
            # Sometimes, when the agent has custom tools alongside builtin tools,
            # it responds to builtin tool calls in the format of the custom tools.
            # This code tries to handle that case.
            if tool_name in BuiltinTool.__members__:
                tool_name = BuiltinTool[tool_name]
                tool_arguments = {
                    "query": list(tool_arguments.values())[0],
                }
        else:
            builtin_tool_info = ToolUtils.maybe_extract_builtin_tool_call(content)
            if builtin_tool_info is not None:
                tool_name, query = builtin_tool_info
                tool_arguments = {
                    "query": query,
                }
                if tool_name in BuiltinTool.__members__:
                    tool_name = BuiltinTool[tool_name]
            elif ipython:
                tool_name = BuiltinTool.code_interpreter
                tool_arguments = {
                    "code": content,
                }

        tool_calls = []
        if tool_name is not None and tool_arguments is not None:
            call_id = str(uuid.uuid4())
            tool_calls.append(
                ToolCall(
                    call_id=call_id,
                    tool_name=tool_name,
                    arguments=tool_arguments,
                )
            )
            content = ""

        return RawMessage(
            role="assistant",
            content=content,
            stop_reason=stop_reason,
            tool_calls=tool_calls,
        )

    def _model_input_from_tokens_images(self, tokens: List[int], images: List[PIL_Image.Image]) -> LLMInput:
        vision_input = None
        if len(images) > 0:
            vision_input = VisionInput(
                mask=create_vision_mask(tokens, self.vision_token),
                images=images,
            )

        return LLMInput(
            tokens=[128256 if token == self.vision_token else token for token in tokens],
            vision=vision_input,
        )


def create_vision_mask(
    tokens: List[int],
    vision_token: int,
) -> List[List[int]]:
    vision_token_locations = [i for i, token in enumerate(tokens) if token == vision_token]
    if len(vision_token_locations) == 0:
        return []

    if len(vision_token_locations) == 1:
        # only one image present, unmask until end of sequence
        return [[vision_token_locations[0], -1]]
    vision_masks = [[loc1, loc2] for loc1, loc2 in zip(vision_token_locations[:-1], vision_token_locations[1:])]
    # last image will attend to all subsequent text
    vision_masks.append([vision_token_locations[-1], len(tokens)])

    # if there are two or more consecutive vision tokens,
    # they should all attend to all subsequent
    # text present
    last_mask_end = vision_masks[-1][1]
    for vision_mask in vision_masks[::-1]:
        if vision_mask[0] == vision_mask[1] - 1:
            vision_mask[1] = last_mask_end
        last_mask_end = vision_mask[1]
    return vision_masks


================================================
FILE: models/llama3/generation.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

import json
import os
import sys
import time
from pathlib import Path
from packaging import version
from typing import Callable, Generator, List, Optional

import torch
import torch.nn.functional as F
from fairscale.nn.model_parallel.initialize import (
    initialize_model_parallel,
    model_parallel_is_initialized,
)
from termcolor import cprint

from ..checkpoint import maybe_reshard_state_dict
from ..datatypes import GenerationResult, QuantizationMode, RawContent, RawMessage, ToolPromptFormat
from .args import ModelArgs
from .chat_format import ChatFormat, LLMInput
from .model import Transformer
from .multimodal.model import CrossAttentionTransformer
from .tokenizer import Tokenizer


def is_xccl_available():
    if version.parse(torch.__version__).release >= version.parse("2.7").release:
        return torch.distributed.distributed_c10d.is_xccl_available()
    return False


class Llama3:
    @staticmethod
    def build(
        ckpt_dir: str,
        max_seq_len: int,
        max_batch_size: int,
        world_size: Optional[int] = None,
        quantization_mode: Optional[QuantizationMode] = None,
        seed: int = 1,
        device: str = "cuda",
    ):
        device = torch.device(device)
        # Explicit parentheses make the and/or grouping obvious; behavior is unchanged.
        if (device.type == "cuda" and not torch.cuda.is_available()) or (
            device.type == "xpu" and not torch.xpu.is_available()
        ):
            raise RuntimeError(f"PyTorch backend for {device.type} device type is not available")

        if not torch.distributed.is_initialized():
            if device.type == "cuda":
                torch.distributed.init_process_group("nccl")
            elif device.type == "xpu" and is_xccl_available():
                torch.distributed.init_process_group("xccl")
            else:
                torch.distributed.init_process_group("gloo")

        if not model_parallel_is_initialized():
            if world_size is None:
                world_size = int(os.environ.get("WORLD_SIZE", 1))
            initialize_model_parallel(world_size)

        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        if device.type == "cuda":
            torch.cuda.set_device(local_rank)
        elif device.type == "xpu":
            torch.xpu.set_device(local_rank)

        torch.manual_seed(seed)

        if local_rank > 0:
            sys.stdout = open(os.devnull, "w")

        start_time = time.time()

        ckpt_paths = sorted(Path(ckpt_dir).glob("*.pth"))
        assert len(ckpt_paths) > 0, f"no checkpoint files found in {ckpt_dir}"
        print(f"Loading a checkpoint (shards={len(ckpt_paths)}, current-mp-size={world_size})")
        with open(Path(ckpt_dir) / "params.json", "r") as f:
            params = json.loads(f.read())

        model_args: ModelArgs = ModelArgs(
            max_seq_len=max_seq_len,
            max_batch_size=max_batch_size,
            **params,
        )
        tokenizer = Tokenizer.get_instance()

        state_dict = maybe_reshard_state_dict(
            ckpt_paths,
            n_kv_heads=model_args.n_kv_heads if model_args.n_kv_heads else model_args.n_heads,
        )

        assert model_args.vocab_size == tokenizer.n_words

        def build_model():
            if model_args.vision_chunk_size > 0:
                model = CrossAttentionTransformer(model_args)
                model.setup_cache(model_args.max_batch_size, device=device, dtype=torch.get_default_dtype())
            else:
                model = Transformer(model_args)
            return model

        if quantization_mode == QuantizationMode.fp8_mixed or quantization_mode == QuantizationMode.int4_mixed:
            from .quantization.loader import convert_to_quantized_model

            torch.set_default_tensor_type(torch.BFloat16Tensor)
            model = build_model()
            print("Loading state dict...")
            model.load_state_dict(state_dict, strict=False)
            print("Done...")
            model = convert_to_quantized_model(model, ckpt_dir, quantization_mode, device=device)
            torch.set_default_device(device)
        else:
            print(f"Setting default device to {device}")
            torch.set_default_device(device)
            if device.type == "cuda":
                if torch.cuda.is_bf16_supported():
                    torch.set_default_dtype(torch.bfloat16)
                else:
                    torch.set_default_dtype(torch.half)
            elif device.type == "xpu":
                if torch.xpu.is_bf16_supported():
                    torch.set_default_dtype(torch.bfloat16)
                else:
                    torch.set_default_dtype(torch.half)

            model = build_model()
            print("Loading state dict...")
            model.load_state_dict(state_dict, strict=True)
            model.to(device)
            print("Done...")

        print(f"Loaded in {time.time() - start_time:.2f} seconds")

        return Llama3(model, tokenizer, model_args)

    def __init__(self, model: Transformer | CrossAttentionTransformer, tokenizer: Tokenizer, args: ModelArgs):
        self.args = args
        self.model = model
        self.tokenizer = tokenizer
        self.formatter = ChatFormat(tokenizer)

    @torch.inference_mode()
    def generate(
        self,
        model_inputs: List[LLMInput],
        temperature: float = 0.6,
        top_p: float = 0.9,
        max_gen_len: Optional[int] = None,
        logprobs: bool = False,
        echo: bool = False,
        print_model_input: bool = False,
        logits_processor: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None,
    ) -> Generator[List[GenerationResult], None, None]:
        if max_gen_len is None or max_gen_len == 0 or max_gen_len >= self.args.max_seq_len:
            max_gen_len = self.args.max_seq_len - 1
        params = self.model.params

        print_model_input = print_model_input or os.environ.get("LLAMA_MODELS_DEBUG", "0") == "1"
        if print_model_input:
            for inp in model_inputs:
                tokens_to_print = [self.formatter.vision_token if t == 128256 else t for t in inp.tokens]
                cprint(
                    "Input to model:\n" + self.tokenizer.decode(tokens_to_print) + "\n",
                    "red",
                )
        prompt_tokens = [inp.tokens for inp in model_inputs]

        bsz = len(model_inputs)
        assert bsz <= params.max_batch_size, (bsz, params.max_batch_size)

        min_prompt_len = min(len(t) for t in prompt_tokens)
        max_prompt_len = max(len(t) for t in prompt_tokens)

        if max_prompt_len >= params.max_seq_len:
            cprint(f"Out of token budget {max_prompt_len} vs {params.max_seq_len}", "red")
            return

        total_len = min(max_gen_len + max_prompt_len, params.max_seq_len)

        pad_id = self.tokenizer.pad_id
        tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long)
        for k, t in enumerate(prompt_tokens):
            tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long)
        if logprobs:
            token_logprobs = torch.zeros_like(tokens, dtype=torch.float)

        is_vision = not isinstance(self.model, Transformer)
        if is_vision:
            images = [inp.vision.images if inp.vision is not None else [] for inp in model_inputs]
            mask = [inp.vision.mask if inp.vision is not None else [] for inp in model_inputs]

            xattn_caches, cross_attention_masks, full_text_row_masked_out_mask = self.model.compute_vision_tokens_masks(
                batch_images=images,
                batch_masks=mask,
                total_len=total_len,
                device=tokens.device,
            )

        eos_reached = torch.tensor([False] * bsz)
        input_text_mask = tokens != pad_id

        if echo:
            for i in range(max_prompt_len):
                results = []
                for j, t in enumerate(tokens[:, i]):
                    results.append(
                        GenerationResult(
                            token=t.item(),
                            text=self.tokenizer.decode([t.item()]),
                            source="input",
                            logprobs=(token_logprobs[j, i : i + 1].tolist() if logprobs else None),
                            batch_idx=j,
                            finished=False,
                            ignore_token=t.item() == pad_id,
                        )
                    )
                yield results

        stop_tokens = torch.tensor(self.tokenizer.stop_tokens)

        prev_pos = 0
        for cur_pos in range(min_prompt_len, total_len):
            if is_vision:
                position_ids = torch.arange(prev_pos, cur_pos, dtype=torch.long)
                text_only_inference = all(inp.vision is None for inp in model_inputs)
                logits = self.model.forward(
                    position_ids,
                    tokens,
                    cross_attention_masks,
                    full_text_row_masked_out_mask,
                    xattn_caches,
                    text_only_inference,
                )
            else:
                logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)

            if logits_processor is not None:
                logits = logits_processor(tokens[:, :cur_pos], logits)

            if temperature > 0:
                probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
                next_token = sample_top_p(probs, top_p)
            else:
                next_token = torch.argmax(logits[:, -1], dim=-1)

            next_token = next_token.reshape(-1)
            # only replace token if prompt has already been generated
            next_token = torch.where(input_text_mask[:, cur_pos], tokens[:, cur_pos], next_token)
            tokens[:, cur_pos] = next_token

            target = tokens[:, prev_pos + 1 : cur_pos + 1]
            if is_vision:
                # the logits space (num_classes) is designed to never contain a media_token
                # however our input token stream does contain them. we need to nuke them here
                # or else the CUDA kernels will crash with an illegal memory access
                vision_tokens = [self.tokenizer.special_tokens["<|image|>"], 128256]
                masks = [target.eq(t) for t in vision_tokens]
                if len(masks) > 1:
                    mask = torch.logical_or(*masks)
                else:
                    mask = masks[0]
                target[mask] = 0

            if logprobs:
                token_logprobs[:, prev_pos + 1 : cur_pos + 1] = -F.cross_entropy(
                    input=logits.transpose(1, 2),
                    target=target,
                    reduction="none",
                    ignore_index=pad_id,
                )
            eos_reached |= (~input_text_mask[:, cur_pos]) & (torch.isin(next_token, stop_tokens))
            results = []
            for idx, t in enumerate(next_token):
                results.append(
                    GenerationResult(
                        token=t.item(),
                        text=self.tokenizer.decode([t.item()]),
                        source="output",
                        logprobs=(token_logprobs[idx, cur_pos : cur_pos + 1].tolist() if logprobs else None),
                        batch_idx=idx,
                        finished=bool(eos_reached[idx]),
                        ignore_token=cur_pos < len(prompt_tokens[idx]),
                    )
                )
            yield results

            prev_pos = cur_pos
            if all(eos_reached):
                break

    def completion(
        self,
        contents: List[RawContent],
        temperature: float = 0.6,
        top_p: float = 0.9,
        max_gen_len: Optional[int] = None,
        logprobs: bool = False,
        echo: bool = False,
    ) -> Generator[List[GenerationResult], None, None]:
        model_inputs = [self.formatter.encode_content(c) for c in contents]
        for result in self.generate(
            model_inputs=model_inputs,
            temperature=temperature,
            top_p=top_p,
            max_gen_len=max_gen_len,
            logprobs=logprobs,
            echo=echo,
        ):
            yield result
            if all(r.finished for r in result):
                break

    def chat_completion(
        self,
        messages_batch: List[List[RawMessage]],
        temperature: float = 0.6,
        top_p: float = 0.9,
        max_gen_len: Optional[int] = None,
        logprobs: bool = False,
        tool_prompt_format: ToolPromptFormat = ToolPromptFormat.json,
        echo: bool = False,
    ) -> Generator[List[GenerationResult], None, None]:
        model_inputs = [self.formatter.encode_dialog_prompt(messages) for messages in messages_batch]
        for result in self.generate(
            model_inputs=model_inputs,
            temperature=temperature,
            top_p=top_p,
            max_gen_len=max_gen_len,
            logprobs=logprobs,
            echo=echo,
        ):
            yield result
            if all(r.finished for r in result):
                break


def sample_top_p(probs, p):
    """
    Perform top-p (nucleus) sampling on a probability distribution.

    Args:
        probs (torch.Tensor): Probability distribution tensor.
        p (float): Probability threshold for top-p sampling.

    Returns:
        torch.Tensor: Sampled token indices.

    Note:
        Top-p sampling selects the smallest set of tokens whose cumulative probability mass
        exceeds the threshold p. The distribution is renormalized based on the selected tokens.
    """
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    next_token = torch.multinomial(probs_sort, num_samples=1)
    next_token = torch.gather(probs_idx, -1, next_token)
    return next_token
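

# Illustrative sketch (not part of the repository): the masking rule used by
# `sample_top_p` above, restated in pure Python on a toy distribution so the
# kept set can be inspected by hand. The helper name `top_p_filter` and the
# example numbers are assumptions made for this illustration; the real function
# works on batched tensors and then samples with `torch.multinomial`.

```python
from itertools import accumulate


def top_p_filter(probs, p):
    """Return the renormalized nucleus as (index, prob) pairs, most likely first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    sorted_probs = [probs[i] for i in order]
    cumulative = list(accumulate(sorted_probs))
    # Same rule as `probs_sum - probs_sort > p`: a token is dropped when the
    # mass of the tokens ranked before it already exceeds p.
    kept = [(i, q) for i, q, c in zip(order, sorted_probs, cumulative) if c - q <= p]
    total = sum(q for _, q in kept)
    return [(i, q / total) for i, q in kept]


# Toy distribution over four tokens; with p=0.7 only the two most likely survive.
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.7)
```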


================================================
FILE: models/llama3/model.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed in accordance with the terms of the Llama 3 Community License Agreement.

import math
from typing import Optional, Tuple

import fairscale.nn.model_parallel.initialize as fs_init
import torch
import torch.nn.functional as F
from fairscale.nn.model_parallel.layers import (
    ColumnParallelLinear,
    RowParallelLinear,
    VocabParallelEmbedding,
)
from torch import nn

from .args import ModelArgs

# **NOTE**: This code is not runnable without installing `torch` and `fairscale`
# dependencies. These dependencies are not part of the default dependencies
# (requirements.txt) of the `llama-models` package.


class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight


def apply_scaling(freqs: torch.Tensor) -> torch.Tensor:
    # Values obtained from grid search
    scale_factor = 8
    low_freq_factor = 1
    high_freq_factor = 4
    old_context_len = 8192  # original llama3 length

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor

    wavelen = 2 * torch.pi / freqs
    new_freqs = torch.where(wavelen > low_freq_wavelen, freqs / scale_factor, freqs)
    smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
    return torch.where(
        (wavelen >= high_freq_wavelen) & (wavelen <= low_freq_wavelen),
        (1 - smooth) * new_freqs / scale_factor + smooth * new_freqs,
        new_freqs,
    )
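

# Illustrative sketch (not part of the repository): the piecewise rule that
# `apply_scaling` applies to every frequency, written for a single float so
# each branch is easy to follow. The constants are copied from the function
# above; the helper name `scale_one_freq` and the sample wavelengths are
# assumptions made for this illustration.

```python
import math

SCALE_FACTOR = 8
LOW_FREQ_FACTOR = 1
HIGH_FREQ_FACTOR = 4
OLD_CONTEXT_LEN = 8192


def scale_one_freq(freq: float) -> float:
    low_freq_wavelen = OLD_CONTEXT_LEN / LOW_FREQ_FACTOR    # 8192
    high_freq_wavelen = OLD_CONTEXT_LEN / HIGH_FREQ_FACTOR  # 2048
    wavelen = 2 * math.pi / freq
    if wavelen < high_freq_wavelen:
        return freq  # short wavelengths are left unchanged
    if wavelen > low_freq_wavelen:
        return freq / SCALE_FACTOR  # long wavelengths are fully scaled down
    # In between, interpolate linearly between the scaled and unscaled value.
    smooth = (OLD_CONTEXT_LEN / wavelen - LOW_FREQ_FACTOR) / (HIGH_FREQ_FACTOR - LOW_FREQ_FACTOR)
    return (1 - smooth) * freq / SCALE_FACTOR + smooth * freq


short = 2 * math.pi / 1024    # wavelength 1024: unchanged
mid = 2 * math.pi / 4096      # wavelength 4096: smooth = 1/3, result = freq * 5/12
long_ = 2 * math.pi / 16384   # wavelength 16384: divided by 8
```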


def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0, use_scaled: bool = False):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device, dtype=torch.float32)
    if use_scaled:
        freqs = apply_scaling(freqs)
    freqs = torch.outer(t, freqs)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64
    return freqs_cis


def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
    ndim = x.ndim
    assert ndim > 1
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)


def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)


def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
    bs, slen, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x
    return (
        x[:, :, :, None, :]
        .expand(bs, slen, n_kv_heads, n_rep, head_dim)
        .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
    )


class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
        world_size = fs_init.get_model_parallel_world_size()
        self.n_local_heads = args.n_heads // world_size
        self.n_local_kv_heads = self.n_kv_heads // world_size
        self.n_rep = self.n_local_heads // self.n_local_kv_heads
        self.head_dim = args.dim // args.n_heads

        self.wq = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wk = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wv = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wo = RowParallelLinear(
            args.n_heads * self.head_dim,
            args.dim,
            bias=False,
            input_is_parallel=True,
            init_method=lambda x: x,
        )

        self.cache_k = torch.zeros(
            (
                args.max_batch_size,
                args.max_seq_len,
                self.n_local_kv_heads,
                self.head_dim,
            )
        )
        self.cache_v = torch.zeros(
            (
                args.max_batch_size,
                args.max_seq_len,
                self.n_local_kv_heads,
                self.head_dim,
            )
        )

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        bsz, seqlen, _ = x.shape
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)

        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)

        xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)

        self.cache_k = self.cache_k.to(xq)
        self.cache_v = self.cache_v.to(xq)

        self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
        self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv

        keys = self.cache_k[:bsz, : start_pos + seqlen]
        values = self.cache_v[:bsz, : start_pos + seqlen]

        # repeat k/v heads if n_kv_heads < n_heads
        keys = repeat_kv(keys, self.n_rep)  # (bs, cache_len + seqlen, n_local_heads, head_dim)
        values = repeat_kv(values, self.n_rep)  # (bs, cache_len + seqlen, n_local_heads, head_dim)

        xq = xq.transpose(1, 2)  # (bs, n_local_heads, seqlen, head_dim)
        keys = keys.transpose(1, 2)  # (bs, n_local_heads, cache_len + seqlen, head_dim)
        values = values.transpose(1, 2)  # (bs, n_local_heads, cache_len + seqlen, head_dim)
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores + mask  # (bs, n_local_heads, seqlen, cache_len + seqlen)
        scores = F.softmax(scores.float(), dim=-1).type_as(xq)
        output = torch.matmul(scores, values)  # (bs, n_local_heads, seqlen, head_dim)
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
        return self.wo(output)


class FeedForward(nn.Module):
    def __init__(
        self,
        dim: int,
        hidden_dim: int,
        multiple_of: int,
        ffn_dim_multiplier: Optional[float],
    ):
        super().__init__()
        hidden_dim = int(2 * hidden_dim / 3)
        # custom dim factor multiplier
        if ffn_dim_multiplier is not None:
            hidden_dim = int(ffn_dim_multiplier * hidden_dim)
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

        self.w1 = ColumnParallelLinear(dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x)
        self.w2 = RowParallelLinear(hidden_dim, dim, bias=False, input_is_parallel=True, init_method=lambda x: x)
        self.w3 = ColumnParallelLinear(dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
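

# Illustrative sketch (not part of the repository): the hidden-size arithmetic
# from `FeedForward.__init__`, combined with the `4 * args.dim` that
# `TransformerBlock` passes in. The helper name `ffn_hidden_dim` and the
# example settings below are assumptions chosen for this illustration, not
# values read from any `params.json`.

```python
from typing import Optional


def ffn_hidden_dim(dim: int, multiple_of: int, ffn_dim_multiplier: Optional[float] = None) -> int:
    hidden_dim = 4 * dim                  # value passed in by TransformerBlock
    hidden_dim = int(2 * hidden_dim / 3)  # scale by 2/3, as in FeedForward.__init__
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # Round up to the next multiple of `multiple_of`.
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)


plain = ffn_hidden_dim(dim=4096, multiple_of=256)                             # 10922 -> 11008
scaled = ffn_hidden_dim(dim=4096, multiple_of=1024, ffn_dim_multiplier=1.3)   # 14198 -> 14336
```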


class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim,
            hidden_dim=4 * args.dim,
            multiple_of=args.multiple_of,
            ffn_dim_multiplier=args.ffn_dim_multiplier,
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward(self.ffn_norm(h))
        return out


class Transformer(nn.Module):
    def __init__(self, params: ModelArgs):
        super().__init__()
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers

        self.tok_embeddings = VocabParallelEmbedding(params.vocab_size, params.dim, init_method=lambda x: x)

        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))

        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = ColumnParallelLinear(params.dim, params.vocab_size, bias=False, init_method=lambda x: x)

        self.freqs_cis = precompute_freqs_cis(
            params.dim // params.n_heads,
            params.max_seq_len * 2,
            params.rope_theta,
            params.use_scaled_rope,
        )

    @torch.inference_mode()
    def forward(self, tokens: torch.Tensor, start_pos: int):
        _bsz, seqlen = tokens.shape
        h = self.tok_embeddings(tokens)
        self.freqs_cis = self.freqs_cis.to(h.device)
        freqs_cis = self.freqs_cis[start_pos : start_pos + seqlen]

        mask = None
        if seqlen > 1:
            mask = torch.full((seqlen, seqlen), float("-inf"), device=tokens.device)

            mask = torch.triu(mask, diagonal=1)

            # https://github.com/pytorch/pytorch/issues/100005
            # torch.triu is buggy when the device is mps: filled values are
            # nan instead of 0.
            if mask.device.type == torch.device("mps").type:
                mask = torch.nan_to_num(mask, nan=0.0)

            # When performing key-value caching, we compute the attention scores
            # only for the new sequence. Thus, the matrix of scores is of size
            # (seqlen, cache_len + seqlen), and the only masked entries are (i, j) for
            # j > cache_len + i, since row i corresponds to token cache_len + i.
            mask = torch.hstack([torch.zeros((seqlen, start_pos), device=tokens.device), mask]).type_as(h)

        for layer in self.layers:
            h = layer(h, start_pos, freqs_cis, mask)
        h = self.norm(h)
        output = self.output(h).float()
        return output
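

# Illustrative sketch (not part of the repository): the shape and contents of
# the attention mask that `Transformer.forward` builds when `seqlen > 1`,
# reproduced with plain lists. The helper name `causal_mask_with_cache` is an
# assumption made for this illustration; the real code uses `torch.triu` and
# `torch.hstack`.

```python
NEG_INF = float("-inf")


def causal_mask_with_cache(seqlen: int, start_pos: int):
    """Rows are new tokens; columns are cached positions followed by new positions."""
    mask = []
    for i in range(seqlen):
        cached = [0.0] * start_pos  # every cached token is visible
        # New token i may attend to new tokens 0..i; later ones are masked.
        new = [0.0 if j <= i else NEG_INF for j in range(seqlen)]
        mask.append(cached + new)
    return mask


# Three new tokens appended after two cached ones give a 3 x 5 mask.
mask = causal_mask_with_cache(seqlen=3, start_pos=2)
```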


================================================
FILE: models/llama3/multimodal/__init__.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.


================================================
FILE: models/llama3/multimodal/encoder_utils.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

# Copyright (c) Meta Platforms, Inc. and its affiliates.
import math
from logging import getLogger

import torch
import torch.nn.functional as F

from .utils import get_negative_inf_value, to_2tuple

logger = getLogger()


def resize_local_position_embedding(orig_pos_embed, grid_size):
    """
    Resize position embedding for vision encoder.
    Original position embedding is [n_tiles * n_tiles + 1, dim]
    New position embedding will be [grid_size[0] * grid_size[1] + 1, dim]
    """
    new_grid_size = to_2tuple(grid_size)
    orig_grid_size = to_2tuple(int(math.sqrt(len(orig_pos_embed) - 1)))
    new_seq_len = new_grid_size[0] * new_grid_size[1] + 1

    new_pos_emb_tok, new_pos_emb_img = (
        orig_pos_embed[:1],
        orig_pos_embed[1:],
    )
    logger.info(f"resizing position embedding grid-size from {orig_grid_size} to {new_grid_size}")

    new_pos_emb_img = new_pos_emb_img.reshape(1, orig_grid_size[0], orig_grid_size[1], -1).permute(0, 3, 1, 2)

    new_pos_emb_img = F.interpolate(
        new_pos_emb_img,
        size=new_grid_size,
        mode="bilinear",
        align_corners=True,
    )
    new_pos_emb_img = new_pos_emb_img.permute(0, 2, 3, 1).reshape(1, new_grid_size[0] * new_grid_size[1], -1)[0]
    new_pos_embed = torch.cat([new_pos_emb_tok, new_pos_emb_img], dim=0)
    return new_pos_embed


def initialize_global_position_embedding_from_local(pos_and_cls_embed, grid_size, x_scale, y_scale):
    """
    Takes a local position embedding for vision encoder and uses it
    to initialize the global position embedding.
    Input: local position embedding of shape [grid_size[0] * grid_size[1] + 1, dim]
    Returns: global position embedding of shape [x_scale, y_scale, grid_size[0] * grid_size[1] + 1, dim]
    Here x_scale and y_scale are the number of tiles along x-axis and y-axis respectively.
    """
    pos_embed = pos_and_cls_embed[1:]
    cls_embed = pos_and_cls_embed[0].view(1, 1, 1, -1)
    grid_size = to_2tuple(grid_size)
    new_pos_emb_img = pos_embed.reshape(1, grid_size[0], grid_size[1], -1).permute(0, 3, 1, 2)
    new_grid_size = (x_scale * grid_size[0], y_scale * grid_size[1])
    new_pos_emb_img = F.interpolate(
        new_pos_emb_img,
        size=new_grid_size,
        mode="bilinear",
        align_corners=True,
    )
    new_pos_emb_img = new_pos_emb_img.permute(0, 2, 3, 1)
    new_pos_emb_img = new_pos_emb_img.view(x_scale, grid_size[0], y_scale, grid_size[1], -1)
    new_pos_emb_img = new_pos_emb_img.permute(0, 2, 1, 3, 4).contiguous()
    new_pos_emb_img = new_pos_emb_img.reshape(x_scale, y_scale, grid_size[0] * grid_size[1], -1)
    cls_embed = cls_embed.expand(x_scale, y_scale, -1, -1)
    pos_and_cls_embed = torch.cat([cls_embed, new_pos_emb_img], dim=2)
    return pos_and_cls_embed


def resize_global_position_embedding(pos_and_cls_embed, grid_size, x_scale, y_scale):
    """
    Takes a global position embedding for vision encoder and resizes it to new size.
    Input: global position embedding of shape [x_old, y_old, old_grid_size[0] * old_grid_size[1] + 1, dim]
    Returns: global position embedding of shape [x_scale, y_scale, grid_size[0] * grid_size[1] + 1, dim]
    Here x_scale and y_scale are the number of tiles along x-axis and y-axis respectively.
    """
    # first remove cls token
    pos_embed = pos_and_cls_embed[:, :, 1:]
    cls_embed = pos_and_cls_embed[:, :, 0].unsqueeze(2)

    xs_old, ys_old, ntok, dim = pos_embed.shape
    old_grid_size = int(math.sqrt(ntok))

    # move to correct form for interpolation
    pos_embed = pos_embed.view(xs_old, ys_old, old_grid_size, old_grid_size, dim)
    pos_embed = pos_embed.permute(0, 2, 1, 3, 4).contiguous()
    pos_embed = pos_embed.view(xs_old * old_grid_size, ys_old * old_grid_size, dim)
    pos_embed = pos_embed.unsqueeze(0)

    # interpolate
    new_size = (grid_size[0] * x_scale, grid_size[1] * y_scale)
    pos_embed = pos_embed.permute(0, 3, 1, 2)
    pos_embed_resized = F.interpolate(
        pos_embed,
        size=new_size,
        mode="bilinear",
        align_corners=True,
    )
    pos_embed = pos_embed_resized.permute(0, 2, 3, 1)[0]

    # move it back in place
    pos_embed = pos_embed.view(x_scale, grid_size[0], y_scale, grid_size[1], dim)
    pos_embed = pos_embed.permute(0, 2, 1, 3, 4).contiguous()
    pos_embed = pos_embed.view(x_scale, y_scale, grid_size[0] * grid_size[1], dim)

    # interpolate cls token
    cls_embed = cls_embed.permute(2, 3, 0, 1)
    cls_embed_resized = F.interpolate(
        cls_embed,
        size=(x_scale, y_scale),
        mode="bilinear",
        align_corners=True,
    )
    cls_embed = cls_embed_resized.permute(2, 3, 0, 1)
    # add cls token back in
    pos_and_cls_embed = torch.cat([cls_embed, pos_embed], dim=2)

    return pos_and_cls_embed


def build_encoder_attention_mask(
    x: torch.Tensor,
    ar: torch.Tensor,
    ntok: int,
    num_chunks: int,
    n_heads: int,
):
    """
    Build vision encoder attention mask that omits padding tokens.
    """
    masks = []
    for arx in ar:
        mask_i = torch.ones((num_chunks, x.shape[2], 1), dtype=x.dtype)
        mask_i[: arx[0] * arx[1], :ntok] = 0
        mask_i = mask_i.view(num_chunks * x.shape[2], -1)
        mask_i = mask_i @ mask_i.T * get_negative_inf_value(x.dtype)
        mask_i = mask_i.unsqueeze(0)
        masks.append(mask_i)
    masks = torch.stack(masks).to(x.device).expand(-1, n_heads, -1, -1)
    return masks


def expand_num_tokens_to_mult8(x):
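    # Note: this value is always in 1..8, so an input whose token count is already a
    # multiple of 8 still receives 8 padding tokens and the `== 0` branch is never taken.
    num_pad_tokens = 8 - (x.shape[-2] % 8)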
    if num_pad_tokens == 0:
        return x, 0
    else:
        return (
            torch.cat(
                [
                    x,
                    torch.zeros(
                        (x.shape[0], x.shape[1], num_pad_tokens, x.shape[-1]),
                        dtype=x.dtype,
                        device=x.device,
                    ),
                ],
                dim=-2,
            ),
            num_pad_tokens,
        )


def contract_num_tokens_from_mult8(x, num_pad_tokens):
    if num_pad_tokens == 0:
        return x
    return x[:, :, :-num_pad_tokens]
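

# Illustrative sketch (not part of the repository): the padding count computed
# by `expand_num_tokens_to_mult8`, next to a hypothetical alternative that adds
# nothing when the length is already a multiple of 8. Both helper names are
# assumptions made for this illustration; the comparison only shows how the
# formula behaves and is not a suggestion that the repository should change.

```python
def pad_count_as_written(num_tokens: int) -> int:
    # Mirrors `8 - (x.shape[-2] % 8)` from expand_num_tokens_to_mult8.
    return 8 - (num_tokens % 8)


def pad_count_minimal(num_tokens: int) -> int:
    # Hypothetical alternative: smallest non-negative pad reaching a multiple of 8.
    return (-num_tokens) % 8


# Map each token count to (as written, minimal).
counts = {n: (pad_count_as_written(n), pad_count_minimal(n)) for n in (13, 16, 1032)}
```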


================================================
FILE: models/llama3/multimodal/image_transform.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# top-level folder for each specific model found within the models/ directory at
# the top-level of this source tree.

import math

from collections import defaultdict

from logging import getLogger

from typing import Any, Optional, Set, Tuple

import torch
import torchvision.transforms as tv
from PIL import Image
from torchvision.transforms import functional as F

IMAGE_RES = 224

logger = getLogger()


class VariableSizeImageTransform(object):
    """
    This class accepts images of any size and dynamically resizes, pads and chunks them
    based on the image aspect ratio and the number of image chunks we allow.

    The algorithm will NOT distort the image to fit a certain aspect ratio, because
    that leads to a significant degradation in image quality.

    It can be summarized in 6 steps:
    1. Find all possible canvas combinations of max_num_chunks;
    2. Find the best canvas to fit the image;
    3. Resize without distortion
    4. Pad
    5. Normalize
    6. Chunk

    For example, if an input image is of size 300x800, patch_size of 224,
    and max_num_chunks = 8, it will find the closest aspect ratio that
    is allowed within 8 image chunks, with some restrictions.
    In this case, 2:4 = 2 horizontal patches and 4 vertical patches,
    giving a total of 8 chunks.

    If resize_to_max_canvas is set, the image will be resized (without distortion)
    to the largest possible resolution. In this case, 336:896, and padded to 448:896,
    where we maintain the original aspect ratio and pad the rest with zeros.
    This approach minimizes the amount of padding required for any arbitrary resolution.

    However, if limit_upscaling_to_patch_size is set to True,
    the upscaling will be limited to the patch size. In the example above,
    the image would remain 300x800 (no upscaling), and then padded to 448:896.

    The final output will therefore be of shape (8, 3, 224, 224), where 2x4
    patches are coming from the resizing and chunking.
    """

    def __init__(self, size: int = IMAGE_RES) -> None:
        self.size = size
        logger.info(f"VariableSizeImageTransform size: {self.size}")
        self.to_tensor = tv.ToTensor()
        self._mean = (0.48145466, 0.4578275, 0.40821073)
        self._std = (0.26862954, 0.26130258, 0.27577711)
        self.normalize = tv.Normalize(
            mean=self._mean,
            std=self._std,
            inplace=True,
        )
        self.resample = tv.InterpolationMode.BILINEAR

    @staticmethod
    def get_factors(n: int) -> Set[int]:
        """
        Calculate all factors of a given number, i.e. the divisors that leave
        no remainder. For example, if n=12, it will return {1, 2, 3, 4, 6, 12}.

        Args:
            n (int): The number to find factors for.

        Returns:
            set: A set containing all factors of the number.
        """
        factors_set = set()

        for i in range(1, int(n**0.5) + 1):
            if n % i == 0:
                factors_set.add(i)
                factors_set.add(n // i)
        return factors_set

    def find_supported_resolutions(self, max_num_chunks: int, patch_size: int) -> torch.Tensor:
        """
        Computes all of the allowed resolutions for a fixed number of chunks
        and patch_size. Useful when dividing an image into chunks.

        Args:
            max_num_chunks (int): Maximum number of chunks for processing.
            patch_size (int): Size of the side of the patch.

        Returns:
            torch.Tensor: List of possible resolutions as tuples (height, width).

        Example:
            >>> max_num_chunks = 4
            >>> patch_size = 224
            >>> find_supported_resolutions(max_num_chunks, patch_size)
            tensor([(224, 896), (448, 448), (224, 224), (896, 224), (224, 672),
            (672, 224), (224, 448), (448, 224)])

            Given max_num_chunks=4, patch_size=224, it will create a dictionary:
            {
            0.25: [(1, 4)],
            1.0: [(2, 2), (1, 1)],
            4.0: [(4, 1)],
            0.33: [(1, 3)],
            3.0: [(3, 1)],
            0.5: [(1, 2)],
            2.0: [(2, 1)]
            }

            and return the resolutions multiplied by the patch_size:
            [(1*224, 4*224), (2*224, 2*224), ..., (2*224, 1*224)]
        """
        asp_dict = defaultdict(list)
        for chunk_size in range(max_num_chunks, 0, -1):
            _factors = sorted(self.get_factors(chunk_size))
            _asp_ratios = [(factor, chunk_size // factor) for factor in _factors]
            for height, width in _asp_ratios:
                ratio_float = height / width
                asp_dict[ratio_float].append((height, width))

        # get the resolutions multiplied by the patch_size
        possible_resolutions = []
        for resolutions in asp_dict.values():
            for height, width in resolutions:
                possible_resolutions.append((height * patch_size, width * patch_size))

        return possible_resolutions

    @staticmethod
    def get_max_res_without_distortion(
        image_size: Tuple[int, int],
        target_size: Tuple[int, int],
    ) -> Tuple[int, int]:
        """
        Determines the maximum resolution to which an image can be resized to without distorting its
        aspect ratio, based on the target resolution.

        Args:
            image_size (Tuple[int, int]): The original resolution of the image (height, width).
            target_size (Tuple[int, int]): The desired resolution to fit the image into (height, width).
        Returns:
            Tuple[int, int]: The optimal dimensions (height, width) to which the image should be resized.
        Example:
            >>> get_max_res_without_distortion([200, 300], target_size = [450, 200])
            (133, 200)
            >>> get_max_res_without_distortion([800, 600], target_size = [450, 1300])
            (450, 337)
        """

        original_width, original_height = image_size
        target_width, target_height = target_size

        scale_w = target_width / original_width
        scale_h = target_height / original_height

        if scale_w < scale_h:
            new_width = target_width
            new_height = min(math.floor(original_height * scale_w), target_height)
        else:
            new_height = target_height
            new_width = min(math.floor(original_width * scale_h), target_width)

        return new_width, new_height

    def _pad(self, image: Image.Image, target_size) -> Image.Image:
        new_width, new_height = target_size
        new_im = Image.new(mode="RGB", size=(new_width, new_height), color=(0, 0, 0))  # type: ignore
        new_im.paste(image)
        return new_im

    def _split(self, image: torch.Tensor, ncw: int, nch: int) -> torch.Tensor:
        # Split image into number of required tiles (width x height)
        num_channels, height, width = image.size()
        image = image.view(num_channels, nch, height // nch, ncw, width // ncw)
        # Permute dimensions to reorder the axes
        image = image.permute(1, 3, 0, 2, 4).contiguous()
        # Reshape into the desired output shape (ncw * nch, num_channels, height // nch, width // ncw)
        image = image.view(ncw * nch, num_channels, height // nch, width // ncw)
        return image

    def resize_without_distortion(
        self,
        image: torch.Tensor,
        target_size: Tuple[int, int],
        max_upscaling_size: Optional[int],
    ) -> torch.Tensor:
        """
        Used to resize an image to target_size, without distortion.

        If target_size requires upscaling the image, the user can set max_upscaling_size to
        limit the upscaling to a maximum size. In this case, since we rescale without distortion,
        modifying target_size works as a boundary for the image's largest side.

        Args:
            image: The image to resize. Its size is read as (width, height).
            target_size (Tuple[int, int]): The size to fit the image into, as (width, height).
            max_upscaling_size (int): The maximum size to upscale the image to.
                If None, there is no limit.
        Examples:
        >>> target_size = (1000, 1200)
        >>> max_upscaling_size = 600
        >>> image_size = (400, 200)
        >>> resize_without_distortion(image_size, target_size, max_upscaling_size)
        (600, 300)  # new_size_without_distortion

        >>> target_size = (1000, 1200)
        >>> max_upscaling_size = 600
        >>> image_size = (2000, 200)
        >>> resize_without_distortion(image_size, target_size, max_upscaling_size)
        (1000, 100)  # new_size_without_distortion

        >>> target_size = (1000, 1200)
        >>> max_upscaling_size = 2000
        >>> image_size = (400, 200)
        >>> resize_without_distortion(image_size, target_size, max_upscaling_size)
        (1000, 500)  # new_size_without_distortion

        >>> target_size = (1000, 1200)
        >>> max_upscaling_size = None
        >>> image_size = (400, 200)
        >>> resize_without_distortion(image_size, target_size, max_upscaling_size)
        (1000, 500)  # new_size_without_distortion
        """

        image_width, image_height = image.size
        image_size = (image_width, image_height)

        # If target_size requires upscaling, we might want to limit the upscaling to max_upscaling_size
        if max_upscaling_size is not None:
            new_target_width = min(max(image_width, max_upscaling_size), target_size[0])
            new_target_height = min(max(image_height, max_upscaling_size), target_size[1])
            target_size = (new_target_width, new_target_height)

        # resize to target_size while preserving aspect ratio
        new_size_without_distortion = self.get_max_res_without_distortion(image_size, target_size)

        image = F.resize(
            image,
            (new_size_without_distortion[1], new_size_without_distortion[0]),
            interpolation=self.resample,
        )

        return image

    def get_best_fit(
        self,
        image_size: Tuple[int, int],
        possible_resolutions: torch.Tensor,
        resize_to_max_canvas: bool = False,
    ) -> Tuple[int, int]:
        """
        Determines the best canvas, from a list of possible resolutions, to resize an image to
        without distortion.

        For each possible resolution, calculates the scaling factors for
        width and height, and selects the smallest one, which is the limiting side.
        E.g. to match the canvas you can upscale height by 2x, and width by 1.5x,
        therefore, the maximum upscaling you can do is min(2, 1.5) = 1.5.

        If upscaling is possible (any of the scaling factors is greater than 1),
        then picks the smallest upscaling factor > 1, unless resize_to_max_canvas is True.

        If upscaling is not possible, then picks the largest scaling factor <= 1, i.e.
        reduce downscaling as much as possible.

        If there are multiple resolutions with the same max scale, we pick the one with the lowest area,
        to minimize padding. E.g., the same image can be upscaled to 224x224 and 224x448, but the latter
        has more padding.

        Args:
            image_size (Tuple[int, int]): A tuple containing the height and width of the image.
            possible_resolutions (torch.Tensor): A tensor of shape (N, 2) where each
                row represents a possible resolution (height, width).
            resize_to_max_canvas (bool): If True, will return the largest upscaling resolution.

        Returns:
            List[int]: The best resolution [height, width] for the given image.

        Example:
            >>> image_size = (200, 300)
            >>> possible_resolutions = torch.tensor([[224, 672],
            ...                                     [672, 224],
            ...                                     [224, 448],
            ...                                     [448, 224],
            ...                                     [224, 224]])
            >>> get_best_fit(image_size, possible_resolutions)
            [224, 448]

            We have:
                scale_w = tensor([2.2400, 0.7467, 1.4933, 0.7467, 0.7467])
                scale_h = tensor([1.1200, 3.3600, 1.1200, 2.2400, 1.1200])
                scales = tensor([1.1200, 0.7467, 1.1200, 0.7467, 0.7467])
            Two of the scales are >= 1:
                upscaling_possible = tensor([1.1200, 1.1200])
                smallest_rescale = tensor(1.1200)
            So we pick the resolution with the smallest area:
                areas = tensor([150528, 100352]) # [224, 672], [224, 448]
                optimal_canvas = tensor([224, 448])
        """

        original_width, original_height = image_size

        # get all possible resolutions heights/widths
        target_widths, target_heights = (
            possible_resolutions[:, 0],
            possible_resolutions[:, 1],
        )

        # get scaling factors to resize the image without distortion
        scale_w = target_widths / original_width
        scale_h = target_heights / original_height

        # get the min scale between width and height (limiting side -> no distortion)
        scales = torch.where(scale_w > scale_h, scale_h, scale_w)

        # filter only scales that allow upscaling
        upscaling_options = scales[scales >= 1]
        if len(upscaling_options) > 0:
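
The preview of `get_best_fit` is cut off at this point. As a rough illustration of the first three steps listed in the class docstring (enumerate the canvases, pick the best fit, compute the resize target), here is a minimal standalone sketch in plain Python. The helper names `supported_canvases`, `best_fit_sketch` and `fit_inside` are hypothetical and do not exist in the repository. Sizes are taken as (height, width) throughout, and the 300x800 example is read as width 300 and height 800; both are assumptions. Exact `Fraction` and integer arithmetic stand in for the tensor and float operations of the original, so this approximates the documented behavior and is not the actual implementation.

```python
from fractions import Fraction
from typing import List, Tuple

Size = Tuple[int, int]  # (height, width), an assumption made for this sketch


def supported_canvases(max_num_chunks: int, patch_size: int) -> List[Size]:
    """List every canvas built from at most max_num_chunks square tiles."""
    canvases = []
    for num_chunks in range(max_num_chunks, 0, -1):
        for rows in range(1, num_chunks + 1):
            if num_chunks % rows == 0:  # rows x cols must use exactly num_chunks tiles
                cols = num_chunks // rows
                canvases.append((rows * patch_size, cols * patch_size))
    return canvases


def best_fit_sketch(
    image_size: Size, canvases: List[Size], resize_to_max_canvas: bool = False
) -> Size:
    """Pick a canvas following the rule described in the get_best_fit docstring."""
    img_h, img_w = image_size
    # The limiting side decides how far the image can scale without distortion.
    scales = [min(Fraction(h, img_h), Fraction(w, img_w)) for h, w in canvases]
    upscaling = [s for s in scales if s >= 1]
    if upscaling:
        # Smallest upscale by default, largest when filling the max canvas.
        chosen = max(upscaling) if resize_to_max_canvas else min(upscaling)
    else:
        # No canvas can hold the image as-is: downscale as little as possible.
        chosen = max(scales)
    tied = [c for c, s in zip(canvases, scales) if s == chosen]
    # Among equal scales, the smallest area needs the least padding.
    return min(tied, key=lambda c: c[0] * c[1])


def fit_inside(image_size: Size, canvas: Size) -> Size:
    """Largest size with the image's aspect ratio that fits inside canvas."""
    img_h, img_w = image_size
    can_h, can_w = canvas
    if can_h * img_w < can_w * img_h:  # height is the limiting side
        return can_h, min(img_w * can_h // img_h, can_w)
    return min(img_h * can_w // img_w, can_h), can_w


canvas = best_fit_sketch((800, 300), supported_canvases(8, 224))
print(canvas)                          # (896, 448)
print(fit_inside((800, 300), canvas))  # (896, 336)
```

The remaining steps (pad, normalize, chunk) operate on pixel data and are left out here. Breaking ties by area mirrors the docstring's stated goal of minimizing padding.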

SYMBOL INDEX (503 symbols across 50 files)

FILE: models/checkpoint.py
  function map_mp_rank (line 18) | def map_mp_rank(old_mp_size: int, new_mp_size: int, new_mp_rank: int) ->...
  function maybe_reshard_state_dict (line 34) | def maybe_reshard_state_dict(
  function reshard_mp (line 100) | def reshard_mp(
  function convert_moe_weights (line 158) | def convert_moe_weights(state_dict: Dict[str, Any], num_experts: int) ->...

FILE: models/cli/describe.py
  class Describe (line 16) | class Describe(Subcommand):
    method __init__ (line 19) | def __init__(self, subparsers: argparse._SubParsersAction):
    method _add_arguments (line 30) | def _add_arguments(self):
    method _run_model_describe_cmd (line 39) | def _run_model_describe_cmd(self, args: argparse.Namespace) -> None:

FILE: models/cli/download.py
  class Download (line 37) | class Download(Subcommand):
    method __init__ (line 40) | def __init__(self, subparsers: argparse._SubParsersAction):
  function setup_download_parser (line 51) | def setup_download_parser(parser: argparse.ArgumentParser) -> None:
  class DownloadTask (line 101) | class DownloadTask:
  class DownloadError (line 111) | class DownloadError(Exception):
  class CustomTransferSpeedColumn (line 115) | class CustomTransferSpeedColumn(TransferSpeedColumn):
    method render (line 116) | def render(self, task):
  class ParallelDownloader (line 122) | class ParallelDownloader:
    method __init__ (line 123) | def __init__(
    method retry_with_exponential_backoff (line 148) | async def retry_with_exponential_backoff(self, task: DownloadTask, fun...
    method get_file_info (line 165) | async def get_file_info(self, client: httpx.AsyncClient, task: Downloa...
    method verify_file_integrity (line 195) | def verify_file_integrity(self, task: DownloadTask) -> bool:
    method download_chunk (line 200) | async def download_chunk(self, client: httpx.AsyncClient, task: Downlo...
    method prepare_download (line 223) | async def prepare_download(self, task: DownloadTask) -> None:
    method download_file (line 230) | async def download_file(self, task: DownloadTask) -> None:
    method has_disk_space (line 266) | def has_disk_space(self, tasks: list[DownloadTask]) -> bool:
    method download_all (line 286) | async def download_all(self, tasks: list[DownloadTask]) -> None:
  function _hf_download (line 318) | def _hf_download(
  function _meta_download (line 358) | def _meta_download(
  class ModelEntry (line 394) | class ModelEntry(BaseModel):
  class Manifest (line 401) | class Manifest(BaseModel):
  function _download_from_manifest (line 406) | def _download_from_manifest(manifest_file: str, max_concurrent_downloads...
  function run_download_cmd (line 448) | def run_download_cmd(args: argparse.Namespace, parser: argparse.Argument...

FILE: models/cli/list.py
  function _get_model_size (line 19) | def _get_model_size(model_dir):
  function _convert_to_model_descriptor (line 23) | def _convert_to_model_descriptor(model):
  function _run_model_list_downloaded_cmd (line 30) | def _run_model_list_downloaded_cmd() -> None:
  class List (line 54) | class List(Subcommand):
    method __init__ (line 57) | def __init__(self, subparsers: argparse._SubParsersAction):
    method _add_arguments (line 68) | def _add_arguments(self):
    method _run_model_list_cmd (line 87) | def _run_model_list_cmd(self, args: argparse.Namespace) -> None:

FILE: models/cli/llama.py
  class LlamaModelsCLIParser (line 19) | class LlamaModelsCLIParser:
    method __init__ (line 22) | def __init__(self):
    method parse_args (line 45) | def parse_args(self) -> argparse.Namespace:
    method run (line 51) | def run(self, args: argparse.Namespace) -> None:
  function main (line 55) | def main():

FILE: models/cli/prompt_format.py
  class PromptFormat (line 20) | class PromptFormat(Subcommand):
    method __init__ (line 23) | def __init__(self, subparsers: argparse._SubParsersAction):
    method _add_arguments (line 40) | def _add_arguments(self):
    method _run_model_template_cmd (line 55) | def _run_model_template_cmd(self, args: argparse.Namespace) -> None:
  function render_markdown_to_pager (line 112) | def render_markdown_to_pager(markdown_content: str):

FILE: models/cli/remove.py
  class Remove (line 17) | class Remove(Subcommand):
    method __init__ (line 20) | def __init__(self, subparsers: argparse._SubParsersAction):
    method _add_arguments (line 31) | def _add_arguments(self):
    method _run_model_remove_cmd (line 45) | def _run_model_remove_cmd(self, args: argparse.Namespace) -> None:

FILE: models/cli/safety_models.py
  class PromptGuardModel (line 16) | class PromptGuardModel(BaseModel):
    method descriptor (line 28) | def descriptor(self) -> str:
  function prompt_guard_model_skus (line 34) | def prompt_guard_model_skus():
  function prompt_guard_model_sku_map (line 48) | def prompt_guard_model_sku_map() -> dict[str, Any]:
  function prompt_guard_download_info_map (line 52) | def prompt_guard_download_info_map() -> dict[str, LlamaDownloadInfo]:

FILE: models/cli/subcommand.py
  class Subcommand (line 9) | class Subcommand:
    method __init__ (line 12) | def __init__(self, *args, **kwargs):
    method create (line 16) | def create(cls, *args, **kwargs):
    method _add_arguments (line 19) | def _add_arguments(self):

FILE: models/cli/table.py
  function print_table (line 14) | def print_table(rows, headers=None, separate_rows: bool = False, sort_by...

FILE: models/cli/utils.py
  function print_subcommand_description (line 9) | def print_subcommand_description(parser, subparsers):

FILE: models/cli/verify_download.py
  class VerificationResult (line 21) | class VerificationResult:
  class VerifyDownload (line 29) | class VerifyDownload(Subcommand):
    method __init__ (line 32) | def __init__(self, subparsers: argparse._SubParsersAction):
  function setup_verify_download_parser (line 43) | def setup_verify_download_parser(parser: argparse.ArgumentParser) -> None:
  function calculate_sha256 (line 52) | def calculate_sha256(filepath: Path, chunk_size: int = 8192) -> str:
  function load_checksums (line 60) | def load_checksums(checklist_path: Path) -> dict[str, str]:
  function verify_files (line 72) | def verify_files(model_dir: Path, checksums: dict[str, str], console: Co...
  function run_verify_cmd (line 107) | def run_verify_cmd(args: argparse.Namespace, parser: argparse.ArgumentPa...

FILE: models/datatypes.py
  class Role (line 21) | class Role(Enum):
  class BuiltinTool (line 28) | class BuiltinTool(Enum):
  class ToolCall (line 39) | class ToolCall(BaseModel):
    method validate_field (line 53) | def validate_field(cls, v):
  class ToolPromptFormat (line 62) | class ToolPromptFormat(Enum):
  class StopReason (line 87) | class StopReason(Enum):
  class ToolParamDefinition (line 93) | class ToolParamDefinition(BaseModel):
  class ToolDefinition (line 100) | class ToolDefinition(BaseModel):
    method validate_field (line 107) | def validate_field(cls, v):
  class RawMediaItem (line 116) | class RawMediaItem(BaseModel):
    method serialize_data (line 123) | def serialize_data(self, data: Optional[bytes], _info):
    method validate_data (line 130) | def validate_data(cls, v):
  class RawTextItem (line 136) | class RawTextItem(BaseModel):
  class RawMessage (line 146) | class RawMessage(BaseModel):
  class GenerationResult (line 158) | class GenerationResult(BaseModel):
  class QuantizationMode (line 176) | class QuantizationMode(str, Enum):

FILE: models/llama3/args.py
  class QuantizationScheme (line 13) | class QuantizationScheme(Enum):
  class QuantizationArgs (line 18) | class QuantizationArgs:
    method __init__ (line 23) | def __init__(self, **kwargs):
  class LoRAArgs (line 33) | class LoRAArgs:
  class ModelArgs (line 39) | class ModelArgs:
    method __init__ (line 62) | def __init__(self, **kwargs):

FILE: models/llama3/chat_format.py
  class VisionInput (line 31) | class VisionInput:
  class LLMInput (line 37) | class LLMInput:
  function role_str (line 42) | def role_str(role: Role) -> str:
  class ChatFormat (line 52) | class ChatFormat:
    method __init__ (line 55) | def __init__(self, tokenizer: Tokenizer):
    method _encode_header (line 61) | def _encode_header(self, role: str) -> List[int]:
    method encode_content (line 69) | def encode_content(self, content: RawContent) -> LLMInput:
    method _encode_content (line 73) | def _encode_content(self, content: RawContent, bos: bool = False) -> T...
    method encode_message (line 108) | def encode_message(
    method encode_dialog_prompt (line 146) | def encode_dialog_prompt(
    method decode_assistant_message (line 166) | def decode_assistant_message(self, tokens: List[int], stop_reason: Sto...
    method decode_assistant_message_from_content (line 171) | def decode_assistant_message_from_content(self, content: str, stop_rea...
    method _model_input_from_tokens_images (line 236) | def _model_input_from_tokens_images(self, tokens: List[int], images: L...
  function create_vision_mask (line 250) | def create_vision_mask(

FILE: models/llama3/generation.py
  function is_xccl_available (line 33) | def is_xccl_available():
  class Llama3 (line 39) | class Llama3:
    method build (line 41) | def build(
    method __init__ (line 147) | def __init__(self, model: Transformer | CrossAttentionTransformer, tok...
    method generate (line 154) | def generate(
    method completion (line 302) | def completion(
    method chat_completion (line 324) | def chat_completion(
  function sample_top_p (line 348) | def sample_top_p(probs, p):

FILE: models/llama3/model.py
  class RMSNorm (line 31) | class RMSNorm(torch.nn.Module):
    method __init__ (line 32) | def __init__(self, dim: int, eps: float = 1e-6):
    method _norm (line 37) | def _norm(self, x):
    method forward (line 40) | def forward(self, x):
  function apply_scaling (line 45) | def apply_scaling(freqs: torch.Tensor) -> torch.Tensor:
  function precompute_freqs_cis (line 65) | def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0, use...
  function reshape_for_broadcast (line 75) | def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
  function apply_rotary_emb (line 83) | def apply_rotary_emb(
  function repeat_kv (line 96) | def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
  class Attention (line 108) | class Attention(nn.Module):
    method __init__ (line 109) | def __init__(self, args: ModelArgs):
    method forward (line 164) | def forward(
  class FeedForward (line 205) | class FeedForward(nn.Module):
    method __init__ (line 206) | def __init__(
    method forward (line 224) | def forward(self, x):
  class TransformerBlock (line 228) | class TransformerBlock(nn.Module):
    method __init__ (line 229) | def __init__(self, layer_id: int, args: ModelArgs):
    method forward (line 245) | def forward(
  class Transformer (line 257) | class Transformer(nn.Module):
    method __init__ (line 258) | def __init__(self, params: ModelArgs):
    method forward (line 281) | def forward(self, tokens: torch.Tensor, start_pos: int):

FILE: models/llama3/multimodal/encoder_utils.py
  function resize_local_position_embedding (line 20) | def resize_local_position_embedding(orig_pos_embed, grid_size):
  function initialize_global_position_embedding_from_local (line 49) | def initialize_global_position_embedding_from_local(pos_and_cls_embed, g...
  function resize_global_position_embedding (line 77) | def resize_global_position_embedding(pos_and_cls_embed, grid_size, x_sca...
  function build_encoder_attention_mask (line 128) | def build_encoder_attention_mask(
  function expand_num_tokens_to_mult8 (line 150) | def expand_num_tokens_to_mult8(x):
  function contract_num_tokens_from_mult8 (line 171) | def contract_num_tokens_from_mult8(x, num_pad_tokens):

FILE: models/llama3/multimodal/image_transform.py
  class VariableSizeImageTransform (line 26) | class VariableSizeImageTransform(object):
    method __init__ (line 61) | def __init__(self, size: int = IMAGE_RES) -> None:
    method get_factors (line 75) | def get_factors(n: int) -> Set[int]:
    method find_supported_resolutions (line 94) | def find_supported_resolutions(self, max_num_chunks: int, patch_size: ...
    method get_max_res_without_distortion (line 144) | def get_max_res_without_distortion(
    method _pad (line 179) | def _pad(self, image: Image.Image, target_size) -> Image.Image:
    method _split (line 185) | def _split(self, image: torch.Tensor, ncw: int, nch: int) -> torch.Ten...
    method resize_without_distortion (line 195) | def resize_without_distortion(
    method get_best_fit (line 259) | def get_best_fit(
    method __call__ (line 358) | def __call__(

FILE: models/llama3/multimodal/model.py
  function reduce_from_tensor_model_parallel_region (line 41) | def reduce_from_tensor_model_parallel_region(input_):
  function gather_from_tensor_model_parallel_region (line 48) | def gather_from_tensor_model_parallel_region(input_):
  function _get_full_row_masked_out_mask (line 67) | def _get_full_row_masked_out_mask(
  class LayerNorm (line 83) | class LayerNorm(nn.LayerNorm):
    method forward (line 86) | def forward(self, x: torch.Tensor):
  class ColumnParallelConv2dPatch (line 91) | class ColumnParallelConv2dPatch(torch.nn.Module):
    method __init__ (line 104) | def __init__(
    method forward (line 122) | def forward(self, x: torch.Tensor) -> torch.Tensor:
  class ImageFeedForward (line 130) | class ImageFeedForward(torch.nn.Module):
    method __init__ (line 131) | def __init__(
    method forward (line 157) | def forward(self, x):
  class ImageAttention (line 166) | class ImageAttention(nn.Module):
    method __init__ (line 167) | def __init__(
    method forward (line 215) | def forward(
  class ImageTransformerBlock (line 243) | class ImageTransformerBlock(nn.Module):
    method __init__ (line 244) | def __init__(
    method forward (line 274) | def forward(
  class ImageTransformer (line 286) | class ImageTransformer(nn.Module):
    method __init__ (line 287) | def __init__(
    method forward (line 312) | def forward(self, x: torch.Tensor, return_intermediate=None, mask=None):
  class VisionEncoder (line 323) | class VisionEncoder(nn.Module):
    method __init__ (line 324) | def __init__(
    method load_hook (line 392) | def load_hook(
    method apply_positional_embedding (line 433) | def apply_positional_embedding(self, x, ar):
    method apply_class_embedding (line 445) | def apply_class_embedding(self, x):
    method forward (line 456) | def forward(self, images: torch.Tensor, ar: torch.Tensor) -> torch.Ten...
  class Attention (line 508) | class Attention(nn.Module):
    method __init__ (line 511) | def __init__(self, args: ModelArgs):
    method setup_cache (line 574) | def setup_cache(self, max_batch_size: int, dtype: torch.dtype):
    method forward (line 598) | def forward(
  class FeedForward (line 639) | class FeedForward(nn.Module):
    method __init__ (line 640) | def __init__(
    method forward (line 670) | def forward(self, x):
  class TransformerBlock (line 679) | class TransformerBlock(nn.Module):
    method __init__ (line 680) | def __init__(self, layer_id: int, args: ModelArgs):
    method setup_cache (line 711) | def setup_cache(self, max_batch_size: int, dtype: torch.dtype):
    method forward (line 714) | def forward(
  class TilePositionEmbedding (line 742) | class TilePositionEmbedding(nn.Module):
    method __init__ (line 743) | def __init__(
    method load_hook (line 759) | def load_hook(
    method _dynamic_resize (line 780) | def _dynamic_resize(embed: torch.Tensor, num_tiles: int):
    method forward (line 794) | def forward(self, x: torch.Tensor, ar: torch.Tensor, num_tiles: int = ...
  function _noinit (line 810) | def _noinit(x):
  class CrossAttention (line 814) | class CrossAttention(torch.nn.Module):
    method __init__ (line 817) | def __init__(
    method _compute_xattn_kv_cache (line 889) | def _compute_xattn_kv_cache(self, xattn_tokens: torch.Tensor) -> torch...
    method compute_xattn_kv_cache (line 909) | def compute_xattn_kv_cache(self, xattn_tokens: torch.Tensor) -> torch....
    method forward (line 912) | def forward(
  class CrossAttentionTransformerBlock (line 937) | class CrossAttentionTransformerBlock(torch.nn.Module):
    method __init__ (line 940) | def __init__(
    method compute_xattn_kv_cache (line 980) | def compute_xattn_kv_cache(self, xattn_tokens: torch.Tensor) -> torch....
    method forward (line 983) | def forward(
  class DummyCrossAttentionTransformerBlock (line 1003) | class DummyCrossAttentionTransformerBlock:
    method __call__ (line 1006) | def __call__(
  class DummySelfAttentionTransformerBlock (line 1015) | class DummySelfAttentionTransformerBlock:
    method __call__ (line 1018) | def __call__(
  class CrossAttentionTransformerVision (line 1027) | class CrossAttentionTransformerVision(torch.nn.Module):
    method __init__ (line 1028) | def __init__(self, args: ModelArgs) -> None:
    method forward (line 1054) | def forward(self, images: torch.Tensor, aspect_ratios: torch.Tensor) -...
  class CrossAttentionTransformerText (line 1065) | class CrossAttentionTransformerText(torch.nn.Module):
    method __init__ (line 1068) | def __init__(self, args: ModelArgs) -> None:
    method _init_fusion_schedule (line 1145) | def _init_fusion_schedule(
    method get_partially_trainable_embedding (line 1155) | def get_partially_trainable_embedding(self, x):
    method forward (line 1168) | def forward(
    method setup_cache (line 1208) | def setup_cache(self, max_batch_size: int, device: torch.device, dtype...
    method _get_xattn_mask (line 1228) | def _get_xattn_mask(
  class CrossAttentionTransformer (line 1264) | class CrossAttentionTransformer(torch.nn.Module):
    method __init__ (line 1265) | def __init__(self, args: ModelArgs) -> None:
    method setup_cache (line 1279) | def setup_cache(self, max_batch_size: int, device: torch.device, dtype...
    method compute_vision_tokens_masks (line 1282) | def compute_vision_tokens_masks(
    method forward (line 1353) | def forward(
  function _stack_images (line 1374) | def _stack_images(
  function _pad_masks (line 1403) | def _pad_masks(

FILE: models/llama3/multimodal/utils.py
  function get_negative_inf_value (line 13) | def get_negative_inf_value(dtype):
  function to_2tuple (line 17) | def to_2tuple(x):

FILE: models/llama3/quantization/loader.py
  function swiglu_wrapper (line 30) | def swiglu_wrapper(
  function convert_to_quantized_model (line 38) | def convert_to_quantized_model(
  function convert_to_fp8_quantized_model (line 53) | def convert_to_fp8_quantized_model(
  class Int8DynActInt4WeightLinearLoRA (line 99) | class Int8DynActInt4WeightLinearLoRA(Int8DynActInt4WeightLinear):
    method __init__ (line 115) | def __init__(
    method load_hook (line 149) | def load_hook(
    method forward (line 165) | def forward(self, input_: torch.Tensor) -> torch.Tensor:
  class Int8WeightEmbedding (line 173) | class Int8WeightEmbedding(torch.nn.Embedding):
    method __init__ (line 182) | def __init__(
    method load_hook (line 193) | def load_hook(
  class Int8WeightLinear (line 209) | class Int8WeightLinear(torch.nn.Linear):
    method __init__ (line 218) | def __init__(self, in_features: int, out_features: int, bias: bool = T...
    method load_hook (line 223) | def load_hook(
  function _prepare_model_int4_weight_int8_dynamic_activation (line 239) | def _prepare_model_int4_weight_int8_dynamic_activation(
  function convert_to_int4_quantized_model (line 287) | def convert_to_int4_quantized_model(

FILE: models/llama3/scripts/chat_completion.py
  function get_device (line 27) | def get_device():
  function run_main (line 37) | def run_main(
  function main (line 120) | def main():

FILE: models/llama3/scripts/completion.py
  function get_device (line 28) | def get_device():
  function run_main (line 38) | def run_main(
  function main (line 92) | def main():

FILE: models/llama3/tests/api/test_generation.py
  function get_device (line 21) | def get_device():
  function build_generator (line 32) | def build_generator(env_var: str, device: str):
  class TestTextModelInference (line 43) | class TestTextModelInference(unittest.TestCase):
    method setUpClass (line 47) | def setUpClass(cls):
    method test_run_generation (line 50) | def test_run_generation(self):
  class TestTextModelInferenceOnDevice (line 79) | class TestTextModelInferenceOnDevice(TestTextModelInference):
  class TestVisionModelInference (line 83) | class TestVisionModelInference(unittest.TestCase):
    method setUpClass (line 87) | def setUpClass(cls):
    method test_run_generation (line 92) | def test_run_generation(self):
  class TestVisionModelInferenceOnDevice (line 132) | class TestVisionModelInferenceOnDevice(TestVisionModelInference):

FILE: models/llama3/tests/api/test_tokenizer.py
  class TokenizerTests (line 18) | class TokenizerTests(TestCase):
    method setUp (line 19) | def setUp(self):
    method test_special_tokens (line 23) | def test_special_tokens(self):
    method test_encode (line 29) | def test_encode(self):
    method test_decode (line 35) | def test_decode(self):
    method test_encode_message (line 43) | def test_encode_message(self):
    method test_encode_dialog (line 65) | def test_encode_dialog(self):

FILE: models/llama3/tests/api/test_tool_utils.py
  class TestToolUtils (line 16) | class TestToolUtils(unittest.TestCase):
    method test_maybe_extract_custom_tool_call (line 17) | def test_maybe_extract_custom_tool_call(self):
  class TestPythonListCheck (line 25) | class TestPythonListCheck(unittest.TestCase):
    method test_valid_list_with_single_function_call (line 26) | def test_valid_list_with_single_function_call(self):
    method test_valid_list_with_multiple_function_calls (line 30) | def test_valid_list_with_multiple_function_calls(self):
    method test_invalid_empty_list (line 36) | def test_invalid_empty_list(self):
    method test_invalid_list_with_non_function_call (line 40) | def test_invalid_list_with_non_function_call(self):
    method test_invalid_list_with_positional_args (line 44) | def test_invalid_list_with_positional_args(self):
    method test_invalid_nested_list (line 48) | def test_invalid_nested_list(self):
    method test_invalid_dict (line 52) | def test_invalid_dict(self):
    method test_invalid_syntax (line 56) | def test_invalid_syntax(self):
    method test_valid_list_with_boolean_args (line 60) | def test_valid_list_with_boolean_args(self):
    method test_valid_list_with_numeric_args (line 64) | def test_valid_list_with_numeric_args(self):
    method test_invalid_bare_function_call (line 68) | def test_invalid_bare_function_call(self):
    method test_invalid_extra_char_function_call (line 72) | def test_invalid_extra_char_function_call(self):
  class TestParsePythonList (line 77) | class TestParsePythonList(unittest.TestCase):
    method test_single_function_call (line 78) | def test_single_function_call(self):
    method test_multiple_function_calls (line 83) | def test_multiple_function_calls(self):
    method test_function_call_with_numeric_args (line 93) | def test_function_call_with_numeric_args(self):
    method test_function_call_with_mixed_type_args (line 98) | def test_function_call_with_mixed_type_args(self):
    method test_function_call_with_empty_args (line 108) | def test_function_call_with_empty_args(self):
    method test_function_call_with_string_containing_spaces (line 113) | def test_function_call_with_string_containing_spaces(self):
    method test_function_names_with_underscores_lists_and_dicts (line 118) | def test_function_names_with_underscores_lists_and_dicts(self):

FILE: models/llama3/tokenizer.py
  class Tokenizer (line 46) | class Tokenizer:
    method get_instance (line 58) | def get_instance(cls):
    method __init__ (line 65) | def __init__(self, model_path: Path):
    method encode (line 118) | def encode(
    method decode (line 174) | def decode(self, t: Sequence[int]) -> str:
    method _split_whitespaces_or_nonwhitespaces (line 188) | def _split_whitespaces_or_nonwhitespaces(s: str, max_consecutive_slice...

FILE: models/llama3/tool_utils.py
  function is_json (line 18) | def is_json(s):
  function is_valid_python_list (line 28) | def is_valid_python_list(input_string):
  function parse_python_list_for_function_calls (line 67) | def parse_python_list_for_function_calls(input_string):
  class ToolUtils (line 96) | class ToolUtils:
    method is_builtin_tool_call (line 98) | def is_builtin_tool_call(message_body: str) -> bool:
    method maybe_extract_builtin_tool_call (line 103) | def maybe_extract_builtin_tool_call(message_body: str) -> Optional[Tup...
    method maybe_extract_custom_tool_call (line 116) | def maybe_extract_custom_tool_call(message_body: str) -> Optional[Tupl...
    method encode_tool_call (line 149) | def encode_tool_call(t: ToolCall, tool_prompt_format: ToolPromptFormat...

FILE: models/llama4/args.py
  class QuantizationScheme (line 14) | class QuantizationScheme(Enum):
  class QuantizationArgs (line 18) | class QuantizationArgs(BaseModel):
  class LoRAArgs (line 24) | class LoRAArgs(BaseModel):
  class MoEArgs (line 29) | class MoEArgs(BaseModel):
  class Size (line 39) | class Size(BaseModel):
  class VisionArgs (line 44) | class VisionArgs(BaseModel):
  class ModelArgs (line 58) | class ModelArgs(BaseModel):
    method validate (line 93) | def validate(self) -> "ModelArgs":

FILE: models/llama4/chat_format.py
  function role_str (line 36) | def role_str(role: Role) -> str:
  class TransformedImage (line 47) | class TransformedImage:
  function convert_image_to_rgb (line 53) | def convert_image_to_rgb(image: PIL_Image.Image, bg: Tuple[int, int, int...
  class ChatFormat (line 62) | class ChatFormat:
    method __init__ (line 65) | def __init__(
    method _encode_header (line 85) | def _encode_header(self, role: str) -> List[int]:
    method encode_content (line 95) | def encode_content(self, content: RawContent) -> LLMInput:
    method _encode_image (line 99) | def _encode_image(
    method _encode_content (line 144) | def _encode_content(self, content: RawContent, bos: bool = False) -> T...
    method encode_message (line 191) | def encode_message(
    method encode_dialog_prompt (line 224) | def encode_dialog_prompt(
    method decode_assistant_message (line 243) | def decode_assistant_message(self, tokens: List[int], stop_reason: Sto...
    method decode_assistant_message_from_content (line 248) | def decode_assistant_message_from_content(self, content: str, stop_rea...
    method _model_input_from_tokens_images (line 314) | def _model_input_from_tokens_images(self, tokens: List[int], images: L...

FILE: models/llama4/datatypes.py
  class MaskedEmbedding (line 15) | class MaskedEmbedding:
  class LLMInput (line 21) | class LLMInput:
  class TransformerInput (line 38) | class TransformerInput:
  class LLMOutput (line 54) | class LLMOutput:

FILE: models/llama4/ffn.py
  class FeedForward (line 16) | class FeedForward(nn.Module):
    method __init__ (line 17) | def __init__(
    method load_hook (line 31) | def load_hook(
    method forward (line 47) | def forward(self, x):

FILE: models/llama4/generation.py
  class Llama4 (line 36) | class Llama4:
    method build (line 38) | def build(
    method __init__ (line 112) | def __init__(self, model: Transformer, tokenizer: Tokenizer, args: Mod...
    method generate (line 119) | def generate(
    method completion (line 247) | def completion(
    method chat_completion (line 269) | def chat_completion(
  function sample_top_p (line 292) | def sample_top_p(probs, p):

FILE: models/llama4/model.py
  function rmsnorm (line 27) | def rmsnorm(x, eps):
  class RMSNorm (line 34) | class RMSNorm(torch.nn.Module):
    method __init__ (line 35) | def __init__(self, dim: int, eps: float = 1e-6):
    method forward (line 40) | def forward(self, x):
  function apply_scaling (line 44) | def apply_scaling(freqs: torch.Tensor, scale_factor: float, high_freq_fa...
  function precompute_freqs_cis (line 64) | def precompute_freqs_cis(
  function reshape_for_broadcast (line 81) | def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
  function apply_rotary_emb (line 89) | def apply_rotary_emb(
  class Attention (line 102) | class Attention(nn.Module):
    method __init__ (line 105) | def __init__(
    method load_hook (line 176) | def load_hook(
    method forward (line 199) | def forward(
  class TransformerBlock (line 251) | class TransformerBlock(nn.Module):
    method __init__ (line 252) | def __init__(self, layer_id: int, args: ModelArgs):
    method load_hook (line 290) | def load_hook(
    method forward (line 317) | def forward(
  class Transformer (line 337) | class Transformer(nn.Module):
    method __init__ (line 338) | def __init__(self, args: ModelArgs, **kwargs) -> None:
    method load_hook (line 376) | def load_hook(
    method forward (line 390) | def forward(self, model_input: TransformerInput) -> TransformerOutput:
  function create_chunked_attention_mask (line 431) | def create_chunked_attention_mask(seq_len: int, attention_chunk_size: in...

FILE: models/llama4/moe.py
  class Experts (line 22) | class Experts(nn.Module):
    method __init__ (line 23) | def __init__(
    method load_hook (line 65) | def load_hook(
    method forward (line 83) | def forward(
    method batched_swiglu (line 97) | def batched_swiglu(self, x: Tensor, w1: Tensor, w3: Tensor, w2: Tensor...
  class MoE (line 102) | class MoE(torch.nn.Module):
    method __init__ (line 121) | def __init__(
    method load_hook (line 160) | def load_hook(
    method forward (line 175) | def forward(self, x_bsD: Tensor) -> Tensor:  # noqa: N803
  function divide_exact (line 213) | def divide_exact(numerator: int, denominator: int) -> int:

FILE: models/llama4/preprocess.py
  class ResizeNormalizeImageTransform (line 22) | class ResizeNormalizeImageTransform:
    method __init__ (line 23) | def __init__(
    method __call__ (line 45) | def __call__(self, image: Image.Image) -> torch.Tensor:
  class VariableSizeImageTransform (line 49) | class VariableSizeImageTransform(object):
    method __init__ (line 84) | def __init__(self, size: int = IMAGE_RES) -> None:
    method get_factors (line 97) | def get_factors(n: int) -> Set[int]:
    method find_supported_resolutions (line 116) | def find_supported_resolutions(self, max_num_chunks: int, patch_size: ...
    method get_max_res_without_distortion (line 166) | def get_max_res_without_distortion(
    method _pad (line 201) | def _pad(self, image: Image.Image, target_size) -> Image.Image:
    method _split (line 207) | def _split(self, image: torch.Tensor, ncw: int, nch: int) -> torch.Ten...
    method resize_without_distortion (line 217) | def resize_without_distortion(
    method get_best_fit (line 284) | def get_best_fit(
    method __call__ (line 383) | def __call__(

FILE: models/llama4/quantization/loader.py
  function swiglu_wrapper_no_reduce (line 24) | def swiglu_wrapper_no_reduce(
  function experts_batched_swiglu_wrapper (line 33) | def experts_batched_swiglu_wrapper(
  function convert_to_quantized_model (line 46) | def convert_to_quantized_model(
  function logging_callbacks (line 174) | def logging_callbacks(

FILE: models/llama4/scripts/chat_completion.py
  function run_main (line 24) | def run_main(
  function main (line 109) | def main():

FILE: models/llama4/scripts/completion.py
  function run_main (line 24) | def run_main(
  function main (line 71) | def main():

FILE: models/llama4/scripts/quantize.py
  function ffn_quantize (line 45) | def ffn_quantize(
  function main (line 214) | def main():

FILE: models/llama4/tests/api/test_chat_format.py
  class TestChatFormatArgumentsJson (line 17) | class TestChatFormatArgumentsJson(unittest.TestCase):
    method setUp (line 20) | def setUp(self):
    method test_arguments_json_included_in_custom_tool_call (line 26) | def test_arguments_json_included_in_custom_tool_call(self):
    method test_arguments_json_included_in_builtin_tool_call (line 57) | def test_arguments_json_included_in_builtin_tool_call(self):
    method test_arguments_json_included_in_code_interpreter_call (line 94) | def test_arguments_json_included_in_code_interpreter_call(self):
    method test_arguments_json_with_complex_arguments (line 131) | def test_arguments_json_with_complex_arguments(self):
    method test_no_tool_calls_when_no_tools_detected (line 166) | def test_no_tool_calls_when_no_tools_detected(self):

FILE: models/llama4/tokenizer.py
  function get_reserved_special_tokens (line 43) | def get_reserved_special_tokens(name, count, start_index=0):
  class Tokenizer (line 113) | class Tokenizer:
    method get_instance (line 125) | def get_instance(cls):
    method __init__ (line 132) | def __init__(self, model_path: Path):
    method encode (line 181) | def encode(
    method decode (line 237) | def decode(self, t: Sequence[int]) -> str:
    method _split_whitespaces_or_nonwhitespaces (line 251) | def _split_whitespaces_or_nonwhitespaces(s: str, max_consecutive_slice...

FILE: models/llama4/vision/embedding.py
  class PixelShuffle (line 20) | class PixelShuffle(nn.Module):
    method __init__ (line 21) | def __init__(self, ps_ratio):
    method forward (line 25) | def forward(self, x):
  function pixel_shuffle_op (line 36) | def pixel_shuffle_op(input_x, ps_ratio):
  class SimpleMLP (line 50) | class SimpleMLP(torch.nn.Module):
    method __init__ (line 51) | def __init__(
    method forward (line 76) | def forward(self, x):
  class PixelShuffleMLP (line 83) | class PixelShuffleMLP(torch.nn.Module):
    method __init__ (line 84) | def __init__(
    method forward (line 108) | def forward(self, encoded_patches: torch.Tensor) -> torch.Tensor:
  class VisionEmbeddings (line 113) | class VisionEmbeddings(torch.nn.Module):
    method __init__ (line 114) | def __init__(self, args: VisionArgs):
    method load_hook (line 138) | def load_hook(
    method _get_empty_sequence (line 154) | def _get_empty_sequence(self, h):
    method forward (line 165) | def forward(
  function scatter_embeddings (line 180) | def scatter_embeddings(image_batch, image_mask, h_image, encoded_patches...

FILE: models/llama4/vision/encoder.py
  class LayerNorm (line 23) | class LayerNorm(nn.LayerNorm):
    method forward (line 26) | def forward(self, x: torch.Tensor):
  class ColumnParallelConv2dPatch (line 31) | class ColumnParallelConv2dPatch(torch.nn.Module):
    method __init__ (line 44) | def __init__(
    method forward (line 62) | def forward(self, x: torch.Tensor) -> torch.Tensor:
  class _FeedForward (line 69) | class _FeedForward(torch.nn.Module):
    method __init__ (line 70) | def __init__(
    method forward (line 96) | def forward(self, x):
  class _TransformerBlock (line 103) | class _TransformerBlock(nn.Module):
    method __init__ (line 104) | def __init__(
    method attention (line 141) | def attention(
    method forward (line 148) | def forward(
  class _Transformer (line 162) | class _Transformer(nn.Module):
    method __init__ (line 163) | def __init__(
    method forward (line 190) | def forward(self, x: torch.Tensor, return_intermediate=None, mask=None...
  class PackingIndex (line 201) | class PackingIndex:
  class VisionEncoder (line 225) | class VisionEncoder(nn.Module):
    method __init__ (line 226) | def __init__(
    method get_rope_freqs (line 307) | def get_rope_freqs(self, dim, theta=10000):
    method compute_rope_freqs (line 312) | def compute_rope_freqs(self, freqs, t):
    method load_hook (line 317) | def load_hook(
    method apply_class_embedding (line 367) | def apply_class_embedding(self, x):
    method forward (line 378) | def forward(self, images: torch.Tensor) -> torch.Tensor:

FILE: models/quantize_impls.py
  class Fp8ScaledWeights (line 26) | class Fp8ScaledWeights:
    method __class__ (line 30) | def __class__(self) -> Type[nn.parameter.Parameter]:
    method grad_fn (line 34) | def grad_fn(self) -> None:
  class Fp8RowwiseWeights (line 40) | class Fp8RowwiseWeights(
  class Int4ScaledWeights (line 50) | class Int4ScaledWeights:
    method __class__ (line 54) | def __class__(self) -> Type[nn.parameter.Parameter]:
    method grad_fn (line 58) | def grad_fn(self) -> None:
  class Int4Weights (line 64) | class Int4Weights(
  function int4_row_quantize (line 74) | def int4_row_quantize(
  function pack_int4 (line 107) | def pack_int4(x: torch.Tensor) -> torch.Tensor:
  function bmm_nt (line 121) | def bmm_nt(
  function ffn_swiglu (line 135) | def ffn_swiglu(
  function quantize_fp8 (line 163) | def quantize_fp8(
  function quantize_int4 (line 190) | def quantize_int4(
  function load_fp8 (line 215) | def load_fp8(
  function load_int4 (line 241) | def load_int4(
  function fc_dynamic (line 259) | def fc_dynamic(
  function ffn_swiglu_dynamic (line 278) | def ffn_swiglu_dynamic(

FILE: models/sku_list.py
  function resolve_model (line 22) | def resolve_model(descriptor: str) -> Model | None:
  function all_registered_models (line 29) | def all_registered_models() -> list[Model]:
  function llama2_family (line 41) | def llama2_family() -> list[Model]:
  function llama3_family (line 48) | def llama3_family() -> list[Model]:
  function llama3_1_family (line 55) | def llama3_1_family() -> list[Model]:
  function llama3_2_family (line 62) | def llama3_2_family() -> list[Model]:
  function llama3_3_family (line 69) | def llama3_3_family() -> list[Model]:
  function llama4_family (line 75) | def llama4_family() -> list[Model]:
  function llama4_base_models (line 82) | def llama4_base_models() -> list[Model]:
  function llama4_instruct_models (line 101) | def llama4_instruct_models() -> list[Model]:
  function llama2_base_models (line 129) | def llama2_base_models() -> list[Model]:
  function llama3_base_models (line 188) | def llama3_base_models() -> list[Model]:
  function llama3_1_base_models (line 229) | def llama3_1_base_models() -> list[Model]:
  function llama3_2_base_models (line 327) | def llama3_2_base_models() -> list[Model]:
  function llama2_instruct_models (line 410) | def llama2_instruct_models() -> list[Model]:
  function llama3_instruct_models (line 469) | def llama3_instruct_models() -> list[Model]:
  function llama3_1_instruct_models (line 510) | def llama3_1_instruct_models() -> list[Model]:
  function arch_args_1b (line 608) | def arch_args_1b() -> dict:
  function arch_args_3b (line 623) | def arch_args_3b() -> dict:
  function llama3_2_quantized_models (line 638) | def llama3_2_quantized_models() -> list[Model]:
  function llama3_2_instruct_models (line 707) | def llama3_2_instruct_models() -> list[Model]:
  function llama3_3_instruct_models (line 769) | def llama3_3_instruct_models() -> list[Model]:
  function safety_models (line 793) | def safety_models() -> list[Model]:
  class LlamaDownloadInfo (line 920) | class LlamaDownloadInfo:
  function llama_meta_net_info (line 926) | def llama_meta_net_info(model: Model) -> LlamaDownloadInfo:
  function llama_meta_pth_size (line 1005) | def llama_meta_pth_size(model: Model) -> int:

FILE: models/sku_types.py
  class CheckpointQuantizationFormat (line 14) | class CheckpointQuantizationFormat(Enum):
  class ModelFamily (line 26) | class ModelFamily(Enum):
  class CoreModelId (line 36) | class CoreModelId(Enum):
  function is_multimodal (line 88) | def is_multimodal(model_id) -> bool:
  function model_family (line 100) | def model_family(model_id) -> ModelFamily:
  class Model (line 160) | class Model(BaseModel):
    method model_family (line 175) | def model_family(self) -> ModelFamily:
    method descriptor (line 179) | def descriptor(self, shorten_default_variant: bool = True) -> str:
    method is_instruct_model (line 185) | def is_instruct_model(self) -> bool:
    method is_featured (line 190) | def is_featured(self) -> bool:
    method max_seq_length (line 200) | def max_seq_length(self) -> int:

FILE: models/tokenizer_utils.py
  function load_bpe_file (line 15) | def load_bpe_file(model_path: Path) -> dict[bytes, int]:

FILE: models/utils/model_utils.py
  function model_local_dir (line 13) | def model_local_dir(descriptor: str) -> str:

Condensed preview: 105 files, each showing path, character count, and a content snippet (full structured content: 6,947K chars).

[
  {
    "path": ".github/CODEOWNERS",
    "chars": 235,
    "preview": "# Each line is a file pattern followed by one or more owners.\n\n# These owners will be the default owners for everything "
  },
  {
    "path": ".github/workflows/publish-to-test-pypi.yml",
    "chars": 2143,
    "preview": "name: Publish Python 🐍 distribution 📦 to TestPyPI\n\non:\n  repository_dispatch:  # on trigger from llama-stack\n    types: "
  },
  {
    "path": ".gitignore",
    "chars": 53,
    "preview": "__pycache__\ndist\n*.egg-info\nbuild\n.DS_Store\n.vscode/\n"
  },
  {
    "path": ".pre-commit-config.yaml",
    "chars": 2326,
    "preview": "exclude: 'build/'\n\ndefault_language_version:\n    python: python3\n\nrepos:\n-   repo: https://github.com/pre-commit/pre-com"
  },
  {
    "path": ".ruff.toml",
    "chars": 1370,
    "preview": "# Suggested config from pytorch that we can adapt\nlint.select = [\"B\", \"C\", \"E\" , \"F\" , \"N\", \"W\", \"B9\"]\n\nline-length = 12"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "chars": 3537,
    "preview": "# Code of Conduct\n\n## Our Pledge\n\nIn the interest of fostering an open and welcoming environment, we as\ncontributors and"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 1339,
    "preview": "# Contributing to Llama-Models\nWe want to make contributing to this project as easy and transparent as\npossible.\n\n## Pul"
  },
  {
    "path": "LICENSE",
    "chars": 78,
    "preview": "https://github.com/meta-llama/llama-models/blob/main/README.md#llama-models-1\n"
  },
  {
    "path": "MANIFEST.in",
    "chars": 643,
    "preview": "include pyproject.toml\ninclude README.md\ninclude models/llama3/tokenizer.model\ninclude models/llama4/tokenizer.model\ninc"
  },
  {
    "path": "README.md",
    "chars": 10054,
    "preview": "<p align=\"center\">\n  <img src=\"/Llama_Repo.jpeg\" width=\"400\"/>\n</p>\n\n<p align=\"center\">\n        🤗 <a href=\"https://huggi"
  },
  {
    "path": "SECURITY.md",
    "chars": 136,
    "preview": "# Security Policy\n\n## Reporting a Vulnerability\n\nPlease report vulnerabilities to our bug bounty program at https://bugb"
  },
  {
    "path": "docs/license_header.txt",
    "chars": 265,
    "preview": "Copyright (c) Meta Platforms, Inc. and affiliates.\nAll rights reserved.\n\nThis source code is licensed under the terms de"
  },
  {
    "path": "models/__init__.py",
    "chars": 276,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/checkpoint.py",
    "chars": 6621,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/cli/__init__.py",
    "chars": 276,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/cli/describe.py",
    "chars": 2373,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/cli/download.py",
    "chars": 18484,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/cli/list.py",
    "chars": 3730,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/cli/llama.py",
    "chars": 1834,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/cli/prompt_format.py",
    "chars": 4748,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/cli/remove.py",
    "chars": 2449,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/cli/safety_models.py",
    "chars": 2198,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/cli/subcommand.py",
    "chars": 552,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/cli/table.py",
    "chars": 1172,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/cli/utils.py",
    "chars": 601,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/cli/verify_download.py",
    "chars": 4508,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/datatypes.py",
    "chars": 5513,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama2/LICENSE",
    "chars": 6917,
    "preview": "LLAMA 2 COMMUNITY LICENSE AGREEMENT\nLlama 2 Version Release Date: July 18, 2023\n\n\"Agreement\" means the terms and conditi"
  },
  {
    "path": "models/llama2/MODEL_CARD.md",
    "chars": 7622,
    "preview": "# **Model Details**\n\nMeta developed and released the Llama 2 family of large language models (LLMs), a collection of pre"
  },
  {
    "path": "models/llama2/USE_POLICY.md",
    "chars": 4748,
    "preview": "# Llama 2 Acceptable Use Policy\n\nMeta is committed to promoting safe and fair use of its tools and features, including L"
  },
  {
    "path": "models/llama3/LICENSE",
    "chars": 12299,
    "preview": "META LLAMA 3 COMMUNITY LICENSE AGREEMENT\n\nMeta Llama 3 Version Release Date: April 18, 2024\n“Agreement” means the terms "
  },
  {
    "path": "models/llama3/MODEL_CARD.md",
    "chars": 22590,
    "preview": "## Model Details\n\nMeta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of p"
  },
  {
    "path": "models/llama3/USE_POLICY.md",
    "chars": 4753,
    "preview": "# Meta Llama 3 Acceptable Use Policy\n\nMeta is committed to promoting safe and fair use of its tools and features, includ"
  },
  {
    "path": "models/llama3/__init__.py",
    "chars": 276,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/args.py",
    "chars": 2399,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/chat_format.py",
    "chars": 9613,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/generation.py",
    "chars": 14315,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/model.py",
    "chars": 11383,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/multimodal/__init__.py",
    "chars": 276,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/multimodal/encoder_utils.py",
    "chars": 6384,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/multimodal/image_transform.py",
    "chars": 16512,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/multimodal/model.py",
    "chars": 51036,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/multimodal/utils.py",
    "chars": 484,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/quantization/loader.py",
    "chars": 11861,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/requirements.txt",
    "chars": 83,
    "preview": "blobfile\nfairscale\njinja2\njson-strong-typing\ntiktoken\ntorch\npydantic\npydantic_core\n"
  },
  {
    "path": "models/llama3/scripts/__init__.py",
    "chars": 276,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/scripts/chat_completion.py",
    "chars": 4193,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/scripts/completion.py",
    "chars": 2513,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/tests/api/test_generation.py",
    "chars": 4236,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/tests/api/test_tokenizer.py",
    "chars": 3429,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/tests/api/test_tool_utils.py",
    "chars": 5690,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/tokenizer.model",
    "chars": 2183982,
    "preview": "IQ== 0\nIg== 1\nIw== 2\nJA== 3\nJQ== 4\nJg== 5\nJw== 6\nKA== 7\nKQ== 8\nKg== 9\nKw== 10\nLA== 11\nLQ== 12\nLg== 13\nLw== 14\nMA== 15\nMQ"
  },
  {
    "path": "models/llama3/tokenizer.py",
    "chars": 7079,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3/tool_utils.py",
    "chars": 7183,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama3_1/LICENSE",
    "chars": 7589,
    "preview": " LLAMA 3.1 COMMUNITY LICENSE AGREEMENT\n\n Llama 3.1 Version Release Date: July 23, 2024\n\n“Agreement” means the terms and "
  },
  {
    "path": "models/llama3_1/MODEL_CARD.md",
    "chars": 26727,
    "preview": "## Model Information\n\nThe Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pret"
  },
  {
    "path": "models/llama3_1/USE_POLICY.md",
    "chars": 4663,
    "preview": "**Llama 3.1** **Acceptable Use Policy**\n\nMeta is committed to promoting safe and fair use of its tools and features, inc"
  },
  {
    "path": "models/llama3_1/eval_details.md",
    "chars": 12713,
    "preview": "\n# Llama 3 Evaluation Details\n\nThis document contains some additional context on the settings and methodology for how we"
  },
  {
    "path": "models/llama3_1/prompt_format.md",
    "chars": 11773,
    "preview": "\n\n# Llama 3.1 - Prompt Formats\n## Tokens\nHere is a list of special tokens that are supported by Llama 3.1:\n- `<|begin_of"
  },
  {
    "path": "models/llama3_2/LICENSE",
    "chars": 7592,
    "preview": "LLAMA 3.2 COMMUNITY LICENSE AGREEMENT\n\nLlama 3.2 Version Release Date: September 25, 2024\n\n“Agreement” means the terms a"
  },
  {
    "path": "models/llama3_2/MODEL_CARD.md",
    "chars": 25400,
    "preview": "## Model Information\n\nThe Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretraine"
  },
  {
    "path": "models/llama3_2/MODEL_CARD_VISION.md",
    "chars": 20377,
    "preview": "## Model Information\n\nThe Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pret"
  },
  {
    "path": "models/llama3_2/USE_POLICY.md",
    "chars": 6007,
    "preview": "**Llama 3.2** **Acceptable Use Policy**\n\nMeta is committed to promoting safe and fair use of its tools and features, inc"
  },
  {
    "path": "models/llama3_2/eval_details.md",
    "chars": 12874,
    "preview": "\n\n# Llama 3 Evaluation Details\n\nThis document contains some additional context on the settings and methodology for how w"
  },
  {
    "path": "models/llama3_2/text_prompt_format.md",
    "chars": 9613,
    "preview": "## User and assistant conversation\n\nHere is a regular multi-turn user assistant conversation and how its formatted.\n\n###"
  },
  {
    "path": "models/llama3_2/vision_prompt_format.md",
    "chars": 5047,
    "preview": "## User and assistant conversation\n\nHere is a regular multi-turn user assistant conversation and how its formatted.\n\n###"
  },
  {
    "path": "models/llama3_3/LICENSE",
    "chars": 7821,
    "preview": "**LLAMA 3.3 COMMUNITY LICENSE AGREEMENT**\n\n Llama 3.3 Version Release Date: December 6, 2024\n\n“**Agreement**” means the "
  },
  {
    "path": "models/llama3_3/MODEL_CARD.md",
    "chars": 17352,
    "preview": "## Model Information\n\nThe Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned g"
  },
  {
    "path": "models/llama3_3/USE_POLICY.md",
    "chars": 6006,
    "preview": "**Llama 3.3** **Acceptable Use Policy**\n\nMeta is committed to promoting safe and fair use of its tools and features, inc"
  },
  {
    "path": "models/llama3_3/eval_details.md",
    "chars": 13154,
    "preview": "\n\n# Llama 3 Evaluation Details\n\nThis document contains some additional context on the settings and methodology for how w"
  },
  {
    "path": "models/llama3_3/prompt_format.md",
    "chars": 16057,
    "preview": "\n\n# Llama 3.3 - Prompt Formats\n## Tokens\nHere is a list of special tokens that are supported by Llama 3.3:\n- `<|begin_of"
  },
  {
    "path": "models/llama4/LICENSE",
    "chars": 7526,
    "preview": "LLAMA 4 COMMUNITY LICENSE AGREEMENT\n\nLlama 4 Version Effective Date: April 5, 2025\n\n“Agreement” means the terms and cond"
  },
  {
    "path": "models/llama4/MODEL_CARD.md",
    "chars": 23708,
    "preview": "## Model Information\n\nThe Llama 4 collection of models are natively multimodal AI models that enable text and multimodal"
  },
  {
    "path": "models/llama4/USE_POLICY.md",
    "chars": 5812,
    "preview": "**Llama 4** **Acceptable Use Policy**\n\nMeta is committed to promoting safe and fair use of its tools and features, inclu"
  },
  {
    "path": "models/llama4/__init__.py",
    "chars": 276,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/args.py",
    "chars": 3458,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/chat_format.py",
    "chars": 12165,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/datatypes.py",
    "chars": 1721,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/ffn.py",
    "chars": 1995,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/generation.py",
    "chars": 12124,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/model.py",
    "chars": 16309,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/moe.py",
    "chars": 6928,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/preprocess.py",
    "chars": 17223,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/prompt_format.md",
    "chars": 48956,
    "preview": "\n\n# Llama 4 - Prompt Formats\n## Tokens\nHere is a list of special tokens that are supported by Llama 4:\n- `<|begin_of_tex"
  },
  {
    "path": "models/llama4/quantization/__init__.py",
    "chars": 477,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/quantization/loader.py",
    "chars": 7898,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/scripts/chat_completion.py",
    "chars": 4004,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/scripts/completion.py",
    "chars": 2085,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/scripts/quantize.py",
    "chars": 8544,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/tests/__init__.py",
    "chars": 276,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/tests/api/__init__.py",
    "chars": 276,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/tests/api/test_chat_format.py",
    "chars": 9020,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/tokenizer.model",
    "chars": 3622230,
    "preview": "wA== 0\nwQ== 1\n9Q== 2\n9g== 3\n9w== 4\n+A== 5\n+Q== 6\n+g== 7\n+w== 8\n/A== 9\n/Q== 10\n/g== 11\n/w== 12\nIQ== 13\nIg== 14\nIw== 15\nJA"
  },
  {
    "path": "models/llama4/tokenizer.py",
    "chars": 9205,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/vision/embedding.py",
    "chars": 7255,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/llama4/vision/encoder.py",
    "chars": 14811,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/quantize_impls.py",
    "chars": 9382,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/sku_list.py",
    "chars": 34287,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/sku_types.py",
    "chars": 7684,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/tokenizer_utils.py",
    "chars": 1124,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/utils/__init__.py",
    "chars": 276,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/utils/config.py",
    "chars": 481,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "models/utils/model_utils.py",
    "chars": 544,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the t"
  },
  {
    "path": "pyproject.toml",
    "chars": 1421,
    "preview": "[build-system]\nrequires = [\"setuptools>=61.0\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"llama_models\"\n"
  },
  {
    "path": "requirements.txt",
    "chars": 418,
    "preview": "# This file was autogenerated by uv via the following command:\n#    uv export --frozen --no-hashes --no-emit-project --o"
  }
]

About this extraction

This page contains the full source code of the meta-llama/llama-models GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 105 files (6.3 MB), approximately 1.7M tokens, and a symbol index with 503 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
