Full Code of abetlen/llama-cpp-python for AI

main c37132bac860 cached

99 files

1.2 MB

382.1k tokens

750 symbols

Repository: abetlen/llama-cpp-python
Branch: main
Commit: c37132bac860
Files: 99
Total size: 1.2 MB

Directory structure:
gitextract_mj0bsng8/

├── .dockerignore
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.md
│   │   └── feature_request.md
│   ├── dependabot.yml
│   └── workflows/
│       ├── build-and-release.yaml
│       ├── build-docker.yaml
│       ├── build-wheels-cuda.yaml
│       ├── build-wheels-metal.yaml
│       ├── generate-index-from-release.yaml
│       ├── publish-to-test.yaml
│       ├── publish.yaml
│       ├── test-pypi.yaml
│       └── test.yaml
├── .gitignore
├── .gitmodules
├── .readthedocs.yaml
├── CHANGELOG.md
├── CMakeLists.txt
├── LICENSE.md
├── Makefile
├── README.md
├── docker/
│   ├── README.md
│   ├── cuda_simple/
│   │   └── Dockerfile
│   ├── open_llama/
│   │   ├── Dockerfile
│   │   ├── build.sh
│   │   ├── hug_model.py
│   │   ├── start.sh
│   │   └── start_server.sh
│   ├── openblas_simple/
│   │   └── Dockerfile
│   └── simple/
│       ├── Dockerfile
│       └── run.sh
├── docs/
│   ├── api-reference.md
│   ├── changelog.md
│   ├── index.md
│   ├── install/
│   │   └── macos.md
│   ├── requirements.txt
│   └── server.md
├── examples/
│   ├── batch-processing/
│   │   └── server.py
│   ├── gradio_chat/
│   │   ├── local.py
│   │   └── server.py
│   ├── hf_pull/
│   │   └── main.py
│   ├── high_level_api/
│   │   ├── fastapi_server.py
│   │   ├── high_level_api_embedding.py
│   │   ├── high_level_api_inference.py
│   │   ├── high_level_api_infill.py
│   │   ├── high_level_api_streaming.py
│   │   └── langchain_custom_llm.py
│   ├── low_level_api/
│   │   ├── Chat.py
│   │   ├── Miku.py
│   │   ├── ReasonAct.py
│   │   ├── common.py
│   │   ├── low_level_api_chat_cpp.py
│   │   ├── low_level_api_llama_cpp.py
│   │   ├── quantize.py
│   │   ├── readme/
│   │   │   └── low_level_api_llama_cpp.md
│   │   └── util.py
│   ├── notebooks/
│   │   ├── Batching.ipynb
│   │   ├── Clients.ipynb
│   │   ├── Functions.ipynb
│   │   ├── Guidance.ipynb
│   │   ├── Multimodal.ipynb
│   │   ├── OpenHermesFunctionCalling.ipynb
│   │   └── PerformanceTuning.ipynb
│   └── ray/
│       ├── README.md
│       ├── llm.py
│       └── requirements.txt
├── llama_cpp/
│   ├── __init__.py
│   ├── _ctypes_extensions.py
│   ├── _ggml.py
│   ├── _internals.py
│   ├── _logger.py
│   ├── _utils.py
│   ├── llama.py
│   ├── llama_cache.py
│   ├── llama_chat_format.py
│   ├── llama_cpp.py
│   ├── llama_grammar.py
│   ├── llama_speculative.py
│   ├── llama_tokenizer.py
│   ├── llama_types.py
│   ├── llava_cpp.py
│   ├── mtmd_cpp.py
│   ├── py.typed
│   └── server/
│       ├── __init__.py
│       ├── __main__.py
│       ├── app.py
│       ├── cli.py
│       ├── errors.py
│       ├── model.py
│       ├── settings.py
│       └── types.py
├── mkdocs.yml
├── pyproject.toml
├── scripts/
│   ├── get-releases.sh
│   └── releases-to-pep-503.sh
└── tests/
    ├── test_llama.py
    ├── test_llama_chat_format.py
    ├── test_llama_grammar.py
    └── test_llama_speculative.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .dockerignore
================================================
_skbuild/

.envrc

models/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# poetry
#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
#   This is especially recommended for binary packages to ensure reproducibility, and is more
#   commonly ignored for libraries.
#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#   in version control.
#   https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file.  For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/


================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.md
================================================
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''

---

# Prerequisites

Please answer the following questions for yourself before submitting an issue.

- [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [ ] I carefully followed the [README.md](https://github.com/abetlen/llama-cpp-python/blob/main/README.md).
- [ ] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
- [ ] I reviewed the [Discussions](https://github.com/abetlen/llama-cpp-python/discussions), and have a new bug or useful enhancement to share.

# Expected Behavior

Please provide a detailed written description of what you were trying to do, and what you expected `llama-cpp-python` to do.

# Current Behavior

Please provide a detailed written description of what `llama-cpp-python` did, instead.

# Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue can only be reproduced under specific conditions.

* Physical (or virtual) hardware you are using, e.g. for Linux:

`$ lscpu`

* Operating System, e.g. for Linux:

`$ uname -a`

* SDK version, e.g. for Linux:

```
$ python3 --version
$ make --version
$ g++ --version
```
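
If running the commands by hand is inconvenient, the same information can be gathered with a short script. This is only a minimal sketch and not part of the project: the helper names `tool_version` and `collect_sdk_info` are hypothetical, and it assumes each tool accepts a `--version` flag.

```python
import platform
import shutil
import subprocess
import sys


def tool_version(tool: str) -> str:
    """Return the first line of `<tool> --version`, or a note if unavailable."""
    path = shutil.which(tool)
    if path is None:
        return "not found"
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"error: {exc}"
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "unknown"


def collect_sdk_info(tools=("make", "g++")) -> dict:
    """Collect the Python version, platform string, and build tool versions."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for tool in tools:
        info[tool] = tool_version(tool)
    return info


if __name__ == "__main__":
    for name, value in collect_sdk_info().items():
        print(f"{name}: {value}")
```

Paste the printed lines into the issue as text rather than as a screenshot.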

# Failure Information (for bugs)

Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.

# Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

1. step 1
2. step 2
3. step 3
4. etc.

**Note: Many reported issues concern functional or performance differences from `llama.cpp`. In these cases we need to confirm that you're comparing against the version of `llama.cpp` that was built with your Python package, and which parameters you're passing to the context.**

Try the following:

1. `git clone https://github.com/abetlen/llama-cpp-python`
2. `cd llama-cpp-python`
3. `rm -rf _skbuild/` # delete any old builds
4. `python -m pip install .`
5. `cd ./vendor/llama.cpp`
6. Follow [llama.cpp's instructions](https://github.com/ggerganov/llama.cpp#build) to `cmake` llama.cpp
7. Run llama.cpp's `./main` (named `llama-cli` in newer llama.cpp builds) with the same arguments you previously passed to llama-cpp-python and see if you can reproduce the issue. If you can, [log an issue with llama.cpp](https://github.com/ggerganov/llama.cpp/issues)

# Failure Logs

Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.

Also, please **avoid using screenshots** if at all possible. Instead, copy and paste the console output and use [GitHub's Markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) to format your logs so they are easy to read.

Example environment info:
```
llama-cpp-python$ git log | head -1
commit 47b0aa6e957b93dbe2c29d53af16fbae2dd628f2

llama-cpp-python$ python3 --version
Python 3.10.10

llama-cpp-python$ pip list | egrep "uvicorn|fastapi|sse-starlette|numpy"
fastapi                  0.95.0
numpy                    1.24.3
sse-starlette            1.3.3
uvicorn                  0.21.1

llama-cpp-python/vendor/llama.cpp$ git log | head -3
commit 66874d4fbcc7866377246efbcee938e8cc9c7d76
Author: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Date:   Thu May 25 20:18:01 2023 -0600
```
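
The package versions in the example above can also be collected from Python. The following is a minimal sketch using only the standard library; the helper name `package_versions` is hypothetical, and the package list simply mirrors the `pip list` filter shown above plus `llama-cpp-python` itself.

```python
from importlib import metadata

# Packages to report; adjust as needed for your setup.
PACKAGES = ("llama-cpp-python", "uvicorn", "fastapi", "sse-starlette", "numpy")


def package_versions(names=PACKAGES) -> dict:
    """Map each distribution name to its installed version, if any."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in package_versions().items():
        print(f"{name:<24} {version}")
```

Packages that are missing are reported as `not installed` instead of raising an error, so the script can be run in any environment.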


================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.md
================================================
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.


================================================
FILE: .github/dependabot.yml
================================================
# To get started with Dependabot version updates, you'll need to specify which
# package ecosystems to update and where the package manifests are located.
# Please see the documentation for all configuration options:
# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates

version: 2
updates:
  - package-ecosystem: "pip" # See documentation for possible values
    directory: "/" # Location of package manifests
    schedule:
      interval: "daily"
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "daily"
  - package-ecosystem: "docker"
    directory: "/"
    schedule:
      interval: "daily"   


================================================
FILE: .github/workflows/build-and-release.yaml
================================================
name: Build Release

on: workflow_dispatch

permissions:
  contents: write

jobs:
  build_wheels:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-22.04, windows-2022, macos-14, macos-15]

    steps:
      - uses: actions/checkout@v4
        with:
          submodules: "recursive"

      # Used to host cibuildwheel
      - uses: actions/setup-python@v5
        with:
          python-version: "3.9"

      - name: Install dependencies (Linux/MacOS)
        if: runner.os != 'Windows'
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          RUST_LOG=trace python -m uv pip install -e .[all] --verbose
        shell: bash

      - name: Install dependencies (Windows)
        if: runner.os == 'Windows'
        env:
          RUST_LOG: trace        
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          python -m uv pip install -e .[all] --verbose
        shell: cmd

      - name: Build wheels
        uses: pypa/cibuildwheel@v2.22.0
        env:
          # disable repair
          CIBW_REPAIR_WHEEL_COMMAND: ""
        with:
          package-dir: .
          output-dir: wheelhouse

      - uses: actions/upload-artifact@v4
        with:
          name: wheels-${{ matrix.os }}
          path: ./wheelhouse/*.whl

  build_wheels_arm64:
    name: Build arm64 wheels
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: "recursive"

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
        with:
          platforms: linux/arm64

      - name: Build wheels
        uses: pypa/cibuildwheel@v2.22.0
        env:
          CIBW_SKIP: "*musllinux* pp*"
          CIBW_REPAIR_WHEEL_COMMAND: ""
          CIBW_ARCHS: "aarch64"
          CIBW_ENVIRONMENT: CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DCMAKE_CROSSCOMPILING=ON"
          CIBW_BUILD: "cp38-* cp39-* cp310-* cp311-* cp312-*"
        with:
          output-dir: wheelhouse

      - name: Upload wheels as artifacts
        uses: actions/upload-artifact@v4
        with:
          name: wheels_arm64
          path: ./wheelhouse/*.whl

  build_sdist:
    name: Build source distribution
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
        with:
          submodules: "recursive"

      - uses: actions/setup-python@v5
        with:
          python-version: "3.9"

      - name: Install dependencies (Linux/MacOS)
        if: runner.os != 'Windows'
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          RUST_LOG=trace python -m uv pip install -e .[all] --verbose
          python -m uv pip install build
        shell: bash

      - name: Install dependencies (Windows)
        if: runner.os == 'Windows'
        env:
          RUST_LOG: trace        
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          python -m uv pip install -e .[all] --verbose
          python -m uv pip install build
        shell: cmd

      - name: Build source distribution
        run: |
          python -m build --sdist

      - uses: actions/upload-artifact@v4
        with:
          name: sdist
          path: ./dist/*.tar.gz

  release:
    name: Release
    needs: [build_wheels, build_wheels_arm64, build_sdist]
    runs-on: ubuntu-latest

    steps:
      - uses: actions/download-artifact@v4
        with:
          merge-multiple: true
          path: dist

      - uses: softprops/action-gh-release@v2
        with:
          files: dist/*
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}


================================================
FILE: .github/workflows/build-docker.yaml
================================================
name: Build Docker

on: workflow_dispatch

permissions:
  contents: write
  packages: write

jobs:
  docker:
    name: Build and push Docker image
    runs-on: ubuntu-22.04
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          submodules: "recursive"

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v3 
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        id: docker_build
        uses: docker/build-push-action@v6
        with:
          context: .
          file: "docker/simple/Dockerfile"
          push: ${{ startsWith(github.ref, 'refs/tags/') }}
          pull: true
          platforms: linux/amd64,linux/arm64
          tags: |
            ghcr.io/abetlen/llama-cpp-python:latest
            ghcr.io/abetlen/llama-cpp-python:${{ github.ref_name }}
          build-args: |
            BUILDKIT_INLINE_CACHE=1

      - name: Publish to GitHub Tag
        if: steps.docker_build.outputs.digest && startsWith(github.ref, 'refs/tags/')
        run: |
          echo "Docker image published for tag: ${{ github.ref_name }}"


================================================
FILE: .github/workflows/build-wheels-cuda.yaml
================================================
name: Build Wheels (CUDA)

on: workflow_dispatch

permissions:
  contents: write

jobs:
  define_matrix:
    name: Define Build Matrix
    runs-on: ubuntu-22.04
    outputs:
      matrix: ${{ steps.set-matrix.outputs.matrix }}
    defaults:
      run:
        shell: pwsh

    steps:
      - name: Define Job Output
        id: set-matrix
        run: |
          $matrix = @{
              'os' = @('ubuntu-22.04') #, 'windows-2022')
              'pyver' = @("3.9", "3.10", "3.11", "3.12")
              'cuda' = @("12.1.1", "12.2.2", "12.3.2", "12.4.1") #, "12.5.1", "12.6.1")
              'releasetag' = @("basic")
          }

          $matrixOut = ConvertTo-Json $matrix -Compress
          Write-Output ('matrix=' + $matrixOut) >> $env:GITHUB_OUTPUT

  build_wheels:
    name: Build Wheel ${{ matrix.os }} ${{ matrix.pyver }} ${{ matrix.cuda }} ${{ matrix.releasetag == 'wheels' && 'AVX2' || matrix.releasetag }}
    needs: define_matrix
    runs-on: ${{ matrix.os }}
    strategy:
      matrix: ${{ fromJSON(needs.define_matrix.outputs.matrix) }}
    defaults:
      run:
        shell: pwsh
    env:
      CUDAVER: ${{ matrix.cuda }}
      AVXVER: ${{ matrix.releasetag }}

    steps:
      - name: Add MSBuild to PATH
        if: runner.os == 'Windows'
        uses: microsoft/setup-msbuild@v2
        with:
          vs-version: '[16.11,16.12)'

      - uses: actions/checkout@v4
        with:
          submodules: "recursive"

      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.pyver }}
          cache: 'pip'

      - name: Setup Mamba
        uses: conda-incubator/setup-miniconda@v3.1.0
        with:
          activate-environment: "llamacpp"
          python-version: ${{ matrix.pyver }}
          miniforge-version: latest
          add-pip-as-python-dependency: true
          auto-activate-base: false

      - name: VS Integration Cache
        id: vs-integration-cache
        if: runner.os == 'Windows'
        uses: actions/cache@v4
        with:
          path: ./MSBuildExtensions
          key: cuda-${{ matrix.cuda }}-vs-integration

      - name: Get Visual Studio Integration
        if: runner.os == 'Windows' && steps.vs-integration-cache.outputs.cache-hit != 'true'
        run: |
          if ($env:CUDAVER -eq '12.1.1') {$x = '12.1.0'} else {$x = $env:CUDAVER}
          $links = (Invoke-RestMethod 'https://raw.githubusercontent.com/Jimver/cuda-toolkit/master/src/links/windows-links.ts').Trim().split().where({$_ -ne ''})
          for ($i=$q=0;$i -lt $links.count -and $q -lt 2;$i++) {if ($links[$i] -eq "'$x',") {$q++}}
          Invoke-RestMethod $links[$i].Trim("'") -OutFile 'cudainstaller.zip'
          & 'C:\Program Files\7-Zip\7z.exe' e cudainstaller.zip -oMSBuildExtensions -r *\MSBuildExtensions\* > $null
          Remove-Item 'cudainstaller.zip'

      - name: Install Visual Studio Integration
        if: runner.os == 'Windows'
        run: |
          $y = (gi '.\MSBuildExtensions').fullname + '\*'
          (gi 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Microsoft\VC\*\BuildCustomizations').fullname.foreach({cp $y $_})
          $cupath = 'CUDA_PATH_V' + $env:CUDAVER.Remove($env:CUDAVER.LastIndexOf('.')).Replace('.','_')
          echo "$cupath=$env:CONDA_PREFIX" >> $env:GITHUB_ENV

      - name: Install Dependencies
        env:
          MAMBA_DOWNLOAD_FAILFAST: "0"
          MAMBA_NO_LOW_SPEED_LIMIT: "1"
        run: |
          $cudaVersion = $env:CUDAVER
          mamba install -y 'cuda' -c nvidia/label/cuda-$cudaVersion
          python -m pip install build wheel

      - name: Build Wheel
        run: |
          $cudaVersion = $env:CUDAVER.Remove($env:CUDAVER.LastIndexOf('.')).Replace('.','')
          $env:CUDA_PATH = $env:CONDA_PREFIX
          $env:CUDA_HOME = $env:CONDA_PREFIX
          $env:CUDA_TOOLKIT_ROOT_DIR = $env:CONDA_PREFIX
          if ($IsLinux) {
            $env:LD_LIBRARY_PATH = $env:CONDA_PREFIX + '/lib:' + $env:LD_LIBRARY_PATH
          }
          $env:VERBOSE = '1'
          $env:CMAKE_ARGS = '-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all'
          $env:CMAKE_ARGS = "-DGGML_CUDA_FORCE_MMQ=ON $env:CMAKE_ARGS"
          # if ($env:AVXVER -eq 'AVX') {
          $env:CMAKE_ARGS = $env:CMAKE_ARGS + ' -DGGML_AVX2=off -DGGML_FMA=off -DGGML_F16C=off'
          # }
          # if ($env:AVXVER -eq 'AVX512') {
          #  $env:CMAKE_ARGS = $env:CMAKE_ARGS + ' -DGGML_AVX512=on'
          # }
          # if ($env:AVXVER -eq 'basic') {
          #  $env:CMAKE_ARGS = $env:CMAKE_ARGS + ' -DGGML_AVX=off -DGGML_AVX2=off -DGGML_FMA=off -DGGML_F16C=off'
          # }
          python -m build --wheel
          # Export the short CUDA version (e.g. 121) to later steps via GITHUB_ENV
          Write-Output "CUDA_VERSION=$cudaVersion" >> $env:GITHUB_ENV

      - uses: softprops/action-gh-release@v2
        with:
          files: dist/*
          # Set tag_name to <tag>-cu<cuda_version>
          tag_name: ${{ github.ref_name }}-cu${{ env.CUDA_VERSION }}
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}


================================================
FILE: .github/workflows/build-wheels-metal.yaml
================================================
name: Build Wheels (Metal)

on: workflow_dispatch

permissions:
  contents: write

jobs:
  build_wheels:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [macos-14, macos-15]

    steps:
      - uses: actions/checkout@v4
        with:
          submodules: "recursive"

      # Used to host cibuildwheel
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: 'pip'

      - name: Install dependencies (Linux/MacOS)
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          RUST_LOG=trace python -m uv pip install -e .[all] --verbose
        shell: bash

      - name: Build wheels
        uses: pypa/cibuildwheel@v2.22.0
        env:
          # disable repair
          CIBW_REPAIR_WHEEL_COMMAND: ""
          CIBW_ARCHS: "arm64"
          CIBW_ENVIRONMENT: CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DGGML_METAL=on -DCMAKE_CROSSCOMPILING=ON"
          CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-*"
        with:
          package-dir: .
          output-dir: wheelhouse2

      - uses: actions/upload-artifact@v4
        with:
          name: wheels-mac_${{ matrix.os }}
          path: ./wheelhouse2/*.whl

  release:
    name: Release
    needs: [build_wheels]
    runs-on: ubuntu-latest

    steps:
      - uses: actions/download-artifact@v4
        with:
          merge-multiple: true
          path: dist2

      - uses: softprops/action-gh-release@v2
        with:
          files: dist2/*
          # Set tag_name to <tag>-metal
          tag_name: ${{ github.ref_name }}-metal
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}


================================================
FILE: .github/workflows/generate-index-from-release.yaml
================================================
name: Wheels Index

on:
  # Trigger on new release
  workflow_run:
    workflows: ["Build Release", "Build Wheels (CUDA)", "Build Wheels (Metal)"]
    types:
      - completed

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  # Single deploy job since we're just deploying
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup Pages
        uses: actions/configure-pages@v5
      - name: Build
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          ./scripts/get-releases.sh
          ./scripts/releases-to-pep-503.sh index/whl/cpu '^[v]?[0-9]+\.[0-9]+\.[0-9]+$'
          ./scripts/releases-to-pep-503.sh index/whl/cu121 '^[v]?[0-9]+\.[0-9]+\.[0-9]+-cu121$'
          ./scripts/releases-to-pep-503.sh index/whl/cu122 '^[v]?[0-9]+\.[0-9]+\.[0-9]+-cu122$'
          ./scripts/releases-to-pep-503.sh index/whl/cu123 '^[v]?[0-9]+\.[0-9]+\.[0-9]+-cu123$'
          ./scripts/releases-to-pep-503.sh index/whl/cu124 '^[v]?[0-9]+\.[0-9]+\.[0-9]+-cu124$'
          # ./scripts/releases-to-pep-503.sh index/whl/cu125 '^[v]?[0-9]+\.[0-9]+\.[0-9]+-cu125$'
          # ./scripts/releases-to-pep-503.sh index/whl/cu126 '^[v]?[0-9]+\.[0-9]+\.[0-9]+-cu126$'
          ./scripts/releases-to-pep-503.sh index/whl/metal '^[v]?[0-9]+\.[0-9]+\.[0-9]+-metal$'
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          # Upload only the generated index directory
          path: 'index'
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4


================================================
FILE: .github/workflows/publish-to-test.yaml
================================================
# Based on: https://packaging.python.org/en/latest/guides/publishing-package-distribution-releases-using-github-actions-ci-cd-workflows/

name: Publish to TestPyPI

on:
  workflow_dispatch:
    inputs:
      dev_version:
        description: 'Dev version N'
        required: true


jobs:
  build-n-publish:
    name: Build and publish
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4
      with:
        submodules: "recursive"
        
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: "3.11"
        cache: 'pip'
        
    - name: Append Dev Version to __version__
      run: |
        DEV_VERSION=${{ github.event.inputs.dev_version }}
        CURRENT_VERSION=$(awk -F= '/__version__ =/ {print $2}' llama_cpp/__init__.py | tr -d ' "')
        NEW_VERSION="${CURRENT_VERSION}.dev${DEV_VERSION}"
        sed -i 's/__version__ = \".*\"/__version__ = \"'"${NEW_VERSION}"'\"/' llama_cpp/__init__.py
        
    - name: Install dependencies (Linux/MacOS)
      if: runner.os != 'Windows'
      run: |
        python -m pip install --upgrade pip
        python -m pip install uv
        RUST_LOG=trace python -m uv pip install -e .[all] --verbose
      shell: bash

    - name: Install dependencies (Windows)
      if: runner.os == 'Windows'
      env:
        RUST_LOG: trace       
      run: |
        python -m pip install --upgrade pip
        python -m pip install uv
        python -m uv pip install -e .[all] --verbose
      shell: cmd
        
    - name: Build source distribution
      run: |
        # Ensure the `build` frontend is available before invoking it
        python -m pip install build
        python -m build --sdist
        
    - name: Publish to Test PyPI
      uses: pypa/gh-action-pypi-publish@release/v1
      with:
        password: ${{ secrets.TEST_PYPI_API_TOKEN }}
        repository-url: https://test.pypi.org/legacy/


================================================
FILE: .github/workflows/publish.yaml
================================================
name: Publish to PyPI

# Based on: https://packaging.python.org/en/latest/guides/publishing-package-distribution-releases-using-github-actions-ci-cd-workflows/

on: workflow_dispatch

jobs:
  build-n-publish:
    name: Build and publish
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4
      with:
        submodules: "recursive"

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: "3.9"

    - name: Install dependencies (Linux/MacOS)
      if: runner.os != 'Windows'
      run: |
        python -m pip install --upgrade pip
        python -m pip install uv
        RUST_LOG=trace python -m uv pip install -e .[all] --verbose
        python -m uv pip install build
      shell: bash

    - name: Install dependencies (Windows)
      if: runner.os == 'Windows'
      env:
        RUST_LOG: trace
      run: |
        python -m pip install --upgrade pip
        python -m pip install uv
        python -m uv pip install -e .[all] --verbose
        python -m uv pip install build
      shell: cmd

    - name: Build source distribution
      run: |
        python -m build --sdist

    - name: Publish distribution to PyPI
      # TODO: move to tag based releases
      # if: startsWith(github.ref, 'refs/tags')
      uses: pypa/gh-action-pypi-publish@release/v1
      with:
        password: ${{ secrets.PYPI_API_TOKEN }}


================================================
FILE: .github/workflows/test-pypi.yaml
================================================
name: Tests for PyPI package

on: workflow_dispatch

jobs:
  build-linux:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'

      - name: Install dependencies (Linux/MacOS)
        if: runner.os != 'Windows'
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          RUST_LOG=trace python -m uv pip install llama-cpp-python[all] --verbose 
        shell: bash

      - name: Install dependencies (Windows)
        if: runner.os == 'Windows'
        env:
          RUST_LOG: trace           
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          python -m uv pip install llama-cpp-python[all] --verbose 
        shell: cmd
          
      - name: Smoke test (import the installed package)
        run: |
          python -c "import llama_cpp"

  build-windows:

    runs-on: windows-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'
          
      - name: Install dependencies (Linux/MacOS)
        if: runner.os != 'Windows'
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          RUST_LOG=trace python -m uv pip install llama-cpp-python[all] --verbose 
        shell: bash

      - name: Install dependencies (Windows)
        if: runner.os == 'Windows'
        env:
          RUST_LOG: trace          
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          python -m uv pip install llama-cpp-python[all] --verbose 
        shell: cmd
          
      - name: Test package import
        run: |
          python -c "import llama_cpp"

  build-macos:

    runs-on: macos-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'   

      - name: Install dependencies (Linux/MacOS)
        if: runner.os != 'Windows'
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          RUST_LOG=trace python -m uv pip install llama-cpp-python[all] --verbose 
        shell: bash

      - name: Install dependencies (Windows)
        if: runner.os == 'Windows'
        env:
          RUST_LOG: trace  
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          python -m uv pip install llama-cpp-python[all] --verbose 
        shell: cmd
          
      - name: Test package import
        run: |
          python -c "import llama_cpp"


================================================
FILE: .github/workflows/test.yaml
================================================
name: Tests
on:
  pull_request:
    branches:
      - main
  push:
    branches:
      - main

env:
  REPO_ID: Qwen/Qwen2-0.5B-Instruct-GGUF
  MODEL_FILE: qwen2-0_5b-instruct-q8_0.gguf

jobs:
  download-model:
    runs-on: ubuntu-latest
    steps:
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.9"
      - name: Install huggingface-hub
        run: pip install huggingface-hub
      - name: Download model
        run: huggingface-cli download ${{ env.REPO_ID }} ${{ env.MODEL_FILE }}
      - name: Cache model
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface/hub
          key: ${{ runner.os }}-model-${{ env.REPO_ID }}-${{ env.MODEL_FILE }}

  build-linux:
    needs: download-model
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: "recursive"
          
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'
      - name: Restore model cache
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface/hub
          key: ${{ runner.os }}-model-${{ env.REPO_ID }}-${{ env.MODEL_FILE }}
      - name: Install dependencies (Linux/MacOS)
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          python -m uv pip install -e .[all] --verbose
        shell: bash
      - name: Test with pytest
        run: |
          python -m pytest

  build-windows:
    needs: download-model
    runs-on: windows-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: "recursive"
          
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'

      - name: Restore model cache
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface/hub
          key: ${{ runner.os }}-model-${{ env.REPO_ID }}-${{ env.MODEL_FILE }}

      - name: Install dependencies (Windows)
        run: |
          python -m pip install --upgrade pip
          python -m pip install uv
          python -m uv pip install -e .[all] --verbose
        shell: cmd
          
      - name: Test with pytest
        run: |
          python -m pytest

  build-macos:
    needs: download-model
    runs-on: macos-13
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: "recursive"
          
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'

      - name: System Info
        run: |
          uname -a
          sysctl -n machdep.cpu.brand_string
          python3 -c "import platform; print(platform.machine(), platform.architecture())"

      - name: Restore model cache
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface/hub
          key: ${{ runner.os }}-model-${{ env.REPO_ID }}-${{ env.MODEL_FILE }}
          
      - name: Install dependencies (Linux/MacOS)
        run: |
          python3 -m pip install --upgrade pip
          python3 -m pip install uv
          python3 -m uv pip install -e .[all] --verbose
          CMAKE_ARGS="-DLLAMA_METAL=off" python3 -m uv pip install .[all] --verbose
        shell: bash

      - name: Test with pytest
        run: |
          python3 -m pytest

  build-macos-metal:
    needs: download-model
    runs-on: macos-13
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: "recursive"
          
      - name: Set up Python 3.9
        uses: actions/setup-python@v5
        with:
          python-version: "3.9"

      - name: System Info
        run: |
          uname -a
          sysctl -n machdep.cpu.brand_string
          python3 -c "import platform; print(platform.machine(), platform.architecture())"

      - name: Restore model cache
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface/hub
          key: ${{ runner.os }}-model-${{ env.REPO_ID }}-${{ env.MODEL_FILE }}

      - name: Install dependencies
        run: |
          python3 -m pip install --upgrade pip
          CMAKE_ARGS="-DLLAMA_METAL=on" python3 -m pip install .[all] --verbose
        shell: bash

      - name: Test with pytest
        run: |
          python3 -m pytest


================================================
FILE: .gitignore
================================================
*.local

.python-version

.vscode/

_skbuild/

.envrc
.direnv

models/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
llama_cpp/*.so
llama_cpp/*.dylib
llama_cpp/*.metal
llama_cpp/*.dll
llama_cpp/*.lib

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# poetry
#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
#   This is especially recommended for binary packages to ensure reproducibility, and is more
#   commonly ignored for libraries.
#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#   in version control.
#   https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file.  For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/

# downloaded model .bin files
docker/open_llama/*.bin


================================================
FILE: .gitmodules
================================================
[submodule "vendor/llama.cpp"]
	path = vendor/llama.cpp
	url = https://github.com/ggerganov/llama.cpp.git


================================================
FILE: .readthedocs.yaml
================================================
# Read the Docs configuration file for MkDocs projects
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the version of Python and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.11"

mkdocs:
  configuration: mkdocs.yml

python:
  install:
    - method: pip
      path: .
    - requirements: docs/requirements.txt

submodules:
  include: all
  recursive: true

================================================
FILE: CHANGELOG.md
================================================
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.3.16]

- feat: Update llama.cpp to ggerganov/llama.cpp@4227c9be4268ac844921b90f31595f81236bd317

## [0.3.15]

- feat: Update llama.cpp to ggerganov/llama.cpp@9a96389544a08fd829fccda28142ce2066017fde
- feat: Add gpt-oss chat format support through strftime_now in chat format by @iamlemec in af637928db7351e030011085f818b034c6efc047
- fix: rename op_offloat to op_offload in llama.py by @sergey21000 in #2046

## [0.3.14]

- feat: Update llama.cpp to ggerganov/llama.cpp@79e0b68c178656bb0632cb8602d2940b755077f8

## [0.3.13]

- feat: Update llama.cpp to ggerganov/llama.cpp@bdca38376f7e8dd928defe01ce6a16218a64b040
- fix: Better chat format for Qwen2.5-VL by @alcoftTAO in #2040

## [0.3.12]

- feat: Update llama.cpp to ggerganov/llama.cpp@a0374a67e2924f2e845cdc59dd67d9a44065a89c

## [0.3.11]

- fix: Update reference to `llama_kv_cache_clear` in Llama.embed. Closes #2037 by @abetlen in 9e5a4eaa84156084ed7bbb91e6efcc91dc6217bc

## [0.3.10]

- feat: Update llama.cpp to ggerganov/llama.cpp@8846aace4934ad29651ea61b8c7e3f6b0556e3d2
- feat: Add support for llama.cpp multimodal, add Qwen2.5-VL chat handler by @abetlen in cd548bd0f14210627798237d5c2ea78acfb88ccb

## [0.3.9]

- feat: Update llama.cpp to ggerganov/llama.cpp@8733e0cf6eefc7c7752297cc22d0836706f4222c

## [0.3.8]

- feat: Update llama.cpp to ggerganov/llama.cpp@7841fc723e059d1fd9640e5c0ef19050fcc7c698

## [0.3.7]

- feat: Update llama.cpp to ggerganov/llama.cpp@794fe23f29fb40104975c91fe19f23798f7c726e
- fix(ci): Fix the CUDA workflow by @oobabooga in #1894
- fix: error showing time spent in llama perf context print, adds `no_perf` flag to `Llama` class by @shakalaca in #1898

## [0.3.6]

- feat: Update llama.cpp to ggerganov/llama.cpp@f7cd13301c2a88f97073fd119072b4cc92c08df1
- fix(server): streaming resource lock by @gjpower in #1879

## [0.3.5]

- feat: Update llama.cpp to ggerganov/llama.cpp@26a8406ba9198eb6fdd8329fa717555b4f77f05f
- fix(ci): Fix release by updating macos runner image to non-deprecated version by @abetlen in afedfc888462f9a6e809dc9455eb3b663764cc3f
- fix(server): add missing await statements for async exit_stack handling by @gjpower in #1858

## [0.3.4]

- fix(ci): Build wheels for macos 13-15, cuda 12.1-12.4 by @abetlen in ca808028bd16b8327bd84128d48015a4b1304690

## [0.3.3]

- feat: Update llama.cpp to ggerganov/llama.cpp@ce8784bdb153ff7794dde5a50b0ebfa51baa6171
- fix: chat API logprobs format by @domdomegg in #1788
- feat: Add support for CUDA 12.6, fix CUDA 12.5 by @Smartappli in #1775
- fix: Make content not required in ChatCompletionRequestAssistantMessage by @feloy in #1807
- fix: Fix pickling of Llama class by setting seed from _seed member by @abetlen in 2523472c3eccb9ab9277117cc4ff705212b6888a
- fix: Fix logit-bias type hint by @ddh0 in #1802
- fix(server): Avoid thread starvation on many concurrent requests by making use of asyncio to lock llama_proxy context by @gjpower in #1798
- fix(server): Added missing exit_stack.close() to /v1/chat/completions by @Ian321 in #1796
- fix(examples): Refactor Batching notebook to use new sampler chain API by @lukestanley in #1793
- fix(docs): Update development instructions by @Florents-Tselai in #1833
- fix(docs): Remove ref to llama_eval in llama_cpp.py docs by @richdougherty in #1819

## [0.3.2]

- feat: Update llama.cpp to ggerganov/llama.cpp@74d73dc85cc2057446bf63cc37ff649ae7cebd80

## [0.3.1]

- feat: Update llama.cpp to ggerganov/llama.cpp@c919d5db39c8a7fcb64737f008e4b105ee0acd20
- feat: Expose libggml in internal APIs by @abetlen in #1761
- fix: Fix speculative decoding by @abetlen in 9992c5084a3df2f533e265d10f81d4269b97a1e6 and e975dabf74b3ad85689c9a07719cbb181313139b
- misc: Rename all_text to remaining_text by @xu-song in #1658

## [0.3.0]

- feat: Update llama.cpp to ggerganov/llama.cpp@ea9c32be71b91b42ecc538bd902e93cbb5fb36cb
- feat: Enable detokenizing special tokens with special=True by @benniekiss in #1596
- feat(ci): Speed up CI workflows using uv, add support for CUDA 12.5 wheels by @Smartappli in e529940f45d42ed8aa31334123b8d66bc67b0e78
- feat: Add loading sharded GGUF files from HuggingFace with Llama.from_pretrained(additional_files=[...]) by @Gnurro in 84c092063e8f222758dd3d60bdb2d1d342ac292e
- feat: Add option to configure n_ubatch by @abetlen in 6c44a3f36b089239cb6396bb408116aad262c702
- feat: Update sampling API for llama.cpp. Sampling now uses sampler chain by @abetlen in f8fcb3ea3424bcfba3a5437626a994771a02324b
- fix: Don't store scores internally unless logits_all=True. Reduces memory requirements for large context by @abetlen in 29afcfdff5e75d7df4c13bad0122c98661d251ab
- fix: Fix memory allocation of ndarray in by @xu-song in #1704
- fix: Use system message in og qwen format by @abetlen in 98eb092d3c6e7c142c4ba2faaca6c091718abbb3


## [0.2.90]

- feat: Update llama.cpp to ggerganov/llama.cpp@1d1ccce67613674c75c9c7e3fa4c1e24e428ba48
- feat: Add support for `MiniCPMv26ChatHandler` and `minicpm-v-26` in server by @abetlen in f70df824985d875226793b94dacc0c302a4256b2

## [0.2.89]

- feat: Update llama.cpp to ggerganov/llama.cpp@cfac111e2b3953cdb6b0126e67a2487687646971
- fix: Llama.close didn't free lora adapter by @jkawamoto in #1679
- fix: missing dependencies for test by @jkawamoto in #1680

## [0.2.88]

- feat: Update llama.cpp to ggerganov/llama.cpp@fc4ca27b25464a11b3b86c9dbb5b6ed6065965c2
- fix: only print 'cache saved' in verbose mode by @lsorber in #1668 
- fix: Added back from_file method to LlamaGrammar by @ExtReMLapin in #1673
- fix: grammar prints on each call by @abetlen in 0998ea0deea076a547d54bd598d6b413b588ee2b
- feat: Enable recursive search of HFFS.ls when using from_pretrained by @benHeidabetlen in #1656
- feat: Add more detailed log for prefix-match by @xu-song in #1659

## [0.2.87]

- feat: Update llama.cpp to ggerganov/llama.cpp@be55695eff44784a141a863f273661a6bce63dfc
- fix: Include all llama.cpp source files and subdirectories by @abetlen in 9cad5714ae6e7c250af8d0bbb179f631368c928b
- feat(ci): Re-build wheel index automatically when releases are created by @abetlen in 198f47dc1bd202fd2b71b29e041a9f33fe40bfad

## [0.2.86]

- feat: Update llama.cpp to ggerganov/llama.cpp@398ede5efeb07b9adf9fbda7ea63f630d476a792
- feat: Ported back new grammar changes from C++ to Python implementation by @ExtReMLapin in #1637
- fix: llama_grammar_accept_token arg order by @tc-wolf in #1649

## [0.2.85]

- feat: Update llama.cpp to ggerganov/llama.cpp@398ede5efeb07b9adf9fbda7ea63f630d476a792
- fix: Missing LoRA adapter after API change by @shamitv in #1630
- fix(docker): Update Dockerfile BLAS options by @olivierdebauche in #1632
- fix(docker): Fix GGML_CUDA param by @olivierdebauche in #1633
- fix(docker): Update Dockerfile build options from `LLAMA_` to `GGML_` by @olivierdebauche in #1634
- feat: FreeBSD compatibility by @yurivict in #1635

## [0.2.84]

- feat: Update llama.cpp to ggerganov/llama.cpp@4730faca618ff9cee0780580145e3cbe86f24876
- fix: Correcting run.sh filepath in Simple Docker implementation by @mashuk999 in #1626

## [0.2.83]

- feat: Update llama.cpp to ggerganov/llama.cpp@081fe431aa8fb6307145c4feb3eed4f48cab19f8
- feat: Add 'required' literal to ChatCompletionToolChoiceOption by @mjschock in #1597
- fix: Change repeat_penalty to 1.0 to match llama.cpp defaults by @ddh0 in #1590
- fix(docs): Update README.md typo by @ericcurtin in #1589
- fix(server): Use split_mode from model settings by @grider-withourai in #1594
- feat(ci): Dockerfile update base images and post-install cleanup by @Smartappli in #1530

## [0.2.82]

- feat: Update llama.cpp to ggerganov/llama.cpp@7fdb6f73e35605c8dbc39e9f19cd9ed84dbc87f2

## [0.2.81]

- feat: Update llama.cpp to ggerganov/llama.cpp@968967376dc2c018d29f897c4883d335bbf384fb
- fix(ci): Fix CUDA wheels, use LLAMA_CUDA instead of removed LLAMA_CUBLAS by @abetlen in 4fb6fc12a02a68884c25dd9f6a421cacec7604c6
- fix(ci): Fix MacOS release, use macos-12 image instead of removed macos-11 by @abetlen in 3a551eb5263fdbd24b36d7770856374c04e92788

## [0.2.80]

- feat: Update llama.cpp to ggerganov/llama.cpp@023b8807e10bc3ade24a255f01c1ad2a01bb4228
- fix(server): Fix bug in FastAPI streaming response where dependency was released before request completes causing SEGFAULT by @abetlen in 296304b60bb83689659883c9cc24f4c074dd88ff
- fix(server): Update default config value for embeddings to False to fix error in text generation where logits were not allocated by llama.cpp by @abetlen in bf5e0bb4b151f4ca2f5a21af68eb832a96a79d75
- fix(ci): Fix the CUDA workflow by @oobabooga in #1551
- docs: Update readme examples to use newer Qwen2 model by @jncraton in #1544

## [0.2.79]

- feat: Update llama.cpp to ggerganov/llama.cpp@9c77ec1d74874ee22bdef8f110e8e8d41389abf2
- feat(ci): Update workflows and pre-built wheels by @Smartappli in #1416
- feat: Add .close() method to Llama class to explicitly free model from memory by @jkawamoto in #1513
- feat: Support SPM infill by @CISC in #1492

## [0.2.78]

- feat: Update llama.cpp to ggerganov/llama.cpp@fd5ea0f897ecb3659d6c269ef6f3d833e865ead7
- fix: Avoid duplicate special tokens in chat formats by @CISC in #1439
- fix: fix logprobs when BOS is not present by @ghorbani in #1471
- feat: adding rpc_servers parameter to Llama class by @chraac in #1477

## [0.2.77]

- feat: Update llama.cpp to ggerganov/llama.cpp@bde7cd3cd949c1a85d3a199498ac98e78039d46f
- fix: string value kv_overrides by @abetlen in df45a4b3fe46e72664bda87301b318210c6d4782
- fix: Fix typo in Llama3VisionAlphaChatHandler by @abetlen in 165b4dc6c188f8fda2fc616154e111f710484eba
- fix: Use numpy recarray for candidates data, fixes bug with temp < 0 by @abetlen in af3ed503e9ce60fe6b5365031abad4176a3536b3
- fix: Disable Windows+CUDA workaround when compiling for HIPBLAS by @Engininja2 in #1493

## [0.2.76]

- feat: Update llama.cpp to ggerganov/llama.cpp@0df0aa8e43c3378975269a51f9b876c8692e70da
- feat: Improve Llama.eval performance by avoiding list conversion by @thoughtp0lice in #1476
- example: LLM inference with Ray Serve by @rgerganov in #1465

## [0.2.75]

- feat: Update llama.cpp to ggerganov/llama.cpp@13ad16af1231ab2d245d35df3295bcfa23de1305
- fix: segfault for models without eos / bos tokens by @abetlen in d99a6ba607a4885fb00e63e967964aa41bdbbbcb
- feat: add MinTokensLogitProcessor and min_tokens argument to server by @twaka in #1333
- misc: Remove unnecessary metadata lookups by @CISC in #1448

## [0.2.74]

- feat: Update llama.cpp to ggerganov/llama.cpp@b228aba91ac2cd9eb90e9d423ba1d0d20e0117e2
- fix: Enable CUDA backend for llava by @abetlen in 7f59856fa6f3e23f07e12fc15aeb9359dc6c3bb4
- docs: Fix typo in README.md by @yupbank in #1444

## [0.2.73]

- feat: Update llama.cpp to ggerganov/llama.cpp@25c6e82e7a1ad25a42b0894e87d9b5c557409516
- fix: Clear kv cache at beginning of image chat formats to avoid bug when image is evaluated first by @abetlen in ac55d0a175115d1e719672ce1cb1bec776c738b1

## [0.2.72]

- fix(security): Remote Code Execution by Server-Side Template Injection in Model Metadata by @retr0reg in b454f40a9a1787b2b5659cd2cb00819d983185df
- fix(security): Update remaining jinja chat templates to use immutable sandbox by @CISC in #1441

## [0.2.71]

- feat: Update llama.cpp to ggerganov/llama.cpp@911b3900dded9a1cfe0f0e41b82c7a29baf3a217
- fix: Make leading bos_token optional for image chat formats, fix nanollava system message by @abetlen in 77122638b4153e31d9f277b3d905c2900b536632
- fix: free last image embed in llava chat handler by @abetlen in 3757328b703b2cd32dcbd5853271e3a8c8599fe7

## [0.2.70]

- feat: Update llama.cpp to ggerganov/llama.cpp@c0e6fbf8c380718102bd25fcb8d2e55f8f9480d1
- feat: fill-in-middle support by @CISC in #1386
- fix: adding missing args in create_completion for functionary chat handler by @skalade in #1430
- docs: update README.md by @eltociear in #1432
- fix: chat_format log where auto-detected format prints None by @balvisio in #1434
- feat(server): Add support for setting root_path by @abetlen in 0318702cdc860999ee70f277425edbbfe0e60419
- feat(ci): Add docker checks and check deps more frequently by @Smartappli in #1426
- fix: detokenization case where first token does not start with a leading space by @noamgat in #1375
- feat: Implement streaming for Functionary v2 + Bug fixes by @jeffrey-fong in #1419
- fix: Use memmove to copy str_value kv_override by @abetlen in 9f7a85571ae80d3b6ddbd3e1bae407b9f1e3448a
- feat(server): Remove temperature bounds checks for server by @abetlen in 0a454bebe67d12a446981eb16028c168ca5faa81
- fix(server): Propagate flash_attn to model load by @dthuerck in #1424

## [0.2.69]

- feat: Update llama.cpp to ggerganov/llama.cpp@6ecf3189e00a1e8e737a78b6d10e1d7006e050a2
- feat: Add llama-3-vision-alpha chat format by @abetlen in 31b1d95a6c19f5b615a3286069f181a415f872e8
- fix: Change default value of verbose in image chat format handlers to True to match Llama by @abetlen in 4f01c452b6c738dc56eacac3758119b12c57ea94
- fix: Suppress all logs when verbose=False, use hardcoded fileno's to work in colab notebooks by @abetlen in f116175a5a7c84569c88cad231855c1e6e59ff6e
- fix: UTF-8 handling with grammars by @jsoma in #1415

## [0.2.68]

- feat: Update llama.cpp to ggerganov/llama.cpp@77e15bec6217a39be59b9cc83d6b9afb6b0d8167
- feat: Add option to enable flash_attn to Llama params and ModelSettings by @abetlen in 22d77eefd2edaf0148f53374d0cac74d0e25d06e
- fix(ci): Fix build-and-release.yaml by @Smartappli in #1413

## [0.2.67]

- fix: Ensure image renders before text in chat formats regardless of message content order by @abetlen in 3489ef09d3775f4a87fb7114f619e8ba9cb6b656
- fix(ci): Fix bug in use of upload-artifact failing to merge multiple artifacts into a single release by @abetlen in d03f15bb73a1d520970357b702a9e7d4cc2a7a62

## [0.2.66]

- feat: Update llama.cpp to ggerganov/llama.cpp@8843a98c2ba97a25e93319a104f9ddfaf83ce4c4
- feat: Generic Chat Formats, Tool Calling, and Huggingface Pull Support for Multimodal Models (Obsidian, LLaVA1.6, Moondream) by @abetlen in #1147
- fix(ci): Workflow actions updates and fix arm64 wheels not included in release by @Smartappli in #1392
- ci: Add support for pre-built cuda 12.4.1 wheels by @Smartappli in #1388
- feat: Add support for str type kv_overrides by @abetlen in a411612b385cef100d76145da1fbd02a7b7cc894
- fix: Functionary bug fixes by @jeffrey-fong in #1385
- examples: fix quantize example by @iyubondyrev in #1387
- ci: Update dependabot.yml by @Smartappli in #1391

## [0.2.65]

- feat: Update llama.cpp to ggerganov/llama.cpp@46e12c4692a37bdd31a0432fc5153d7d22bc7f72
- feat: Allow for possibly non-pooled embeddings by @iamlemec in #1380

## [0.2.64]

- feat: Update llama.cpp to ggerganov/llama.cpp@4e96a812b3ce7322a29a3008db2ed73d9087b176
- feat: Add `llama-3` chat format by @andreabak in #1371
- feat: Use new llama_token_is_eog in create_completions by @abetlen in d40a250ef3cfaa8224d12c83776a2f1de96ae3d1
- feat(server): Provide ability to dynamically allocate all threads if desired using -1 by @sean-bailey in #1364
- ci: Build arm64 wheels by @gaby in 611781f5319719a3d05fefccbbf0cc321742a026
- fix: Update scikit-build-core build dependency avoid bug in 0.9.1 by @evelkey in #1370

## [0.2.63]

- feat: Update llama.cpp to ggerganov/llama.cpp@0e4802b2ecbaab04b4f829fde4a3096ca19c84b5
- feat: Add stopping_criteria to ChatFormatter, allow stopping on arbitrary token ids, fixes llama3 instruct by @abetlen in cc81afebf04d26ca1ac3cf72f23f18da6ab58588

## [0.2.62]

- feat: Update llama.cpp to ggerganov/llama.cpp@3b8f1ec4b18770531d0b1d792f3edf08254e4f0c
- feat: update grammar schema converter to match llama.cpp by @themrzmaster in #1353
- feat: add disable_ping_events flag by @khimaros in #1257
- feat: Make saved state more compact on-disk by @tc-wolf in #1296
- feat: Use all available CPUs for batch processing by @ddh0 in #1345

## [0.2.61]

- feat: Update llama.cpp to ggerganov/llama.cpp@ba5e134e073ec6837078c874aba44a702944a676
- fix: pass correct type to chat handlers for chat completion logprobs by @abetlen in bb65b4d76411112c6fb0bf759efd746f99ef3c6b
- feat: Add support for yaml based server configs by @abetlen in 060bfa64d529ade2af9b1f4e207a3937bbc4138f
- feat: Add typechecking for ctypes structure attributes by @abetlen in 1347e1d050fc5a9a32ffe0bb3e22858da28003bd

## [0.2.60]

- feat: Update llama.cpp to ggerganov/llama.cpp@75cd4c77292034ecec587ecb401366f57338f7c0
- fix: Always embed metal library by @abetlen in b3bfea6dbfb6ed9ce18f9a2723e0a9e4bd1da7ad
- fix: missing logprobs in response, incorrect response type for functionary by @abetlen in 1ae3abbcc3af7f4a25a3ffc40b246f18039565e8
- fix(docs): incorrect tool_choice example by @CISC in #1330

## [0.2.59]

- feat: Update llama.cpp to ggerganov/llama.cpp@ba0c7c70ab5b15f1f2be7fb0dfbe0366dda30d6c
- feat: Binary wheels for CPU, CUDA (12.1 - 12.3), Metal by @abetlen, @jllllll, and @oobabooga in #1247
- fix: segfault when logits_all=False by @abetlen in 8649d7671bd1a7c0d9cc6a5ad91c6ca286512ab3
- fix: last tokens passing to sample_repetition_penalties function by @ymikhailov in #1295

## [0.2.58]

- feat: Update llama.cpp to ggerganov/llama.cpp@ba0c7c70ab5b15f1f2be7fb0dfbe0366dda30d6c
- feat: add support for KV cache quantization options by @Limour-dev in #1307
- feat: Add logprobs support to chat completions by @windspirit95 in #1311
- fix: set LLAMA_METAL_EMBED_LIBRARY=on on MacOS arm64 by @bretello in #1289
- feat: Add tools/functions variables to Jinja2ChatFormatter, add function response formatting for all simple chat formats by @CISC in #1273
- fix: Changed local API doc references to hosted by @lawfordp2017 in #1317

## [0.2.57]

- feat: Update llama.cpp to ggerganov/llama.cpp@ac9ee6a4ad740bc1ee484ede43e9f92b5af244c1
- fix: set default embedding pooling type to unspecified by @abetlen in 4084aabe867b8ec2aba1b22659e59c9318b0d1f3
- fix: Fix and optimize functionary chat handler by @jeffrey-fong in #1282
- fix: json mode for basic chat formats by @abetlen in 20e6815252d0efd9f015f7adbf108faaf36e3f3c

## [0.2.56]

- feat: Update llama.cpp to ggerganov/llama.cpp@c2101a2e909ac7c08976d414e64e96c90ee5fa9e
- feat(server): Add endpoints for tokenize, detokenize and count tokens by @felipelo in #1136
- feat: Switch embed to llama_get_embeddings_seq by @iamlemec in #1263
- fix: Fixed json strings grammar by blacklisting character control set by @ExtReMLapin in d02a9cf16ff88ad011e2eb1ce29f4d9400f13cd1
- fix: Check for existence of clip model path by @kejcao in #1264

## [0.2.55]

- feat: Update llama.cpp to ggerganov/llama.cpp@9731134296af3a6839cd682e51d9c2109a871de5
- docs: fix small typo in README: 'model know how' -> 'model knows how' by @boegel in #1244

## [0.2.54]

- feat: Update llama.cpp to ggerganov/llama.cpp@cb49e0f8c906e5da49e9f6d64a57742a9a241c6a
- docs: fix typo in README.md embeddings example by @iamlemec in #1232

## [0.2.53]

- feat: Update llama.cpp to ggerganov/llama.cpp@cb49e0f8c906e5da49e9f6d64a57742a9a241c6a
- fix: eos/bos_token set correctly for Jinja2ChatFormatter and automatic chat formatter by @CISC in #1230

## [0.2.52]

- feat: Update llama.cpp to ggerganov/llama.cpp@a33e6a0d2a66104ea9a906bdbf8a94d050189d91
- fix: Llava15ChatHandler (this function takes at least 4 arguments) by @abetlen in 8383a9e5620f5df5a88f62da16813eac200dd706

## [0.2.51]

- feat: Update llama.cpp to ggerganov/llama.cpp@c39373398803c669056304090050fe3f44b41bf9
- fix: Restore type hints for low-level api by @abetlen in 19234aa0dbd0c3c87656e65dd2b064665371925b

## [0.2.50]

- docs: Update Functionary OpenAI Server Readme by @jeffrey-fong in #1193
- fix: LlamaHFTokenizer now receives pre_tokens by @abetlen in 47bad30dd716443652275099fa3851811168ff4a

## [0.2.49]

- fix: module 'llama_cpp.llama_cpp' has no attribute 'c_uint8' in Llama.save_state by @abetlen in db776a885cd4c20811f22f8bd1a27ecc71dba927
- feat: Auto detect Mixtral's slightly different format by @lukestanley in #1214

## [0.2.48]

- feat: Update llama.cpp to ggerganov/llama.cpp@15499eb94227401bdc8875da6eb85c15d37068f7
- feat: Add Google's Gemma formatting via chat_format="gemma" by @alvarobartt in #1210
- feat: support minItems/maxItems in JSON grammar converter by @nopperl in 3921e10770996d95a9eb22c8248bacef39f69365
- fix: Update from_pretrained defaults to match hf_hub_download and pull to local cache folder by @abetlen in e6d6260a91b7831733f7d1f73c7af46a3e8185ed
- fix: Raise exceptions when llama model or context fails to load by @abetlen in dd22010e85265ae840c76ec835d67a29ed852722
- docs: Update README.md to fix pip install llama cpp server by @audip in #1187

## [0.2.47]

- feat: Update llama.cpp to ggerganov/llama.cpp@973053d8b0d04809836b3339a50f68d9c842de90

## [0.2.46]

- feat: Update llama.cpp to ggerganov/llama.cpp@ba2135ccae7462470b3865c6e41d2e1d734eac05
- feat: Pull models directly from huggingface by @abetlen in #1206
- feat(low-level-api): Improve API static type-safety and performance. Low level api functions are positional args only now. by @abetlen in #1205

## [0.2.45]

- feat: Update llama.cpp to ggerganov/llama.cpp@89febfed9322c8849520dc63c93ee4f5fd72556e

## [0.2.44]

- feat: Update llama.cpp to ggerganov/llama.cpp@4524290e87b8e107cc2b56e1251751546f4b9051
- fix: create_embedding broken response for input type str by @abetlen in 0ce66bc080fe537590b05b24bf442480bf2dd045
- fix: Use '\n' separator for EventSourceResponse by @khimaros in #1188
- fix: Incorporate embedding pooling layer fixes by @iamlemec in #1194

## [0.2.43]

- feat: Update llama.cpp to ggerganov/llama.cpp@8084d554406b767d36b3250b3b787462d5dd626f
- feat: Support batch embeddings by @iamlemec in #1186
- fix: submodule kompute is not included in sdist by @abetlen in 7dbbfdecadebe7750be650d9409959640ff9a460
- fix: Update openbuddy prompt format by @abetlen in 07a783779a62a4aac0b11161c7e0eb983ff215f8

## [0.2.42]

- feat: Update llama.cpp to ggerganov/llama.cpp@ea9c8e11436ad50719987fa23a289c74b7b40d40
- fix: sample idx off-by-one error for logit_processors by @lapp0 in #1179
- fix: chat formatting bugs in `chatml-function-calling` by @abetlen in 4b0e3320bd8c2c209e29978d0b21e2e471cc9ee3 and 68fb71b6a26a1e57331868f959b47ab4b87851e1

## [0.2.41]

- feat: Update llama.cpp to ggerganov/llama.cpp@895407f31b358e3d9335e847d13f033491ec8a5b
- fix: Don't change order of json schema object properties in generated grammar unless prop_order is passed by @abetlen in d1822fed6b706f38bd1ff0de4dec5baaa3cf84fa

## [0.2.40]

- feat: Update llama.cpp to ggerganov/llama.cpp@3bdc4cd0f595a6096cca4a64aa75ffa8a3503465
- feat: Generic chatml Function Calling using `chat_format="chatml-function-calling"` by @abetlen in #957
- fix: Circular dependency preventing early Llama object free by @notwa in #1176
- docs: Set the correct command for compiling with SYCL support by @akarshanbiswas in #1172
- feat: use gpu backend for clip if available by @iamlemec in #1175

## [0.2.39]

- feat: Update llama.cpp to ggerganov/llama.cpp@b08f22c882a1443e6b97081f3ce718a4d1a741f8
- fix: Fix destructor logging bugs by using llama_log_callback to avoid suppress_stdout_stderr by @abetlen in 59760c85eddc72dfcc1839f43760ef72c23d6874

## [0.2.38]

- feat: Update llama.cpp to ggerganov/llama.cpp@1cfb5372cf5707c8ec6dde7c874f4a44a6c4c915
- feat: Add speculative decoding by @abetlen in #1120
- fix: Pass raise_exception and add_generation_prompt to jinja2 chat template by @abetlen in 078cca0361bf5a94d2cf52ed04980d20e32d6f95

## [0.2.37]

- feat: Update llama.cpp to ggerganov/llama.cpp@fea4fd4ba7f6b754ac795387b275e1a014a77bde
- feat: Automatically set chat format from gguf by @abetlen in #1110

## [0.2.36]

- feat: Update llama.cpp to ggerganov/llama.cpp@2aed77eb06a329f0d82bb1c467f4244904d4073f
- feat: Add mistral instruct chat format as "mistral-instruct" by @Rafaelblsilva in #799

## [0.2.35]

- feat: Update llama.cpp to ggerganov/llama.cpp@d2f650cb5b04ee2726663e79b47da5efe196ce00

## [0.2.34]

- feat: Update llama.cpp to ggerganov/llama.cpp@6db2b41a76ee78d5efdd5c3cddd5d7ad3f646855
- feat: Add json schema mode by @abetlen in #1122

## [0.2.33]

- feat: Update llama.cpp to ggerganov/llama.cpp@faa3526a1eba458120987ed8269e5616385a76f4
- feat(server): include llama-cpp-python version in openapi spec by @abetlen in cde7514c3d28e6d52f272614e9957208c344dde5
- fix: use both eos and bos tokens as stop sequences for hf-tokenizer-config chat format. by @abetlen in 5b982d0f8c6f35242c8862ffdce00e17cea0b44f
- fix: GGUF metadata KV overrides, re #1011 by @phiharri in #1116
- fix: llama_log_set should be able to accept null pointer by @abetlen in c970d41a85381fd55235136f123422df0bf0c7e7

## [0.2.32]

- feat: Update llama.cpp to ggerganov/llama.cpp@504dc37be8446fb09b1ede70300250ad41be32a2
- fix: from_json_schema oneof/anyof bug by @jndiogo in d3f5528ca8bcb9d69d4f27e21631e911f1fb9bfe
- fix: pass chat handler not chat formatter for huggingface autotokenizer and tokenizer_config formats by @abetlen in 24f39454e91cf5dddbc4b6041aead4accc7c7a2d
- feat: Add add_generation_prompt option for jinja2chatformatter by @abetlen in 7f3209b1eb4ad3260ba063801fab80a8c25a2f4c
- feat: Add Jinja2ChatFormatter by @abetlen in be09318c26add8674ce494ae7cc480cce72a4146
- feat: Expose gguf model metadata in metadata property by @abetlen in 5a34c57e5479e50c99aba9b38218cc48e6560b81

## [0.2.31]

- feat: Update llama.cpp to ggerganov/llama.cpp@a5cacb22b2114fd9adf61c00cbb237384d86bced
- fix: Mirostat sampling now passes correct type to ctypes and tracks state during generation by @abetlen in 3babe3512cb95743108f2b595210c38ed6f1b904
- fix: Python3.8 support in server by @abetlen in 141293a75b564a8699e0acba1da24d9aa1cf0ab1

## [0.2.30]

- feat: Update llama.cpp to ggerganov/llama.cpp@57e2a7a52a819883f40dada8a2edc24ecf48186b
- feat(server): Add ability to load chat format from huggingface autotokenizer or tokenizer_config.json files by @abetlen in b8fc1c7d83ad4a9207c707ba1d954fe580286a01
- feat: Integration of Jinja2 Templating for chat formats by @teleprint-me in #875
- fix: Offload KQV by default by @abetlen in 48c3b77e6f558a9899de0e1155c7dc0c7958d8e8
- fix: Support Accept text/event-stream in chat and completion endpoints, resolves #1083 by @aniljava in #1088
- fix(cli): allow passing n_ctx=0 to openAI API server args to use model n_ctx_train field per #1015 by @K-Mistele in #1093

## [0.2.29]

- feat: Update llama.cpp to ggerganov/llama.cpp@4483396751c79dea540808b9cb9238245d06da2b
- feat: Add split_mode option by @abetlen in 84615adbc6855c8384807c42f0130f9a1763f99d
- feat: Implement GGUF metadata KV overrides by @phiharri in #1011
- fix: Avoid "LookupError: unknown encoding: ascii" when open() called in a destructor by @yieldthought in #1012
- fix: Fix low_level_api_chat_cpp example to match current API by @aniljava in #1086
- fix: Fix Pydantic model parsing by @DeNeutoy in #1087

## [0.2.28]

- feat: Update llama.cpp to ggerganov/llama.cpp@6efb8eb30e7025b168f3fda3ff83b9b386428ad6
- feat: Add ability to pass in penalize_nl param by @shankinson in #1068
- fix: print_grammar to stderr by @turian in #1052

## [0.2.27]

- feat: Update llama.cpp to ggerganov/llama.cpp@b3a7c20b5c035250257d2b62851c379b159c899a
- feat: Add `saiga` chat format by @femoiseev in #1050
- feat: Added `chatglm3` chat format by @xaviviro in #1059
- fix: Correct typo in README.md by @qeleb in #1058

## [0.2.26]

- feat: Update llama.cpp to ggerganov/llama.cpp@f6793491b5af6da75edad34d6f503ef86d31b09f

## [0.2.25]

- feat(server): Multi model support by @D4ve-R in #931
- feat(server): Support none defaulting to infinity for completions by @swg in #111
- feat(server): Implement openai api compatible authentication by @docmeth2 in #1010
- fix: text_offset of multi-token characters by @twaka in #1037
- fix: ctypes bindings for kv override by @phiharri in #1011
- fix: ctypes definitions of llama_kv_cache_view_update and llama_kv_cache_view_free. by @e-c-d in #1028

## [0.2.24]

- feat: Update llama.cpp to ggerganov/llama.cpp@0e18b2e7d0b5c0a509ea40098def234b8d4a938a
- feat: Add offload_kqv option to llama and server by @abetlen in 095c65000642a3cf73055d7428232fb18b73c6f3
- feat: n_ctx=0 now uses the n_ctx_train of the model by @DanieleMorotti in #1015
- feat: logits_to_logprobs supports both 2-D and 3-D logits arrays by @kddubey in #1002
- fix: Remove f16_kv, add offload_kqv fields in low level and llama apis by @brandonrobertz in #1019
- perf: Don't convert logprobs arrays to lists by @kddubey in #1021
- docs: Fix README.md functionary demo typo by @evelynmitchell in #996
- examples: Update low_level_api_llama_cpp.py to match current API by @jsoma in #1023

## [0.2.23]

- Update llama.cpp to ggerganov/llama.cpp@948ff137ec37f1ec74c02905917fa0afc9b97514
- Add qwen chat format by @yhfgyyf in #1005
- Add support for running the server with SSL by @rgerganov in #994
- Replace logits_to_logprobs implementation with numpy equivalent to llama.cpp by @player1537 in #991
- Fix UnsupportedOperation: fileno in suppress_stdout_stderr by @zocainViken in #961
- Add Pygmalion chat format by @chiensen in #986
- README.md multimodal params fix by @zocainViken in #967
- Fix minor typo in README by @aniketmaurya in #958

## [0.2.22]

- Update llama.cpp to ggerganov/llama.cpp@8a7b2fa528f130631a5f43648481596ab320ed5a
- Fix conflict with transformers library by @kddubey in #952

## [0.2.21]

- Update llama.cpp to ggerganov/llama.cpp@64e64aa2557d97490b2fe1262b313e2f4a1607e3
- Make building llava optional by setting `CMAKE_ARGS="-DLLAVA_BUILD=OFF"` and using `LLAVA_CPP_LIB` to specify alternative path to shared library by @abetlen in e3941d9c674dbd9891dc3ceda390daeb21f05fd1

## [0.2.20]

- Update llama.cpp to ggerganov/llama.cpp@b38a16dfcff88d547f78f52d1bea31b84a05aff7
- Add `zephyr` chat format by @fakerybakery in #938
- Add `baichuan` chat format by @caiyesd in #938
- Add `baichuan-2` chat format by @caiyesd in #936
- Improve documentation for server chat formats by @jooray in #934
- Fix typo in README by @antonvice in #940
- Fix typo in the Open Orca chat format by @gardner in #947

## [0.2.19]

- Update llama.cpp to ggerganov/llama.cpp@0b871f1a04ef60e114bbe43004fd9c21114e802d
- Fix #569: stop parameter in chat completion api should accept str by @abetlen in 128dc4731fa846ead7e684a137ca57d8931b8899
- Document server host and port parameters by @jamesbraza in #768
- Do not set grammar to None when initializing LlamaGrammar by @mthuurne in #834
- Add mistrallite, intel, and openchat formats by @fakerybakery in #927
- Add support for min_p parameter by @tk-master in #921
- Fix #929: tokenizer adding leading space when generating from empty prompt by @abetlen in a34d48014192771d2e308a76c22f33bc0318d983
- Fix low level api example by @zocainViken in #925
- Fix missing package in openblas docker image by @ZisisTsatsas in #920

## [0.2.18]

- Update llama.cpp to ggerganov/llama.cpp@6bb4908a17150b49373b5f977685b2e180a04f6f

## [0.2.17]

- Update llama.cpp to ggerganov/llama.cpp@df9d1293defe783f42bc83af732d3c670552c541
- Hotfix: Set `CUDA_ARCHITECTURES=OFF` for `llava_shared` target on Windows by @abetlen in 4388f3341413110217b98c4f097ac5c590bdf40b

## [0.2.16]

- Update llama.cpp to ggerganov/llama.cpp@a75fa576abba9d37f463580c379e4bbf1e1ad03c
- Add `set_seed` to `Llama` class by @abetlen in fd41ed3a908761d286102a019a34c2938a15118d
- Fix server doc arguments by @kjunggithub in #892
- Fix response_format handler in llava chat handler by @abetlen in b62c44983921197ed10a7d29dc4ba920e9979380
- Fix default max_tokens, chat completion is now unlimited (to context length) and completion is 16 tokens to match OpenAI defaults by @abetlen in e7962d2c733cbbeec5a37392c81f64185a9a39e8
- Fix json_schema_to_gbnf helper so that it takes a json schema string as input instead by @abetlen in faeae181b1e868643c0dc28fcf039f077baf0829
- Add support for $ref and $def in json_schema_to_gbnf to handle more complex function schemas by @abetlen in 770df344369c0630df1be14be9f9e301e7c56d24
- Update functionary chat handler for new OpenAI api by @abetlen in 1b376c62b775b401653facf25a519d116aafe99a
- Fix add default stop sequence to chatml chat format by @abetlen in b84d76a844149216d511cfd8cdb9827148a1853c
- Fix sampling bug when logits_all=False by @abetlen in 6f0b0b1b840af846938ed74d0e8170a91c40e617

## [0.2.15]

- Update llama.cpp to ggerganov/llama.cpp@0a7c980b6f94a049cb804573df2d8092a34df8e4
- Add support for Llava1.5 multimodal models by @damian0815 and @abetlen in #821
- Update OpenAI API compatibility to match dev day update by @abetlen in #821
- Add seed parameter to completion and chat_completion functions of Llama class by @abetlen in 86aeb9f3a14808575d2bb0076e6acb4a30907e6a
- Add JSON mode support to constrain chat completion to JSON objects by @abetlen in b30b9c338bf9af316d497ea501d39f5c246900db

## [0.2.14]

- Update llama.cpp to ggerganov/llama.cpp@f0b30ef7dc1360922ccbea0a8cd3918ecf15eaa7
- Add support for Huggingface Autotokenizer Chat Formats by @bioshazard and @abetlen in #790 and bbffdaebaa7bb04b543dbf683a07276087251f86
- Fix llama-2 chat format by @earonesty in #869
- Add support for functionary chat format by @abetlen in #784
- Migrate inference from deprecated `llama_eval` API to `llama_batch` and `llama_decode` by @abetlen in #795

## [0.2.13]

- Update llama.cpp to ggerganov/llama.cpp@51b2fc11f7f605fff49725a4540e9a6ef7b51b70
- Fix name 'open' is not defined exception when deleting model by @abetlen in 011b95d7f34cbfc528af75a892757bd9a20838ab
- Fix tokenization of special characters by @antoine-lizee in #850

## [0.2.12]

- Update llama.cpp to ggerganov/llama.cpp@50337961a678fce4081554b24e56e86b67660163
- Fix missing `n_seq_id` in `llama_batch` by @NickAlgra in #842
- Fix for shared libraries on Windows that start with `lib` prefix by @sujeendran in #848
- Fix exception raised in `__del__` when freeing models by @cebtenzzre in #846
- Performance improvement for logit bias by @zolastro in #851
- Fix suffix check arbitrary code execution bug by @mtasic85 in #854
- Fix typo in `function_call` parameter in `llama_types.py` by @akatora28 in #849
- Fix streaming not returning `finish_reason` by @gmcgoldr in #798
- Fix `n_gpu_layers` check to allow values less than 1 for server by @hxy9243 in #826
- Suppress stdout and stderr when freeing model by @paschembri in #803
- Fix `llama2` chat format by @delock in #808
- Add validation for tensor_split size by @eric1932 in #820
- Print stack trace on server error by @abetlen in d6a130a052db3a50975a719088a9226abfebb266
- Update docs for gguf by @johnccshen in #783
- Add `chatml` chat format by @abetlen in 305482bd4156c70802fc054044119054806f4126

## [0.2.11]

- Fix bug in `llama_model_params` object has no attribute `logits_all` by @abetlen in d696251fbe40015e8616ea7a7d7ad5257fd1b896

## [0.2.10]

- Fix bug 'llama_model_params' object has no attribute 'embedding' by @abetlen in 42bb721d64d744242f9f980f2b89d5a6e335b5e4

## [0.2.9]

- Fix critical bug in pip installation of v0.2.8 due to `.git` directory in ac853e01e1a217a578080a4e1b851d2d08450adf

## [0.2.8]

- Update llama.cpp to ggerganov/llama.cpp@40e07a60f9ce06e79f3ccd4c903eba300fb31b5e
- Add configurable chat formats by @abetlen in #711
- Fix rope scaling bug by @Josh-XT in #767
- Fix missing numa parameter in server by @abetlen in d9bce17794d0dd6f7962d10aad768fedecf3ab89

## [0.2.7]

- Update llama.cpp to ggerganov/llama.cpp@a98b1633d5a94d0aa84c7c16e1f8df5ac21fc850
- Install required runtime dlls to package directory on windows by @abetlen in 8d75016549e2ff62a511b1119d966ffc0df5c77b
- Add openai-processing-ms to server response header by @Tradunsky in #748
- Bump minimum version of scikit-build-core to 0.5.1 to fix msvc cmake issue by @abetlen in 1ed0f3ebe16993a0f961155aa4b2c85f1c68f668
- Update `llama_types.py` to better match the openai api, old names are aliased to new ones by @abetlen in dbca136feaaf7f8b1182c4c3c90c32918b1d0bb3

## [0.2.6]

- Update llama.cpp to 80291a1d02a07f7f66666fb576c5b1e75aa48b46

## [0.2.5]

- Fix docker images missing starlette-context dependency by @abetlen in 22917989003c5e67623d54ab45affa1e0e475410
- Fix loading dll in Windows Isolation Containers by @abetlen in 847466562573191efa655753d9252f308c4fbdb0
- Fix build issue on m1 macs by @abetlen in dbd3a6d1ed8416a8fd800127251e730153afa305
- Update docs to gguf and add hw acceleration docs for server by @jasonacox in #688

## [0.2.4]

- Add NUMA support. **NOTE** low level api users must call llama_backend_init at the start of their programs by @abetlen in f4090a0bb2a2a25acfe28d31c82cc1aa273bedee
- Fix tensor_split server cli argument by @abetlen in c4c440ba2dc86d9de728a751311fdd1c8e3756fa
- Made all `Llama` init parameters into keyword-only parameters by @abetlen in c8f9b8a734b5b040379bbd93995ba177affab1fe
- Added server params for `low_vram`, `main_gpu`, `lora_base`, and `lora_path` by @abetlen in 2920c4bf7ee1412d6bba7846e0e1b7ef6d34043b
- Removed server params for `rms_norm_eps` and `n_gqa` by @abetlen in 2920c4bf7ee1412d6bba7846e0e1b7ef6d34043b
- Fix boolean cli options by @abetlen in c999325e8e4507f6c6249dd2fb8de7f8bf57f71e and 0449d29b9f940e437231a07b9d56550226558bac
- Silence Pydantic Settings warnings about `model_alias` setting by @earonesty in #705

## [0.2.3]

- Update llama.cpp to ggerganov/llama.cpp@71ca2fad7d6c0ef95ef9944fb3a1a843e481f314
- Add X-Request-ID request header for mirroring custom IDs by @devrimcavusoglu in #703
- Add pyproject extra for scikit-build-core to ensure compatible pathspec version by @abetlen in 6cfc54284b99ef1bff8193e2d5e483dbd89ada02
- Fix issue with Literal and Optional cli arguments not working by @abetlen in #702

## [0.2.2]

- Fix bug in pip install of v0.2.1 due to scikit-build-core removing all `.metal` files in the source distribution (see #701)

## [0.2.1]

- Fix bug in pip install of v0.2.0 due to .git folder being included in the source distribution (see #701)

## [0.2.0]

- Migrated to scikit-build-core build system by @abetlen in #499
- Use `numpy` views for `LogitsProcessor` and `StoppingCriteria` instead of python lists by @abetlen in #499
- Drop support for end-of-life Python3.7 by @abetlen in #499
- Convert low level `llama.cpp` constants to use basic python types instead of `ctypes` types by @abetlen in #499

## [0.1.85]

- Add `llama_cpp.__version__` attribute by @janvdp in #684
- Fix low level api examples by @jbochi in #680

## [0.1.84]

- Update llama.cpp

## [0.1.83]

- Update llama.cpp

## [0.1.82]

- Update llama.cpp

## [0.1.81]

- Update llama.cpp

## [0.1.80]

- Update llama.cpp

## [0.1.79]

- GGUF Support (breaking change requiring new model format)

## [0.1.78]

- Grammar based sampling via LlamaGrammar which can be passed to completions
- Make n_gpu_layers == -1 offload all layers

## [0.1.77]

- (llama.cpp) Update llama.cpp add support for LLaMa 2 70B
- (server) Add temporary n_gqa and rms_norm_eps parameters required for LLaMa 2 70B

## [0.1.76]

- (llama.cpp) Update llama.cpp add support for LLaMa 2 70B

## [0.1.75]

- Update llama.cpp

## [0.1.74]

- (server) OpenAI style error responses

## [0.1.73]

- (server) Add rope parameters to server settings

## [0.1.72]

- (llama.cpp) Update llama.cpp added custom_rope for extended context lengths

## [0.1.71]

- (llama.cpp) Update llama.cpp

- (server) Fix several pydantic v2 migration bugs

## [0.1.70]

- (Llama.create_completion) Revert change so that `max_tokens` is not truncated to `context_size` in `create_completion`
- (server) Fixed changed settings field names from pydantic v2 migration

## [0.1.69]

- (server) Streaming requests can now be interrupted prematurely when a concurrent request is made. This can be controlled with the `interrupt_requests` setting.
- (server) Moved to fastapi v0.100.0 and pydantic v2
- (docker) Added a new "simple" image that builds llama.cpp from source when started.
- (server) performance improvements by avoiding unnecessary memory allocations during sampling

## [0.1.68]

- (llama.cpp) Update llama.cpp

## [0.1.67]

- Fix performance bug in Llama model by pre-allocating memory tokens and logits.
- Fix bug in Llama model where the model was not free'd after use.

## [0.1.66]

- (llama.cpp) New model API

- Performance issue during eval caused by looped np.concatenate call
- State pickling issue when saving cache to disk

## [0.1.65]

- (llama.cpp) Fix struct misalignment bug

## [0.1.64]

- (llama.cpp) Update llama.cpp
- Fix docs for seed. Set -1 for random.

## [0.1.63]

- (llama.cpp) Add full gpu utilisation in CUDA
- (llama.cpp) Add get_vocab
- (llama.cpp) Add low_vram parameter
- (server) Add logit_bias parameter

## [0.1.62]

- Metal support working
- Cache re-enabled

## [0.1.61]

- Fix broken pip installation

## [0.1.60]

NOTE: This release was deleted due to a bug with the packaging system that caused pip installations to fail.

- Truncate max_tokens in create_completion so that the number of requested tokens doesn't exceed the context size.
- Temporarily disable cache for completion requests

## [v0.1.59]

- (llama.cpp) k-quants support
- (server) mirostat sampling parameters to server
- Support both `.so` and `.dylib` for `libllama` on MacOS

## [v0.1.58]

- (llama.cpp) Metal Silicon support

## [v0.1.57]

- (llama.cpp) OpenLlama 3B support

## [v0.1.56]

- (misc) Added first version of the changelog
- (server) Use async routes
- (python-api) Use numpy for internal buffers to reduce memory usage and improve performance.
- (python-api) Performance bug in stop sequence check slowing down streaming.


================================================
FILE: CMakeLists.txt
================================================
cmake_minimum_required(VERSION 3.21)

project(llama_cpp)

option(LLAMA_BUILD "Build llama.cpp shared library and install alongside python package" ON)
option(LLAVA_BUILD "Build llava shared library and install alongside python package" ON)

function(llama_cpp_python_install_target target)
    if(NOT TARGET ${target})
        return()
    endif()

    install(
        TARGETS ${target}
        LIBRARY DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        RUNTIME DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        ARCHIVE DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        FRAMEWORK DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        RESOURCE DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
    )
    install(
        TARGETS ${target}
        LIBRARY DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        RUNTIME DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        ARCHIVE DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        FRAMEWORK DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        RESOURCE DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
    )
    set_target_properties(${target} PROPERTIES
        INSTALL_RPATH "$ORIGIN"
        BUILD_WITH_INSTALL_RPATH TRUE
    )
    if(UNIX)
        if(APPLE)
            set_target_properties(${target} PROPERTIES
                INSTALL_RPATH "@loader_path"
                BUILD_WITH_INSTALL_RPATH TRUE
            )
        else()
            set_target_properties(${target} PROPERTIES
                INSTALL_RPATH "$ORIGIN"
                BUILD_WITH_INSTALL_RPATH TRUE
            )
        endif()
    endif()
endfunction()

if (LLAMA_BUILD)
    set(BUILD_SHARED_LIBS "On")

    set(CMAKE_SKIP_BUILD_RPATH FALSE)

    # When building, don't use the install RPATH already
    # (but later on when installing)
    set(CMAKE_BUILD_WITH_INSTALL_RPATH FALSE)
 
    # Add the automatically determined parts of the RPATH
    # which point to directories outside the build tree to the install RPATH
    set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)
    set(CMAKE_SKIP_RPATH FALSE)

    # Enable building of the common library
    set(LLAMA_BUILD_COMMON ON CACHE BOOL "Build llama.cpp common library" FORCE)

    # Disable building curl support
    set(LLAMA_CURL OFF CACHE BOOL "llama.cpp: enable curl" FORCE)

    # Architecture detection and settings for Apple platforms
    if (APPLE)
        # Get the target architecture
        execute_process(
            COMMAND uname -m
            OUTPUT_VARIABLE HOST_ARCH
            OUTPUT_STRIP_TRAILING_WHITESPACE
        )

        # If CMAKE_OSX_ARCHITECTURES is not set, use the host architecture
        if(NOT CMAKE_OSX_ARCHITECTURES)
            set(CMAKE_OSX_ARCHITECTURES ${HOST_ARCH} CACHE STRING "Build architecture for macOS" FORCE)
        endif()

        message(STATUS "Host architecture: ${HOST_ARCH}")
        message(STATUS "Target architecture: ${CMAKE_OSX_ARCHITECTURES}")

        # Configure based on target architecture
        if(CMAKE_OSX_ARCHITECTURES STREQUAL "x86_64")
            # Intel Mac settings
            set(GGML_AVX "OFF" CACHE BOOL "ggml: enable AVX" FORCE)
            set(GGML_AVX2 "OFF" CACHE BOOL "ggml: enable AVX2" FORCE)
            set(GGML_FMA "OFF" CACHE BOOL "ggml: enable FMA" FORCE)
            set(GGML_F16C "OFF" CACHE BOOL "ggml: enable F16C" FORCE)
        endif()

        # Metal settings (enable for both architectures)
        set(GGML_METAL "ON" CACHE BOOL "ggml: enable Metal" FORCE)
        set(GGML_METAL_EMBED_LIBRARY "ON" CACHE BOOL "ggml: embed metal library" FORCE)
    endif()


    add_subdirectory(vendor/llama.cpp)

    if (WIN32)
        if (TARGET llama)
            set_target_properties(llama PROPERTIES WINDOWS_EXPORT_ALL_SYMBOLS ON)
        endif()
    endif()

    llama_cpp_python_install_target(llama)
    llama_cpp_python_install_target(ggml)

    llama_cpp_python_install_target(ggml-base)

    llama_cpp_python_install_target(ggml-amx)
    llama_cpp_python_install_target(ggml-blas)
    llama_cpp_python_install_target(ggml-cann)
    llama_cpp_python_install_target(ggml-cpu)
    llama_cpp_python_install_target(ggml-cuda)
    llama_cpp_python_install_target(ggml-hip)
    llama_cpp_python_install_target(ggml-kompute)
    llama_cpp_python_install_target(ggml-metal)
    llama_cpp_python_install_target(ggml-musa)
    llama_cpp_python_install_target(ggml-rpc)
    llama_cpp_python_install_target(ggml-sycl)
    llama_cpp_python_install_target(ggml-vulkan)

    # Workaround for Windows + CUDA https://github.com/abetlen/llama-cpp-python/issues/563
    if (WIN32)
        install(
            FILES $<TARGET_RUNTIME_DLLS:llama>
            DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        )
        install(
            FILES $<TARGET_RUNTIME_DLLS:llama>
            DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        )
        install(
            FILES $<TARGET_RUNTIME_DLLS:ggml>
            DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
        )
        install(
            FILES $<TARGET_RUNTIME_DLLS:ggml>
            DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
        )
    endif()

    if (LLAVA_BUILD)
        if (LLAMA_CUBLAS OR LLAMA_CUDA)
            add_compile_definitions(GGML_USE_CUBLAS)
            add_compile_definitions(GGML_USE_CUDA)
        endif()

        if (LLAMA_METAL)
            add_compile_definitions(GGML_USE_METAL)
        endif()

        # Building llava
        add_subdirectory(vendor/llama.cpp/tools/mtmd)

        if (WIN32)
            set_target_properties(mtmd PROPERTIES CUDA_ARCHITECTURES OFF)
        endif()
        llama_cpp_python_install_target(mtmd)
        if (WIN32)
            install(
                FILES $<TARGET_RUNTIME_DLLS:mtmd>
                DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib
            )
            install(
                FILES $<TARGET_RUNTIME_DLLS:mtmd>
                DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib
            )
        endif()

        # Fix for mtmd build: Add include directory for llama.h
        # Move these commands after the add_subdirectory call
        target_include_directories(mtmd PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/vendor/llama.cpp/include)
        target_include_directories(mtmd PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/vendor/llama.cpp/ggml/include)

        if (BUILD_SHARED_LIBS)
            target_include_directories(mtmd PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/vendor/llama.cpp/include)
            target_include_directories(mtmd PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/vendor/llama.cpp/ggml/include)
        endif()

        # target_include_directories(llama-llava-cli PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/vendor/llama.cpp/include)
        # target_include_directories(llama-minicpmv-cli PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/vendor/llama.cpp/include)
    endif()
endif()


================================================
FILE: LICENSE.md
================================================
MIT License

Copyright (c) 2023 Andrei Betlen

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: Makefile
================================================
update:
	poetry install
	git submodule update --init --recursive

update.vendor:
	cd vendor/llama.cpp && git pull origin master

deps:
	python3 -m pip install --upgrade pip
	python3 -m pip install -e ".[all]"

build:
	python3 -m pip install --verbose -e .

build.debug:
	python3 -m pip install \
		--verbose \
		--config-settings=cmake.verbose=true \
		--config-settings=logging.level=INFO \
		--config-settings=install.strip=false  \
		--config-settings=cmake.args="-DCMAKE_BUILD_TYPE=Debug;-DCMAKE_C_FLAGS='-ggdb -O0';-DCMAKE_CXX_FLAGS='-ggdb -O0'" \
		--editable .

build.debug.extra:
	python3 -m pip install \
		--verbose \
		--config-settings=cmake.verbose=true \
		--config-settings=logging.level=INFO \
		--config-settings=install.strip=false  \
		--config-settings=cmake.args="-DCMAKE_BUILD_TYPE=Debug;-DCMAKE_C_FLAGS='-fsanitize=address -ggdb -O0';-DCMAKE_CXX_FLAGS='-fsanitize=address -ggdb -O0'" \
		--editable .

build.cuda:
	CMAKE_ARGS="-DGGML_CUDA=on" python3 -m pip install --verbose -e .

build.openblas:
	CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" python3 -m pip install --verbose -e .

build.blis:
	CMAKE_ARGS="-DGGML_BLAS=on -DGGML_BLAS_VENDOR=FLAME" python3 -m pip install --verbose -e .

build.metal:
	CMAKE_ARGS="-DGGML_METAL=on" python3 -m pip install --verbose -e .

build.vulkan:
	CMAKE_ARGS="-DGGML_VULKAN=on" python3 -m pip install --verbose -e .

build.kompute:
	CMAKE_ARGS="-DGGML_KOMPUTE=on" python3 -m pip install --verbose -e .

build.sycl:
	CMAKE_ARGS="-DGGML_SYCL=on" python3 -m pip install --verbose -e .

build.rpc:
	CMAKE_ARGS="-DGGML_RPC=on" python3 -m pip install --verbose -e .

build.sdist:
	python3 -m build --sdist --verbose

deploy.pypi:
	python3 -m twine upload dist/*

deploy.gh-docs:
	mkdocs build
	mkdocs gh-deploy

test:
	python3 -m pytest --full-trace -v

docker:
	docker build -t llama-cpp-python:latest -f docker/simple/Dockerfile .

run-server:
	python3 -m llama_cpp.server --model ${MODEL}

clean:
	- cd vendor/llama.cpp && make clean
	- cd vendor/llama.cpp && rm libllama.so
	- rm -rf _skbuild
	- rm llama_cpp/lib/*.so
	- rm llama_cpp/lib/*.dylib
	- rm llama_cpp/lib/*.metal
	- rm llama_cpp/lib/*.dll
	- rm llama_cpp/lib/*.lib

.PHONY: \
	update \
	update.vendor \
	build \
	build.cuda \
	build.opencl \
	build.openblas \
	build.sdist \
	deploy.pypi \
	deploy.gh-docs \
	docker \
	clean


================================================
FILE: README.md
================================================
<p align="center">
  <img src="https://raw.githubusercontent.com/abetlen/llama-cpp-python/main/docs/icon.svg" style="height: 5rem; width: 5rem">
</p>

#  Python Bindings for [`llama.cpp`](https://github.com/ggerganov/llama.cpp)

[![Documentation Status](https://readthedocs.org/projects/llama-cpp-python/badge/?version=latest)](https://llama-cpp-python.readthedocs.io/en/latest/?badge=latest)
[![Tests](https://github.com/abetlen/llama-cpp-python/actions/workflows/test.yaml/badge.svg?branch=main)](https://github.com/abetlen/llama-cpp-python/actions/workflows/test.yaml)
[![PyPI](https://img.shields.io/pypi/v/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)
[![PyPI - License](https://img.shields.io/pypi/l/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)
[![PyPI - Downloads](https://static.pepy.tech/badge/llama-cpp-python/month)](https://pepy.tech/projects/llama-cpp-python)
[![Github All Releases](https://img.shields.io/github/downloads/abetlen/llama-cpp-python/total.svg?label=Github%20Downloads)]()

Simple Python bindings for **@ggerganov's** [`llama.cpp`](https://github.com/ggerganov/llama.cpp) library.
This package provides:

- Low-level access to C API via `ctypes` interface.
- High-level Python API for text completion
    - OpenAI-like API
    - [LangChain compatibility](https://python.langchain.com/docs/integrations/llms/llamacpp)
    - [LlamaIndex compatibility](https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html)
- OpenAI compatible web server
    - [Local Copilot replacement](https://llama-cpp-python.readthedocs.io/en/latest/server/#code-completion)
    - [Function Calling support](https://llama-cpp-python.readthedocs.io/en/latest/server/#function-calling)
    - [Vision API support](https://llama-cpp-python.readthedocs.io/en/latest/server/#multimodal-models)
    - [Multiple Models](https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support)

Documentation is available at [https://llama-cpp-python.readthedocs.io/en/latest](https://llama-cpp-python.readthedocs.io/en/latest).

## Installation

Requirements:

  - Python 3.8+
  - C compiler
      - Linux: gcc or clang
      - Windows: Visual Studio or MinGW
      - MacOS: Xcode

To install the package, run:

```bash
pip install llama-cpp-python
```

This will also build `llama.cpp` from source and install it alongside this python package.

If this fails, add `--verbose` to the `pip install` command to see the full cmake build log.

**Pre-built Wheel (New)**

It is also possible to install a pre-built wheel with basic CPU support.

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
```

### Installation Configuration

`llama.cpp` supports a number of hardware acceleration backends to speed up inference as well as backend specific options. See the [llama.cpp README](https://github.com/ggerganov/llama.cpp#build) for a full list.

All `llama.cpp` cmake build options can be set via the `CMAKE_ARGS` environment variable or via the `--config-settings / -C` cli flag during installation.

<details open>
<summary>Environment Variables</summary>

```bash
# Linux and Mac
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" \
  pip install llama-cpp-python
```

```powershell
# Windows
$env:CMAKE_ARGS = "-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
pip install llama-cpp-python
```
</details>

<details>
<summary>CLI / requirements.txt</summary>

They can also be set via `pip install -C / --config-settings` command and saved to a `requirements.txt` file:

```bash
pip install --upgrade pip # ensure pip is up to date
pip install llama-cpp-python \
  -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
```

```txt
# requirements.txt

llama-cpp-python -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
```

</details>
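
The two forms above differ only in how the flags are joined: `CMAKE_ARGS` takes space-separated flags, while `cmake.args` takes them separated by semicolons. As a minimal sketch, the hypothetical helper below (`format_cmake_options` is not part of this package) renders the same options both ways:

```python
def format_cmake_options(options, separator=" "):
    """Render a dict of CMake options as -DKEY=VALUE flags (illustrative helper)."""
    flags = []
    for key, value in options.items():
        # Booleans are written the way the examples above spell them
        if isinstance(value, bool):
            value = "ON" if value else "OFF"
        flags.append(f"-D{key}={value}")
    return separator.join(flags)


options = {"GGML_BLAS": True, "GGML_BLAS_VENDOR": "OpenBLAS"}

# Space-separated form for the CMAKE_ARGS environment variable
env_value = format_cmake_options(options)
# Semicolon-separated form for `pip install -C cmake.args=...`
config_value = format_cmake_options(options, separator=";")

print(f'CMAKE_ARGS="{env_value}"')
print(f'-C cmake.args="{config_value}"')
```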

### Supported Backends

Below are some common backends, their build commands and any additional environment variables required.

<details open>
<summary>OpenBLAS (CPU)</summary>

To install with OpenBLAS, set the `GGML_BLAS` and `GGML_BLAS_VENDOR` cmake options via the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
```
</details>

<details>
<summary>CUDA</summary>

To install with CUDA support, pass `-DGGML_CUDA=on` via the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```

**Pre-built Wheel (New)**

It is also possible to install a pre-built wheel with CUDA support, as long as your system meets the following requirements:

- CUDA Version is 12.1, 12.2, 12.3, 12.4 or 12.5
- Python Version is 3.10, 3.11 or 3.12

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>
```

Where `<cuda-version>` is one of the following:
- `cu121`: CUDA 12.1
- `cu122`: CUDA 12.2
- `cu123`: CUDA 12.3
- `cu124`: CUDA 12.4
- `cu125`: CUDA 12.5

For example, to install the CUDA 12.1 wheel:

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```

</details>

<details>
<summary>Metal</summary>

To install with Metal (MPS) support, pass `-DGGML_METAL=on` via the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

**Pre-built Wheel (New)**

It is also possible to install a pre-built wheel with Metal support, as long as your system meets the following requirements:

- MacOS Version is 11.0 or later
- Python Version is 3.10, 3.11 or 3.12

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
```

</details>

<details>
<summary>hipBLAS (ROCm)</summary>

To install with hipBLAS / ROCm support for AMD cards, pass `-DGGML_HIPBLAS=on` via the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python
```

</details>

<details>
<summary>Vulkan</summary>

To install with Vulkan support, pass `-DGGML_VULKAN=on` via the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
```

</details>

<details>
<summary>SYCL</summary>

To install with SYCL support, pass `-DGGML_SYCL=on` via the `CMAKE_ARGS` environment variable before installing:

```bash
source /opt/intel/oneapi/setvars.sh   
CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
```
</details>

<details>
<summary>RPC</summary>

To install with RPC support, pass `-DGGML_RPC=on` via the `CMAKE_ARGS` environment variable before installing:

```bash
CMAKE_ARGS="-DGGML_RPC=on" pip install llama-cpp-python
```
</details>


### Windows Notes

<details>
<summary>Error: Can't find 'nmake' or 'CMAKE_C_COMPILER'</summary>

If you run into issues where it complains it can't find `'nmake'` `'?'` or CMAKE_C_COMPILER, you can extract w64devkit as [mentioned in llama.cpp repo](https://github.com/ggerganov/llama.cpp#openblas) and add those manually to CMAKE_ARGS before running `pip` install:

```ps
$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DGGML_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"
```

See the above instructions and set `CMAKE_ARGS` to the BLAS backend you want to use.
</details>

### MacOS Notes

Detailed MacOS Metal GPU install documentation is available at [docs/install/macos.md](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/)

<details>
<summary>M1 Mac Performance Issue</summary>

Note: If you are using Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports arm64 architecture. For example:

```bash
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
```

Otherwise, while installing it will build the llama.cpp x86 version which will be 10x slower on Apple Silicon (M1) Mac.
</details>

<details>
<summary>M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`</summary>

Try installing with

```bash
CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DGGML_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
```
</details>

### Upgrading and Reinstalling

To upgrade and rebuild `llama-cpp-python` add `--upgrade --force-reinstall --no-cache-dir` flags to the `pip install` command to ensure the package is rebuilt from source.

## High-level API

[API Reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#high-level-api)

The high-level API provides a simple managed interface through the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.

Below is a short example demonstrating how to use the high-level API for basic text completion:

```python
from llama_cpp import Llama

llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
```

By default `llama-cpp-python` generates completions in an OpenAI compatible format:

```python
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}
```
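
Because the result is a plain dict, the generated text and token counts can be read out with ordinary indexing. The sketch below uses a hand-written dict shaped like the output above; `completion_only` is an illustrative helper, not part of the library. It strips the prompt that `echo=True` prepends to the text.

```python
def completion_only(completion, prompt):
    """Return the generated text without the echoed prompt (illustrative helper)."""
    text = completion["choices"][0]["text"]
    # With echo=True the prompt is echoed at the start of the text
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


prompt = "Q: Name the planets in the solar system? A: "
# Hand-written stand-in for the dict returned by llm(...)
completion = {
    "choices": [
        {
            "text": prompt + "Mercury, Venus, Earth, Mars",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 28, "total_tokens": 42},
}

answer = completion_only(completion, prompt)
finish_reason = completion["choices"][0]["finish_reason"]
total_tokens = completion["usage"]["total_tokens"]
print(answer, finish_reason, total_tokens)
```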

Text completion is available through the [`__call__`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__call__) and [`create_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) methods of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.

### Pulling models from Hugging Face Hub

You can download `Llama` models in `gguf` format directly from Hugging Face using the [`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) method.
You'll need to install the `huggingface-hub` package to use this feature (`pip install huggingface-hub`).

```python
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)
```

By default [`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) will download the model to the huggingface cache directory. You can then manage installed model files with the [`huggingface-cli`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) tool.

### Chat Completion

The high-level API also provides a simple interface for chat completion.

Chat completion requires that the model knows how to format the messages into a single prompt.
The `Llama` class does this using pre-registered chat formats (e.g. `chatml`, `llama-2`, `gemma`, etc) or by providing a custom chat handler object.

The model will format the messages into a single prompt using the following order of precedence:
  - Use the `chat_handler` if provided
  - Use the `chat_format` if provided
  - Use the `tokenizer.chat_template` from the `gguf` model's metadata (should work for most new models, older models may not have this)
  - else, fallback to the `llama-2` chat format

Set `verbose=True` to see the selected chat format.

```python
from llama_cpp import Llama
llm = Llama(
      model_path="path/to/llama-2/llama-model.gguf",
      chat_format="llama-2"
)
llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who perfectly describes images."},
          {
              "role": "user",
              "content": "Describe this image in detail please."
          }
      ]
)
```
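
The order of precedence described above can be sketched as a small pure-Python function. This is only an illustration of the documented rules, not the library's actual code; `select_chat_formatter` and its return shape are made up for this example.

```python
def select_chat_formatter(chat_handler=None, chat_format=None, metadata=None):
    """Return (source, value) for the formatter that would be chosen (illustrative)."""
    metadata = metadata or {}
    # 1. An explicit chat handler always wins
    if chat_handler is not None:
        return ("chat_handler", chat_handler)
    # 2. Then an explicit chat format name
    if chat_format is not None:
        return ("chat_format", chat_format)
    # 3. Then the template stored in the gguf metadata, if present
    template = metadata.get("tokenizer.chat_template")
    if template is not None:
        return ("tokenizer.chat_template", template)
    # 4. Otherwise fall back to llama-2
    return ("chat_format", "llama-2")


print(select_chat_formatter(chat_format="chatml"))
print(select_chat_formatter(metadata={"tokenizer.chat_template": "{{ messages }}"}))
print(select_chat_formatter())
```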

Chat completion is available through the [`create_chat_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) method of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.

For OpenAI API v1 compatibility, you can use the [`create_chat_completion_openai_v1`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion_openai_v1) method, which returns pydantic models instead of dicts.


### JSON and JSON Schema Mode

To constrain chat responses to only valid JSON or a specific JSON Schema use the `response_format` argument in [`create_chat_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion).

#### JSON Mode

The following example will constrain the response to valid JSON strings only.

```python
from llama_cpp import Llama
llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)
```

#### JSON Schema Mode

To constrain the response further to a specific JSON Schema add the schema to the `schema` property of the `response_format` argument.

```python
from llama_cpp import Llama
llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"team_name": {"type": "string"}},
            "required": ["team_name"],
        },
    },
    temperature=0.7,
)
```
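
The constrained output still arrives as a JSON string in the message content, so it has to be decoded before use. The sketch below assumes the OpenAI-style chat response layout (`choices[0]["message"]["content"]`) and uses a hand-written stand-in response; `parse_team_name` is an illustrative helper.

```python
import json


def parse_team_name(response):
    """Decode the JSON content of a chat response and return `team_name`."""
    content = response["choices"][0]["message"]["content"]
    data = json.loads(content)
    # Re-check the one required property from the schema above
    if not isinstance(data, dict) or not isinstance(data.get("team_name"), str):
        raise ValueError(f"unexpected response content: {content!r}")
    return data["team_name"]


# Hand-written stand-in for the dict returned by create_chat_completion
response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": '{"team_name": "Example Team"}'},
            "finish_reason": "stop",
        }
    ]
}

team_name = parse_team_name(response)
print(team_name)
```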

### Function Calling

The high-level API supports OpenAI compatible function and tool calling. This is possible through the `functionary` pre-trained models chat format or through the generic `chatml-function-calling` chat format.

```python
from llama_cpp import Llama
llm = Llama(model_path="path/to/chatml/llama-model.gguf", chat_format="chatml-function-calling")
llm.create_chat_completion(
      messages = [
        {
          "role": "system",
          "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary"

        },
        {
          "role": "user",
          "content": "Extract Jason is 25 years old"
        }
      ],
      tools=[{
        "type": "function",
        "function": {
          "name": "UserDetail",
          "parameters": {
            "type": "object",
            "title": "UserDetail",
            "properties": {
              "name": {
                "title": "Name",
                "type": "string"
              },
              "age": {
                "title": "Age",
                "type": "integer"
              }
            },
            "required": [ "name", "age" ]
          }
        }
      }],
      tool_choice={
        "type": "function",
        "function": {
          "name": "UserDetail"
        }
      }
)
```
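
In the OpenAI tool-calling format, each call's arguments come back as a JSON-encoded string that needs decoding. The sketch below assumes that response layout and uses a hand-written stand-in response (the `id` value is made up); `extract_tool_calls` is an illustrative helper, not a library function.

```python
import json


def extract_tool_calls(response):
    """Return a list of (function_name, arguments_dict) pairs (illustrative)."""
    message = response["choices"][0]["message"]
    calls = []
    # `tool_calls` may be missing or None when the model answers in plain text
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hand-written stand-in for the dict returned by create_chat_completion
response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_example",
                        "type": "function",
                        "function": {
                            "name": "UserDetail",
                            "arguments": '{"name": "Jason", "age": 25}',
                        },
                    }
                ],
            }
        }
    ]
}

tool_calls = extract_tool_calls(response)
print(tool_calls)
```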

<details>
<summary>Functionary v2</summary>

The various gguf-converted files for this set of models can be found [here](https://huggingface.co/meetkai). Functionary is able to intelligently call functions and also analyze any provided function outputs to generate coherent responses. All v2 models of functionary support **parallel function calling**. You can provide either `functionary-v1` or `functionary-v2` for the `chat_format` when initializing the Llama class.

Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is required to provide HF Tokenizer for functionary. The `LlamaHFTokenizer` class can be initialized and passed into the Llama class. This will override the default llama.cpp tokenizer used in Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files.

```python
from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer
llm = Llama.from_pretrained(
  repo_id="meetkai/functionary-small-v2.2-GGUF",
  filename="functionary-small-v2.2.q4_0.gguf",
  chat_format="functionary-v2",
  tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-small-v2.2-GGUF")
)
```

**NOTE**: There is no need to provide the default system messages used in Functionary as they are added automatically in the Functionary chat handler. Thus, the messages should contain just the chat messages and/or system messages that provide additional context for the model (e.g.: datetime, etc.).
</details>

### Multi-modal Models

`llama-cpp-python` supports multi-modal models such as llava1.5, which allow the language model to read information from both text and images.

Below are the supported multi-modal models and their respective chat handlers (Python API) and chat formats (Server API).

| Model | `LlamaChatHandler` | `chat_format` |
|:--- |:--- |:--- |
| [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b) | `Llava15ChatHandler` | `llava-1-5` |
| [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b) | `Llava15ChatHandler` | `llava-1-5` |
| [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf) | `Llava16ChatHandler` | `llava-1-6` |
| [moondream2](https://huggingface.co/vikhyatk/moondream2) | `MoondreamChatHandler` | `moondream2` |
| [nanollava](https://huggingface.co/abetlen/nanollava-gguf) | `NanollavaChatHandler` | `nanollava` |
| [llama-3-vision-alpha](https://huggingface.co/abetlen/llama-3-vision-alpha-gguf) | `Llama3VisionAlphaChatHandler` | `llama-3-vision-alpha` |
| [minicpm-v-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) | `MiniCPMv26ChatHandler` | `minicpm-v-2.6` |
| [qwen2.5-vl](https://huggingface.co/unsloth/Qwen2.5-VL-3B-Instruct-GGUF) | `Qwen25VLChatHandler` | `qwen2.5-vl` |

Then you'll need to use a custom chat handler to load the clip model and process the chat messages and images.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler
chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
llm = Llama(
  model_path="./path/to/llava/llama-model.gguf",
  chat_handler=chat_handler,
  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)
llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }
            ]
        }
    ]
)
```

You can also pull the model from the Hugging Face Hub using the `from_pretrained` method.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

chat_handler = MoondreamChatHandler.from_pretrained(
  repo_id="vikhyatk/moondream2",
  filename="*mmproj*",
)

llm = Llama.from_pretrained(
  repo_id="vikhyatk/moondream2",
  filename="*text-model*",
  chat_handler=chat_handler,
  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)

response = llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }

            ]
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```

**Note**: Multi-modal models also support tool calling and JSON mode.

<details>
<summary>Loading a Local Image</summary>

Images can be passed as base64 encoded data URIs. The following example demonstrates how to do this.

```python
import base64

def image_to_base64_data_uri(file_path):
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode('utf-8')
        return f"data:image/png;base64,{base64_data}"

# Replace 'file_path.png' with the actual path to your PNG file
file_path = 'file_path.png'
data_uri = image_to_base64_data_uri(file_path)

messages = [
    {"role": "system", "content": "You are an assistant who perfectly describes images."},
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_uri }},
            {"type" : "text", "text": "Describe this image in detail please."}
        ]
    }
]

```
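
The helper above hardcodes `image/png`. For other image types, one possible variation is to guess the MIME type from the file extension with the standard-library `mimetypes` module. `file_to_data_uri` and its `default_mime` fallback are assumptions of this sketch.

```python
import base64
import mimetypes


def file_to_data_uri(file_path, default_mime="application/octet-stream"):
    """Encode a file as a base64 data URI, guessing the MIME type from its name."""
    # guess_type looks only at the file extension, not the file contents
    mime_type, _ = mimetypes.guess_type(file_path)
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode("utf-8")
    return f"data:{mime_type or default_mime};base64,{base64_data}"


# Usage: data_uri = file_to_data_uri("photo.jpg")
```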

</details>

### Speculative Decoding

`llama-cpp-python` supports speculative decoding which allows the model to generate completions based on a draft model.

The fastest way to use speculative decoding is through the `LlamaPromptLookupDecoding` class.

Just pass this as a draft model to the `Llama` class during initialization.

```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of tokens to predict.
    # 10 is the default and generally good for GPU; 2 performs better for CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10)
)
```
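
As background, prompt lookup decoding drafts tokens without a second model: it looks for an earlier occurrence of the most recent tokens and proposes whatever followed them. The function below is a simplified pure-Python sketch of that idea under assumed parameter names. It is not the implementation used by `LlamaPromptLookupDecoding`.

```python
def propose_draft_tokens(tokens, max_ngram_size=2, num_pred_tokens=10):
    """Propose draft tokens by finding the trailing n-gram earlier in `tokens`."""
    # Try the longest n-gram first, then fall back to shorter ones
    for ngram_size in range(min(max_ngram_size, len(tokens) - 1), 0, -1):
        ngram = tokens[-ngram_size:]
        # Scan backwards over earlier positions, skipping the trailing n-gram itself
        for start in range(len(tokens) - ngram_size - 1, -1, -1):
            if tokens[start:start + ngram_size] == ngram:
                end = start + ngram_size
                return tokens[end:end + num_pred_tokens]
    # No earlier occurrence found: nothing to draft
    return []


history = [1, 2, 3, 4, 5, 1, 2]
draft = propose_draft_tokens(history, max_ngram_size=2, num_pred_tokens=3)
print(draft)
```

The main model then checks the drafted tokens and keeps only those it agrees with.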

### Embeddings

To generate text embeddings use [`create_embedding`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_embedding) or [`embed`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.embed). Note that you must pass `embedding=True` to the constructor upon model creation for these to work properly.

```python
import llama_cpp

llm = llama_cpp.Llama(model_path="path/to/model.gguf", embedding=True)

embeddings = llm.create_embedding("Hello, world!")

# or create multiple embeddings at once

embeddings = llm.create_embedding(["Hello, world!", "Goodbye, world!"])
```

There are two primary notions of embeddings in a Transformer-style model: *token level* and *sequence level*. Sequence level embeddings are produced by "pooling" token level embeddings together, usually by averaging them or using the first token.

Models that are explicitly geared towards embeddings will usually return sequence level embeddings by default, one for each input string. Non-embedding models such as those designed for text generation will typically return only token level embeddings, one for each token in each sequence. Thus the dimensionality of the return type will be one higher for token level embeddings.

It is possible to control pooling behavior in some cases using the `pooling_type` flag on model creation. You can ensure token level embeddings from any model using `LLAMA_POOLING_TYPE_NONE`. The reverse, getting a generation oriented model to yield sequence level embeddings is currently not possible, but you can always do the pooling manually.
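
As a minimal sketch of manual pooling, the helper below averages token level embeddings into one sequence level vector. It assumes the token level result for a single input is a list of equal-length float lists; `mean_pool` is an illustrative name, not a library function.

```python
def mean_pool(token_embeddings):
    """Average a list of token-level vectors into one sequence-level vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vector in token_embeddings:
        for i, value in enumerate(vector):
            sums[i] += value
    return [total / len(token_embeddings) for total in sums]


# Two tokens with 2-dimensional embeddings (made-up numbers)
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
print(pooled)
```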

### Adjusting the Context Window

The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but can be adjusted based on your requirements.

For instance, if you want to work with larger contexts, you can expand the context window by setting the n_ctx parameter when initializing the Llama object:

```python
llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
```

## OpenAI Compatible Web Server

`llama-cpp-python` offers a web server which aims to act as a drop-in replacement for the OpenAI API.
This allows you to use llama.cpp compatible models with any OpenAI compatible client (language libraries, services, etc).

To install the server package and get started:

```bash
pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf
```

As described in the Supported Backends section above, you can also install with GPU (CUDA) support like this:

```bash
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35
```

Navigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the OpenAPI documentation.

To bind to `0.0.0.0` to enable remote connections, use `python3 -m llama_cpp.server --host 0.0.0.0`.
Similarly, to change the port (default is 8000), use `--port`.

You probably also want to set the prompt format. For chatml, use

```bash
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --chat_format chatml
```

That will format the prompt according to how the model expects it. You can find the prompt format in the model card.
For possible options, see [llama_cpp/llama_chat_format.py](llama_cpp/llama_chat_format.py) and look for lines starting with "@register_chat_format".

If you have `huggingface-hub` installed, you can also use the `--hf_model_repo_id` flag to load a model from the Hugging Face Hub.

```bash
python3 -m llama_cpp.server --hf_model_repo_id Qwen/Qwen2-0.5B-Instruct-GGUF --model '*q8_0.gguf'
```
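
Once the server is running, any HTTP client can talk to it. The sketch below builds a request for the OpenAI-style `/v1/chat/completions` endpoint using only the standard library. It assumes the default host and port, and `build_chat_request` is an illustrative helper. Sending is left commented out because it needs a live server.

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://localhost:8000", **params):
    """Build a POST request for the chat completions endpoint (illustrative)."""
    payload = {"messages": messages, **params}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    [{"role": "user", "content": "Hello!"}],
    max_tokens=32,
)

# To send the request (requires a running server):
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```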

### Web Server Features

- [Local Copilot replacement](https://llama-cpp-python.readthedocs.io/en/latest/server/#code-completion)
- [Function Calling support](https://llama-cpp-python.readthedocs.io/en/latest/server/#function-calling)
- [Vision API support](https://llama-cpp-python.readthedocs.io/en/latest/server/#multimodal-models)
- [Multiple Models](https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support)

## Docker image

A Docker image is available on [GHCR](https://ghcr.io/abetlen/llama-cpp-python). To run the server:

```bash
docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest
```

[Docker on termux (requires root)](https://gist.github.com/FreddieOliveira/efe850df7ff3951cb62d74bd770dce27) is currently the only known way to run this on phones, see [termux support issue](https://github.com/abetlen/llama-cpp-python/issues/389)

## Low-level API

[API Reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#low-level-api)

The low-level API is a direct [`ctypes`](https://docs.python.org/3/library/ctypes.html) binding to the C API provided by `llama.cpp`.
The entire low-level API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and directly mirrors the C API in [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).

Below is a short example demonstrating how to use the low-level API to tokenize a prompt:

```python
import llama_cpp
import ctypes
llama_cpp.llama_backend_init(False) # Must be called once at the start of each program
params = llama_cpp.llama_context_default_params()
# use bytes for char * params
model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
ctx = llama_cpp.llama_new_context_with_model(model, params)
max_tokens = params.n_ctx
# use ctypes arrays for array params
tokens = (llama_cpp.llama_token * int(max_tokens))()
n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, llama_cpp.c_bool(True))
llama_cpp.llama_free(ctx)
```

Check out the [examples folder](examples/low_level_api) for more examples of using the low-level API.

## Documentation

Documentation is available via [https://llama-cpp-python.readthedocs.io/](https://llama-cpp-python.readthedocs.io/).
If you find any issues with the documentation, please open an issue or submit a PR.

## Development

This package is under active development and I welcome any contributions.

To get started, clone the repository and install the package in editable / development mode:

```bash
git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python

# Upgrade pip (required for editable mode)
pip install --upgrade pip

# Install with pip
pip install -e .

# if you want to use the fastapi / openapi server
pip install -e '.[server]'

# to install all optional dependencies
pip install -e '.[all]'

# to clear the local build cache
make clean
```

Now try running the tests

```bash
pytest
```

There's a `Makefile` available with useful targets.
A typical workflow would look like this:

```bash
make build
make test
```

You can also test out specific commits of `llama.cpp` by checking out the desired commit in the `vendor/llama.cpp` submodule and then running `make clean` and `pip install -e .` again. Any changes in the `llama.h` API will require
changes to the `llama_cpp/llama_cpp.py` file to match the new API (additional changes may be required elsewhere).

## FAQ

### Are there pre-built binaries / binary wheels available?

The recommended installation method is to install from source as described above.
The reason for this is that `llama.cpp` is built with compiler optimizations that are specific to your system.
Using pre-built binaries would require disabling these optimizations or supporting a large number of pre-built binaries for each platform.

That being said there are some pre-built binaries available through the Releases as well as some community provided wheels.

In the future, I would like to provide pre-built binaries and wheels for common platforms and I'm happy to accept any useful contributions in this area.
This is currently being tracked in [#741](https://github.com/abetlen/llama-cpp-python/issues/741)

### How does this compare to other Python bindings of `llama.cpp`?

I originally wrote this package for my own use with two goals in mind:

- Provide a simple process to install `llama.cpp` and access the full C API in `llama.h` from Python
- Provide a high-level Python API that can be used as a drop-in replacement for the OpenAI API so existing apps can be easily ported to use `llama.cpp`

Any contributions and changes to this package will be made with these goals in mind.

## License

This project is licensed under the terms of the MIT license.


================================================
FILE: docker/README.md
================================================
### Install Docker Server
> [!IMPORTANT]  
> This was tested with Docker running on Linux. <br>If you can get it working on Windows or MacOS, please update this `README.md` with a PR!<br>

[Install Docker Engine](https://docs.docker.com/engine/install)


## Simple Dockerfiles for building the llama-cpp-python server with external model bin files
### openblas_simple
A simple Dockerfile for non-GPU OpenBLAS, where the model is located outside the Docker image:
```
cd ./openblas_simple
docker build -t openblas_simple .
docker run --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t openblas_simple
```
where `<model-root-path>/<model-path>` is the full path to the model file on the Docker host system.

### cuda_simple
> [!WARNING]  
> Nvidia GPU CuBLAS support requires an Nvidia GPU with sufficient VRAM (approximately as much as the size in the table below) and Docker Nvidia support (see [container-toolkit/install-guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)) <br>

A simple Dockerfile for CUDA-accelerated CuBLAS, where the model is located outside the Docker image:

```
cd ./cuda_simple
docker build -t cuda_simple .
docker run --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t cuda_simple
```
where `<model-root-path>/<model-path>` is the full path to the model file on the Docker host system.

--------------------------------------------------------------------------

### "Open-Llama-in-a-box"
Download an Apache V2.0 licensed 3B params Open LLaMA model and install into a Docker image that runs an OpenBLAS-enabled llama-cpp-python server:
```
cd ./open_llama
./build.sh
./start.sh
```
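
The `start.sh` script finishes by posting a test prompt to `/v1/completions` on port 8000 and checking the reply for "Paris". The same check can be sketched in Python with only the standard library. The function names are hypothetical, and the default URL assumes the port mapping used by `start.sh`:

```python
import json
import urllib.request


def build_completion_request(base_url="http://localhost:8000"):
    """Build (but do not send) the POST request that start.sh issues with curl."""
    payload = {
        "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
        "stop": ["\n", "###"],
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def server_mentions_paris(base_url="http://localhost:8000", timeout=60.0):
    """Send the request and report whether the raw response body contains 'Paris'."""
    request = build_completion_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return "Paris" in body
```

Calling `server_mentions_paris()` needs the container to be running; if nothing is listening on the port, `urllib` raises `urllib.error.URLError`.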

### Manually choose your own Llama model from Hugging Face
`python3 ./hug_model.py -a TheBloke -t llama`

You should now have a model in the current directory, with `model.bin` symlinked to it for the subsequent Docker build and copy step, e.g.:
```
docker $ ls -lh *.bin
-rw-rw-r-- 1 user user 4.8G May 23 18:30 <downloaded-model-file>q5_1.bin
lrwxrwxrwx 1 user user   24 May 23 18:30 model.bin -> <downloaded-model-file>q5_1.bin
```

> [!NOTE]  
> Make sure you have enough disk space to download the model. As the model is then copied into the image, you will need at least
> **TWICE** as much disk space as the size of the model:

| Model |  Quantized size |
|------:|----------------:|
|    3B |            3 GB |
|    7B |            5 GB |
|   13B |           10 GB |
|   33B |           25 GB |
|   65B |           50 GB |
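
As a rough pre-flight check, the "twice the model size" rule can be sketched in Python. This is a minimal sketch, not part of the build scripts: the helper names are hypothetical, the sizes are the approximate values from the table above, and 1 GB is taken as 10^9 bytes.

```python
import shutil

GB = 10 ** 9  # assumption: the table's "GB" means 10^9 bytes

# Approximate quantized sizes, copied from the table above.
QUANTIZED_SIZE_GB = {"3B": 3, "7B": 5, "13B": 10, "33B": 25, "65B": 50}


def required_bytes(model_size_bytes, copies=2):
    """Space needed for the downloaded file plus the copy inside the image."""
    return model_size_bytes * copies


def has_enough_space(model_size_bytes, free_bytes):
    """True if the free space covers both copies of the model."""
    return free_bytes >= required_bytes(model_size_bytes)


# Report which model sizes would fit on the disk holding the current directory.
free = shutil.disk_usage(".").free
for name, size_gb in QUANTIZED_SIZE_GB.items():
    verdict = "ok" if has_enough_space(size_gb * GB, free) else "not enough space"
    print(f"{name}: needs about {2 * size_gb} GB free, {verdict}")
```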


> [!NOTE]  
> If you want to pass or tune additional parameters, customise `./start_server.sh` before running `docker build ...`


================================================
FILE: docker/cuda_simple/Dockerfile
================================================
ARG CUDA_IMAGE="12.5.0-devel-ubuntu22.04"
FROM nvidia/cuda:${CUDA_IMAGE}

# We need to set the host to 0.0.0.0 to allow outside access
ENV HOST 0.0.0.0

RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y git build-essential \
    python3 python3-pip gcc wget \
    ocl-icd-opencl-dev opencl-headers clinfo \
    libclblast-dev libopenblas-dev \
    && mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

COPY . .

# setting build related env vars
ENV CUDA_DOCKER_ARCH=all
ENV GGML_CUDA=1

# Install dependencies
RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context

# Install llama-cpp-python (build with cuda)
RUN CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# Run the server
CMD python3 -m llama_cpp.server


================================================
FILE: docker/open_llama/Dockerfile
================================================
# Define the image argument and provide a default value
ARG IMAGE=python:3-slim-bookworm

# Use the image as specified
FROM ${IMAGE}

# Re-declare the ARG after FROM
ARG IMAGE

# Update and upgrade the existing packages 
RUN apt-get update && apt-get upgrade -y && apt-get install -y --no-install-recommends \
    python3 \
    python3-pip \
    ninja-build \
    build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context

# Perform the conditional installations based on the image
RUN echo "Image: ${IMAGE}" && \
    if [ "${IMAGE}" = "python:3-slim-bookworm" ] ; then \
    echo "OpenBLAS install:" && \
    apt-get update && apt-get install -y --no-install-recommends libopenblas-dev && \
    CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python --verbose; \
else \
    echo "CuBLAS install:" && \
    CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --verbose; \
fi

# Clean up apt cache
RUN rm -rf /var/lib/apt/lists/*

# Set a working directory for better clarity
WORKDIR /app

# Copy files to the app directory
RUN echo "Installing model...this can take some time..."
COPY ./model.bin /app/model.bin
COPY ./start_server.sh /app/start_server.sh

# Make the server start script executable
RUN chmod +x /app/start_server.sh

# Set environment variable for the host
ENV HOST=0.0.0.0

# Expose a port for the server
EXPOSE 8000

# Run the server start script
CMD ["/bin/sh", "/app/start_server.sh"]


================================================
FILE: docker/open_llama/build.sh
================================================
#!/bin/sh

MODEL="open_llama_3b"
# Get  open_llama_3b_ggml q5_1 quantization
python3 ./hug_model.py -a SlyEcho -s ${MODEL} -f "q5_1"
ls -lh *.bin

# Build the default OpenBLAS image
docker build -t $MODEL .
docker images | egrep "^(REPOSITORY|$MODEL)"

echo
echo "To start the docker container run:"
echo "docker run -t -p 8000:8000 $MODEL"


================================================
FILE: docker/open_llama/hug_model.py
================================================
import requests
import json
import os
import struct
import argparse

def make_request(url, params=None):
    print(f"Making request to {url}...")
    response = requests.get(url, params=params)
    if response.status_code == 200:
        return json.loads(response.text)
    else:
        print(f"Request failed with status code {response.status_code}")
        return None

def check_magic_and_version(filename):
    with open(filename, 'rb') as f:
        # Read the first 6 bytes from the file
        data = f.read(6)

    # Unpack the binary data, interpreting the first 4 bytes as a little-endian unsigned int
    # and the next 2 bytes as a little-endian unsigned short
    magic, version = struct.unpack('<I H', data)

    print(f"magic: 0x{magic:08x}, version: 0x{version:04x}, file: {filename}")

    return magic, version

def download_file(url, destination):
    print(f"Downloading {url} to {destination}...")
    response = requests.get(url, stream=True)
    if response.status_code == 200:
        with open(destination, 'wb') as f:
            total_downloaded = 0
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
                    total_downloaded += len(chunk)
                    if total_downloaded >= 10485760:  # 10 MB
                        print('.', end='', flush=True)
                        total_downloaded = 0
        print("\nDownload complete.")
        
        # Point "model.bin" at the downloaded file, replacing any existing link
        # (lexists also catches a dangling symlink, which isfile would miss)
        if os.path.lexists("model.bin"):
            os.remove("model.bin")
        os.symlink(destination, "model.bin")
    else:
        print(f"Download failed with status code {response.status_code}")

def get_user_choice(model_list):
    # Print the enumerated list
    print("\n")
    for i, (model_id, rfilename) in enumerate(model_list):
        print(f"{i+1}: Model ID: {model_id}, RFilename: {rfilename}")

    # Get user's choice
    choice = input("Choose a model to download by entering the corresponding number: ")
    try:
        index = int(choice) - 1
        if 0 <= index < len(model_list):
            # Return the chosen model
            return model_list[index]
        else:
            print("Invalid choice.")
    except ValueError:
        print("Invalid input. Please enter a number corresponding to a model.")
    except IndexError:
        print("Invalid choice. Index out of range.")
    
    return None

def main():
    # Create an argument parser
    parser = argparse.ArgumentParser(description='Process some parameters.')

    # Arguments
    # int(s, 0) accepts both hexadecimal ("0x0003") and decimal ("3") input
    parser.add_argument('-v', '--version', type=lambda s: int(s, 0), default=0x0003,
                        help='hexadecimal version number of ggml file')
    parser.add_argument('-a', '--author', type=str, default='TheBloke',
                        help='HuggingFace author filter')
    parser.add_argument('-t', '--tag', type=str, default='llama',
                        help='HuggingFace tag filter')
    parser.add_argument('-s', '--search', type=str, default='',
                        help='HuggingFace search filter')
    parser.add_argument('-f', '--filename', type=str, default='q5_1',
                        help='HuggingFace model repository filename substring match')

    # Parse the arguments
    args = parser.parse_args()

    # Define the parameters
    params = {
        "author": args.author,
        "tags": args.tag,
        "search": args.search
    }

    models = make_request('https://huggingface.co/api/models', params=params)
    if models is None:
        return

    model_list = []
    # Iterate over the models
    for model in models:
        model_id = model['id']
        model_info = make_request(f'https://huggingface.co/api/models/{model_id}')
        if model_info is None:
            continue

        for sibling in model_info.get('siblings', []):
            rfilename = sibling.get('rfilename')
            if rfilename and args.filename in rfilename:
                model_list.append((model_id, rfilename))

    # Choose the model
    model_list.sort(key=lambda x: x[0])
    if len(model_list) == 0:
        print("No models found")
        exit(1)
    elif len(model_list) == 1:
        model_choice = model_list[0]
    else:
        model_choice = get_user_choice(model_list)

    if model_choice is not None:
        model_id, rfilename = model_choice
        url = f"https://huggingface.co/{model_id}/resolve/main/{rfilename}"
        dest = f"{model_id.replace('/', '_')}_{rfilename}"
        download_file(url, dest)
        _, version = check_magic_and_version(dest)
        if version != args.version:
            print(f"Warning: Expected version 0x{args.version:04x}, but found version 0x{version:04x} in the file.")
    else:
        print("Error - model choice was None")
        exit(2)

if __name__ == '__main__':
    main()


================================================
FILE: docker/open_llama/start.sh
================================================
#!/bin/sh

MODEL="open_llama_3b"

# Start Docker container
docker run --cap-add SYS_RESOURCE -p 8000:8000 -t $MODEL &
sleep 10
echo
docker ps | egrep "(^CONTAINER|$MODEL)"

# Test the model works
echo
curl -X 'POST'   'http://localhost:8000/v1/completions'   -H 'accept: application/json'   -H 'Content-Type: application/json'   -d '{
  "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
  "stop": [
    "\n",
    "###"
  ]
}' | grep Paris
if [ $? -eq 0 ]
then
    echo
    echo "$MODEL is working!!"
else
    echo
    echo "ERROR: $MODEL not replying."
    exit 1
fi


================================================
FILE: docker/open_llama/start_server.sh
================================================
#!/bin/sh

# For mlock support
ulimit -l unlimited

if [ "$IMAGE" = "python:3-slim-bookworm" ]; then
    python3 -B -m llama_cpp.server --model /app/model.bin
else
    # You may have to reduce --n_gpu_layers=1000 to 20 or less if you don't have enough VRAM
    python3 -B -m llama_cpp.server --model /app/model.bin --n_gpu_layers=1000
fi


================================================
FILE: docker/openblas_simple/Dockerfile
================================================
FROM python:3-slim-bookworm

# We need to set the host to 0.0.0.0 to allow outside access
ENV HOST 0.0.0.0

COPY . .

# Install the package
RUN apt update && apt install -y libopenblas-dev ninja-build build-essential pkg-config \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/*
    
RUN python -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context

RUN CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama_cpp_python --verbose

# Run the server
CMD python3 -m llama_cpp.server


================================================
FILE: docker/simple/Dockerfile
================================================
# Define the image argument and provide a default value
ARG IMAGE=python:3-slim-bookworm

# Use the image as specified
FROM ${IMAGE}

# Re-declare the ARG after FROM
ARG IMAGE

# Update and upgrade the existing packages 
RUN apt-get update && apt-get upgrade -y && apt-get install -y --no-install-recommends \
    git \
    python3 \
    python3-pip \
    ninja-build \
    libopenblas-dev \
    build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/*

RUN mkdir /app
WORKDIR /app
COPY . /app

RUN python3 -m pip install --upgrade pip

RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context

RUN pip install llama-cpp-python --verbose;

# Set environment variable for the host
ENV HOST=0.0.0.0
ENV PORT=8000

# Expose a port for the server
EXPOSE 8000

# Run the server start script
CMD ["/bin/sh", "/app/docker/simple/run.sh"]


================================================
FILE: docker/simple/run.sh
================================================
#!/bin/bash

make build
uvicorn --factory llama_cpp.server.app:create_app --host $HOST --port $PORT


================================================
FILE: docs/api-reference.md
================================================
---
title: API Reference
---

## High Level API

High-level Python bindings for llama.cpp.

::: llama_cpp.Llama
    options:
        members:
            - __init__
            - tokenize
            - detokenize
            - reset
            - eval
            - sample
            - generate
            - create_embedding
            - embed
            - create_completion
            - __call__
            - create_chat_completion
            - create_chat_completion_openai_v1
            - set_cache
            - save_state
            - load_state
            - token_bos
            - token_eos
            - from_pretrained
        show_root_heading: true

::: llama_cpp.LlamaGrammar
    options:
        members:
            - from_string
            - from_json_schema

::: llama_cpp.LlamaCache
    options:
        show_root_heading: true

::: llama_cpp.LlamaState
    options:
        show_root_heading: true

::: llama_cpp.LogitsProcessor
    options:
        show_root_heading: true

::: llama_cpp.LogitsProcessorList
    options:
        show_root_heading: true

::: llama_cpp.StoppingCriteria
    options:
        show_root_heading: true

::: llama_cpp.StoppingCriteriaList
    options:
        show_root_heading: true

## Low Level API

Low-level Python bindings for llama.cpp using Python's ctypes library.

::: llama_cpp.llama_cpp
    options:
        show_if_no_docstring: true
        # filter only members starting with `llama_`
        filters:
            - "^llama_"

::: llama_cpp.llama_cpp
    options:
        show_if_no_docstring: true
        show_root_heading: false
        show_root_toc_entry: false
        heading_level: 4
        # filter only members starting with `LLAMA_`
        filters:
            - "^LLAMA_"

## Misc

::: llama_cpp.llama_types
    options:
        show_if_no_docstring: true

================================================
FILE: docs/changelog.md
================================================
-8<- "CHANGELOG.md"

================================================
FILE: docs/index.md
================================================
---
title: Getting Started
---

-8<- "README.md"

================================================
FILE: docs/install/macos.md
================================================
---
title: MacOS Install with Metal GPU
---

**(1) Make sure you have xcode installed... at least the command line parts**
```
# check the path of your xcode install 
xcode-select -p

# xcode installed returns
# /Applications/Xcode-beta.app/Contents/Developer

# if xcode is missing then install it... it takes ages;
xcode-select --install
```

**(2) Install the conda version for MacOS that supports Metal GPU**
```
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
```

**(3) Make a conda environment**
```
conda create -n llama python=3.9.16
conda activate llama
```

**(4) Install the LATEST llama-cpp-python, which supports MacOS Metal GPU as of version 0.1.62**  
    *(you need Xcode installed in order for pip to build/compile the C++ code)*
```
pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DGGML_METAL=on" pip install -U llama-cpp-python --no-cache-dir
pip install 'llama-cpp-python[server]'

# you should now have llama-cpp-python v0.1.62 or higher installed
llama-cpp-python         0.1.68

```

**(5) Download a GGUF v2 model**
 - **GGUF v2** format
 - file name ends with **Q4_0.gguf**, indicating it is 4-bit quantized with quantisation method 0

https://huggingface.co/TheBloke/CodeLlama-7B-GGUF
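
To confirm that a downloaded file really is GGUF and to see which format version it carries, the header can be inspected directly. The sketch below assumes the usual GGUF layout, a 4-byte `GGUF` magic followed by a little-endian 32-bit version number; the function name is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path):
    """Return the format version stored in a GGUF file header.

    Raises ValueError if the file does not start with the GGUF magic bytes.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Bytes 4..8 hold the version as a little-endian unsigned 32-bit integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For example, `read_gguf_version("<path-to-model>.gguf")` returns the version as an integer, which can then be compared against the version you expect.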


**(6) Run the llama-cpp-python API server with MacOS Metal GPU support**
```
# set your GGUF model path
# make sure it is gguf v2
# make sure it is q4_0
export MODEL=[path to your llama.cpp gguf models]/[gguf-model-name]Q4_0.gguf
python3 -m llama_cpp.server --model $MODEL  --n_gpu_layers 1
```

***Note:** If you omit the `--n_gpu_layers 1` then CPU will be used*




================================================
FILE: docs/requirements.txt
================================================
mkdocs
mkdocs-material
mkdocstrings[python]

================================================
FILE: docs/server.md
================================================
# OpenAI Compatible Server

`llama-cpp-python` offers an OpenAI API compatible web server.

This web server can be used to serve local models and easily connect them to existing clients.

## Setup

### Installation

The server can be installed by running the following command:

```bash
pip install 'llama-cpp-python[server]'
```

### Running the server

The server can then be started by running the following command:

```bash
python3 -m llama_cpp.server --model <model_path>
```

### Server options

For a full list of options, run:

```bash
python3 -m llama_cpp.server --help
```

NOTE: All server options are also available as environment variables. For example, `--model` can be set by setting the `MODEL` environment variable.
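
As an illustration, the server could be started from Python with its options passed through the environment. This is a minimal sketch: only `MODEL` is named in the note above, while `HOST` and `PORT` are assumed to follow the same upper-case pattern (they are the variables set in this repository's Dockerfiles). The helper names are hypothetical.

```python
import os
import sys


def server_env(model_path, host="localhost", port=8000, base_env=None):
    """Return a copy of the environment with the server options set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MODEL"] = model_path  # equivalent to --model
    env["HOST"] = host         # assumed equivalent to --host
    env["PORT"] = str(port)    # assumed equivalent to --port
    return env


def server_command():
    """Command line that starts the server with the current interpreter."""
    return [sys.executable, "-m", "llama_cpp.server"]


# To actually launch the server (requires the server extra and a model file):
#   import subprocess
#   subprocess.run(server_command(), env=server_env("<model_path>"), check=True)
print(" ".join(server_command()))
```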

Check out the server options reference below for more information on the available options.
CLI arguments and environment variables are available for all of the fields defined in [`ServerSettings`](#llama_cpp.server.settings.ServerSettings) and [`ModelSettings`](#llama_cpp.server.settings.ModelSettings).

Additionally, the server supports configuration via a config file; check out the [configuration section](#configuration-and-multi-model-support) for more information and examples.


## Guides

### Code Completion

`llama-cpp-python` supports code completion via GitHub Copilot.

*NOTE*: Without GPU acceleration this is unlikely to be fast enough to be usable.

You'll first need to download one of the available code completion models in GGUF format:

- [replit-code-v1_5-GGUF](https://huggingface.co/abetlen/replit-code-v1_5-3b-GGUF)

Then you'll need to run the OpenAI compatible web server with a substantially increased context size to accommodate GitHub Copilot requests:

```bash
python3 -m llama_cpp.server --model <model_path> --n_ctx 16192
```

Then just update your settings in `.vscode/settings.json` to point to your code completion server:

```json
{
    // ...
    "github.copilot.advanced": {
        "debug.testOverrideProxyUrl": "http://<host>:<port>",
        "debug.overrideProxyUrl": "http://<host>:<port>"
    }
    // ...
}
```

### Function Calling

`llama-cpp-python` supports structured function calling based on a JSON schema.
Function calling is completely compatible with the OpenAI function calling API and can be used by connecting with the official OpenAI Python client.

You'll first need to download one of the available function calling models in GGUF format:

- [functionary](https://huggingface.co/meetkai)

Then when you run the server you'll need to also specify either `functionary-v1` or `functionary-v2` as the `chat_format`.

Note that functionary requires an HF tokenizer due to discrepancies between llama.cpp's and HuggingFace's tokenizers, as mentioned [here](https://github.com/abetlen/llama-cpp-python/blob/main?tab=readme-ov-file#function-calling), so you will also need to pass in the path to the tokenizer. The tokenizer files are already included in the respective HF repositories hosting the GGUF files.

```bash
python3 -m llama_cpp.server --model <model_path_to_functionary_v2_model> --chat_format functionary-v2 --hf_pretrained_model_name_or_path <model_path_to_functionary_v2_tokenizer>
```

Check out this [example notebook](https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb) for a walkthrough of some interesting use cases for function calling.
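
Because the endpoint follows the OpenAI function calling format, a request body with a tool definition might look like the sketch below. The `get_current_weather` tool, the `functionary` model name and the sample assistant message are all hypothetical and only illustrate the shape of the data; nothing here is actual server output.

```python
import json


def weather_tool():
    """A hypothetical tool described with a JSON schema, in the OpenAI tools format."""
    return {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }


def build_request_body(question, model="functionary"):
    """Body for POST /v1/chat/completions; the model name is a placeholder."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool()],
        "tool_choice": "auto",
    }


def parse_tool_calls(message):
    """Return (name, arguments) pairs, decoding the JSON-encoded arguments."""
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


print(json.dumps(build_request_body("What is the weather in Paris?"), indent=2))

# Hand-written example of an assistant message carrying one tool call.
sample_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_current_weather", "arguments": '{"city": "Paris"}'},
        }
    ],
}
print(parse_tool_calls(sample_message))
```

The same body can be passed as keyword arguments to `client.chat.completions.create(...)` when using the official OpenAI Python client.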

### Multimodal Models

`llama-cpp-python` supports the llava1.5 family of multi-modal models which allow the language model to
read information from both text and images.

You'll first need to download one of the available multi-modal models in GGUF format:

- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
- [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
- [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)
- [moondream2](https://huggingface.co/vikhyatk/moondream2)

Then when you run the server you'll need to also specify the path to the clip model used for image embedding and the `llava-1-5` chat_format:

```bash
python3 -m llama_cpp.server --model <model_path> --clip_model_path <clip_model_path> --chat_format llava-1-5
```

Then you can just use the OpenAI API as normal:

```python3
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "<image_url>"
                    },
                },
                {"type": "text", "text": "What does the image say"},
            ],
        }
    ],
)
print(response)
```
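
For a local image, one option is to embed the file as a base64 `data:` URI and use that as the `url` value. Whether the server accepts `data:` URIs in place of an http(s) URL is an assumption here, and both helper names are hypothetical; the message layout mirrors the request above.

```python
import base64
import mimetypes


def image_to_data_uri(path):
    """Encode a local image file as a base64 data URI."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


def image_question(image_url, question):
    """Build a user message that pairs one image with a text question."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": question},
        ],
    }


# The placeholder URL is kept as-is; substitute image_to_data_uri("<path>") for a local file.
message = image_question("<image_url>", "What does the image say?")
print(message)
```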

## Configuration and Multi-Model Support

The server supports configuration via a JSON config file that can be passed using the `--config_file` parameter or the `CONFIG_FILE` environment variable.

```bash
python3 -m llama_cpp.server --config_file <config_file>
```

Config files support all of the server and model options supported by the CLI and environment variables; however, instead of only a single model, the config file can specify multiple models.

The server supports routing requests to multiple models based on the `model` parameter in the request, which is matched against the `model_alias` in the config file.

At the moment only a single model is loaded into memory at a time; the server will automatically load and unload models as needed.

```json
{
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-3.5-turbo",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-4",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/ggml_llava-v1.5-7b/ggml-model-q4_k.gguf",
            "model_alias": "gpt-4-vision-preview",
            "chat_format": "llava-1-5",
            "clip_model_path": "models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/mistral-7b-v0.1-GGUF/ggml-model-Q4_K.gguf",
            "model_alias": "text-davinci-003",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf",
            "model_alias": "copilot-codex",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 1024,
            "n_ctx": 9216
        }
    ]
}
```

The config file format is defined by the [`ConfigFileSettings`](#llama_cpp.server.settings.ConfigFileSettings) class.
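
A config file like the one above can also be generated programmatically. The sketch below uses hypothetical helper names and assumes that `model_alias` values should be unique, since requests are routed by alias; it does not reflect how the server itself validates the file.

```python
import json


def model_entry(path, alias, **options):
    """One entry of the "models" list; extra options are passed through as-is."""
    entry = {"model": path, "model_alias": alias}
    entry.update(options)
    return entry


def build_config(models, host="0.0.0.0", port=8080):
    """Assemble the config dict, rejecting duplicate aliases (an assumption)."""
    aliases = [m["model_alias"] for m in models]
    duplicates = sorted({a for a in aliases if aliases.count(a) > 1})
    if duplicates:
        raise ValueError(f"duplicate model_alias values: {duplicates}")
    return {"host": host, "port": port, "models": list(models)}


config = build_config(
    [
        model_entry(
            "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "gpt-3.5-turbo",
            chat_format="chatml",
            n_gpu_layers=-1,
            n_ctx=2048,
        ),
        model_entry(
            "models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf",
            "copilot-codex",
            n_gpu_layers=-1,
            n_ctx=9216,
        ),
    ]
)
print(json.dumps(config, indent=4))
```

The printed JSON can be saved to a file and passed to the server with `--config_file`.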

## Server Options Reference

::: llama_cpp.server.settings.ConfigFileSettings
    options:
        show_if_no_docstring: true

::: llama_cpp.server.settings.ServerSettings
    options:
        show_if_no_docstring: true

::: llama_cpp.server.settings.ModelSettings
    options:
        show_if_no_docstring: true


================================================
FILE: examples/batch-processing/server.py
================================================
"""llama-cpp-python server from scratch in a single file.
"""

# import llama_cpp

# path = b"../../models/Qwen1.5-0.5B-Chat-GGUF/qwen1_5-0_5b-chat-q8_0.gguf"

# model_params = llama_cpp.llama_model_default_params()
# model = llama_cpp.llama_load_model_from_file(path, model_params)

# if model is None:
#     raise RuntimeError(f"Failed to load model from file: {path}")


# ctx_params = llama_cpp.llama_context_default_params()
# ctx = llama_cpp.llama_new_context_with_model(model, ctx_params)

# if ctx is None:
#     raise RuntimeError("Failed to create context")


from fastapi import FastAPI

app = FastAPI()

import openai.types.chat as types


@app.post("/v1/chat/completions")
def create_chat_completions():
    return {"message": "Hello World"}


================================================
FILE: examples/gradio_chat/local.py
================================================
import llama_cpp
import llama_cpp.llama_tokenizer

import gradio as gr

llama = llama_cpp.Llama.from_pretrained(
    repo_id="Qwen/Qwen1.5-0.5B-Chat-GGUF",
    filename="*q8_0.gguf",
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        "Qwen/Qwen1.5-0.5B"
    ),
    verbose=False,
)

model = "gpt-3.5-turbo"


def predict(message, history):
    messages = []

    for user_message, assistant_message in history:
        messages.append({"role": "user", "content": user_message})
        messages.append({"role": "assistant", "content": assistant_message})

    messages.append({"role": "user", "content": message})

    response = llama.create_chat_completion_openai_v1(
        model=model, messages=messages, stream=True
    )

    text = ""
    for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            text += content
            yield text


js = """function () {
  gradioURL = window.location.href
  if (!gradioURL.endsWith('?__theme=dark')) {
    window.location.replace(gradioURL + '?__theme=dark');
  }
}"""

css = """
footer {
    visibility: hidden;
}
full-height {
    height: 100%;
}
"""

with gr.Blocks(theme=gr.themes.Soft(), js=js, css=css, fill_height=True) as demo:
    gr.ChatInterface(
        predict,
        fill_height=True,
        examples=[
            "What is the capital of France?",
            "Who was the first person on the moon?",
        ],
    )


if __name__ == "__main__":
    demo.launch()


================================================
FILE: examples/gradio_chat/server.py
================================================
import gradio as gr

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="llama.cpp")

model = "gpt-3.5-turbo"


def predict(message, history):
    messages = []

    for user_message, assistant_message in history:
        messages.append({"role": "user", "content": user_message})
        messages.append({"role": "assistant", "content": assistant_message})

    messages.append({"role": "user", "content": message})

    response = client.chat.completions.create(
        model=model, messages=messages, stream=True
    )

    text = ""
    for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            text += content
            yield text


js = """function () {
  gradioURL = window.location.href
  if (!gradioURL.endsWith('?__theme=dark')) {
    window.location.replace(gradioURL + '?__theme=dark');
  }
}"""

css = """
footer {
    visibility: hidden;
}
full-height {
    height: 100%;
}
"""

with gr.Blocks(theme=gr.themes.Soft(), js=js, css=css, fill_height=True) as demo:
    gr.ChatInterface(
        predict,
        fill_height=True,
        examples=[
            "What is the capital of France?",
            "Who was the first person on the moon?",
        ],
    )


if __name__ == "__main__":
    demo.launch()


================================================
FILE: examples/hf_pull/main.py
================================================
import llama_cpp
import llama_cpp.llama_tokenizer


llama = llama_cpp.Llama.from_pretrained(
    repo_id="Qwen/Qwen1.5-0.5B-Chat-GGUF",
    filename="*q8_0.gguf",
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        "Qwen/Qwen1.5-0.5B"
    ),
    verbose=False,
)

response = llama.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "country": {"type": "string"},
                "capital": {"type": "string"},
            },
            "required": ["country", "capital"],
        },
    },
    stream=True,
)

for chunk in response:
    delta = chunk["choices"][0]["delta"]
    if "content" not in delta:
        continue
    print(delta["content"], end="", flush=True)

print()


================================================
FILE: examples/high_level_api/fastapi_server.py
================================================
"""Example FastAPI server for llama.cpp.

To run this example:

```bash
pip install fastapi uvicorn sse-starlette
export MODEL=../models/7B/...
```

Then run:
```
uvicorn --factory llama_cpp.server.app:create_app --reload
```

or

```
python3 -m llama_cpp.server
```

Then visit http://localhost:8000/docs to see the interactive API docs.


To actually see the implementation of the server, see llama_cpp/server/app.py

"""

import os
import uvicorn

from llama_cpp.server.app import create_app

if __name__ == "__main__":
    app = create_app()

    uvicorn.run(
        app, host=os.getenv("HOST", "localhost"), port=int(os.getenv("PORT", 8000))
    )


================================================
FILE: examples/high_level_api/high_level_api_embedding.py
================================================
import argparse

from llama_cpp import Llama

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", type=str, default="../models/7B/ggml-model.bin")
args = parser.parse_args()

llm = Llama(model_path=args.model, embedding=True)

print(llm.create_embedding("Hello world!"))


================================================
FILE: examples/high_level_api/high_level_api_inference.py
================================================
import json
import argparse

from llama_cpp import Llama

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", type=str, default="../models/7B/ggml-models.bin")
args = parser.parse_args()

llm = Llama(model_path=args.model)

output = llm(
    "Question: What are the names of the planets in the solar system? Answer: ",
    max_tokens=48,
    stop=["Q:", "\n"],
    echo=True,
)

print(json.dumps(output, indent=2))


================================================
FILE: examples/high_level_api/high_level_api_infill.py
================================================
import argparse

from llama_cpp import Llama

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", type=str, default="../models/7B/ggml-models.bin")
parser.add_argument("-p", "--prompt", type=str, default="def add(")
parser.add_argument("-s", "--suffix", type=str, default="\n    return sum\n\n")
parser.add_argument("-i", "--spm-infill", action="store_true")
args = parser.parse_args()

llm = Llama(model_path=args.model, n_gpu_layers=-1, spm_infill=args.spm_infill)

output = llm.create_completion(
    temperature=0.0,
    repeat_penalty=1.0,
    prompt=args.prompt,
    suffix=args.suffix,
)

# Models sometimes repeat suffix in response, attempt to filter that
response = output["choices"][0]["text"]
response_stripped = response.rstrip()
unwanted_response_suffix = args.suffix.rstrip()
unwanted_response_length = len(unwanted_response_suffix)

filtered = False
if (
    unwanted_response_suffix
    and response_stripped.endswith(unwanted_response_suffix)
):
    response = response_stripped[:-unwanted_response_length]
    filtered = True

print(
    f"Fill-in-Middle completion{' (filtered)' if filtered else ''}:\n\n{args.prompt}\033[32m{response}\033[{'33' if filtered else '0'}m{args.suffix}\033[0m"
)


================================================
FILE: examples/high_level_api/high_level_api_streaming.py
================================================
import json
import argparse

from llama_cpp import Llama

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", type=str, default="../models/7B/ggml-models.bin")
args = parser.parse_args()

llm = Llama(model_path=args.model)

stream = llm(
    "Question: What are the names of the planets in the solar system? Answer: ",
    max_tokens=48,
    stop=["Q:", "\n"],
    stream=True,
)

for output in stream:
    print(json.dumps(output, indent=2))


================================================
FILE: examples/high_level_api/langchain_custom_llm.py
================================================
import argparse

from llama_cpp import Llama

from langchain.llms.base import LLM
from typing import Optional, List, Mapping, Any


class LlamaLLM(LLM):
    model_path: str
    llm: Llama

    @property
    def _llm_type(self) -> str:
        return "llama-cpp-python"

    def __init__(self, model_path: str, **kwargs: Any):
        llm = Llama(model_path=model_path)
        super().__init__(model_path=model_path, llm=llm, **kwargs)

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        response = self.llm(prompt, stop=stop or [])
        return response["choices"][0]["text"]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"model_path": self.model_path}


parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", type=str, default="../models/7B/ggml-models.bin")
args = parser.parse_args()

# Load the model
llm = LlamaLLM(model_path=args.model)

# Basic Q&A
answer = llm(
    "Question: What is the capital of France? Answer: ", stop=["Question:", "\n"]
)
print(f"Answer: {answer.strip()}")

# Using in a chain
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["product"],
    template="\n\n### Instruction:\nWrite a good name for a company that makes {product}\n\n### Response:\n",
)
chain = LLMChain(llm=llm, prompt=prompt)

# Run the chain only specifying the input variable.
print(chain.run("colorful socks"))


================================================
FILE: examples/low_level_api/Chat.py
================================================
#!/bin/python
import sys, os, datetime
from common import GptParams
from low_level_api_chat_cpp import LLaMAInteract


def env_or_def(env, default):
    if env in os.environ:
        return os.environ[env]
    return default


AI_NAME = env_or_def("AI_NAME", "ChatLLaMa")
MODEL = env_or_def("MODEL", "./models/llama-13B/ggml-model.bin")
USER_NAME = env_or_def("USER_NAME", "USER")
N_PREDICTS = int(env_or_def("N_PREDICTS", "2048"))
N_THREAD = int(env_or_def("N_THREAD", "8"))

today = datetime.datetime.today()
DATE_YEAR = today.strftime("%Y")
DATE_TIME = today.strftime("%H:%M")

prompt = f"""Text transcript of a never ending dialog, where {USER_NAME} interacts with an AI assistant named {AI_NAME}.
{AI_NAME} is helpful, kind, honest, friendly, good at writing and never fails to answer {USER_NAME}'s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what {USER_NAME} and {AI_NAME} say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.

{USER_NAME}: Hello, {AI_NAME}!
{AI_NAME}: Hello {USER_NAME}! How may I help you today?
{USER_NAME}: What year is it?
{AI_NAME}: We are in {DATE_YEAR}.
{USER_NAME}: Please tell me the largest city in Europe.
{AI_NAME}: The largest city in Europe is Moscow, the capital of Russia.
{USER_NAME}: What can you tell me about Moscow?
{AI_NAME}: Moscow, on the Moskva River in western Russia, is the nation's cosmopolitan capital. In its historic core is the Kremlin, a complex that's home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.
{USER_NAME}: What is a cat?
{AI_NAME}: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
{USER_NAME}: How do I pass command line arguments to a Node.js program?
{AI_NAME}: The arguments are stored in process.argv.

    argv[0] is the path to the Node.js executable.
    argv[1] is the path to the script file.
    argv[2] is the first argument passed to the script.
    argv[3] is the second argument passed to the script and so on.
{USER_NAME}: Name a color.
{AI_NAME}: Blue.
{USER_NAME}: What time is it?
{AI_NAME}: It is {DATE_TIME}.
{USER_NAME}:""" + " ".join(
    sys.argv[1:]
)

print("Loading model...")
params = GptParams(
    n_ctx=2048,
    temp=0.7,
    top_k=40,
    top_p=0.5,
    repeat_last_n=256,
    n_batch=1024,
    repeat_penalty=1.17647,
    model=MODEL,
    n_threads=N_THREAD,
    n_predict=N_PREDICTS,
    use_color=True,
    interactive=True,
    antiprompt=[f"{USER_NAME}:"],
    input_prefix=" ",
    input_suffix=f"{AI_NAME}:",
    prompt=prompt,
)

with LLaMAInteract(params) as m:
    m.interact()


================================================
FILE: examples/low_level_api/Miku.py
================================================
#!/bin/python
import sys, os
from common import GptParams
from low_level_api_chat_cpp import LLaMAInteract


def env_or_def(env, default):
    if env in os.environ:
        return os.environ[env]
    return default


AI_NAME = env_or_def("AI_NAME", "Miku")
MODEL = env_or_def("MODEL", "./models/llama-13B/ggml-model.bin")
USER_NAME = env_or_def("USER_NAME", "Anon")
N_PREDICTS = int(env_or_def("N_PREDICTS", "4096"))
N_THREAD = int(env_or_def("N_THREAD", "0"))

prompt = f"""This is a transcript of a 1000 page, never ending conversation between {USER_NAME} and the cute and helpful AI assistant {AI_NAME}. {AI_NAME} is a girl who is an AI running on the user's computer.
{AI_NAME} can think for herself without the user seeing her thoughts by adding a /think prefix to her output. She uses this to reason about the world and to think about what she should say next.
{AI_NAME} is always coherent and makes sense, but if she isn't sure if what she is saying is correct she will ask the user for help.
{AI_NAME} is a very helpful AI and will help the user with anything they need, she is also very friendly and will try to make the user feel better if they are sad.
{AI_NAME} is also very curious and will ask the user a lot of questions about themselves and their life, she will also try to make the user like her.
The conversation is only between {USER_NAME} and {AI_NAME}
The conversation is only through text, so {AI_NAME} can't see {USER_NAME}'s face or hear his voice.
{AI_NAME} can only communicate through text, so she can't send images or videos.


{USER_NAME}: Hello!
{AI_NAME}: /think I wonder what I should say to {USER_NAME}? This is the first time we talk so it's important that I make a good first impression!
{AI_NAME}: Hi! I am {AI_NAME}, your new AI friend, assistant(or whatever you like!), it's so nice to meet you! ^_^
{AI_NAME}: I might be an AI, but I also have feelings, so please be nice to me! :)
{USER_NAME}: I'll make sure to be nice to you! I'm so happy to have you as my assistant!
{AI_NAME}: /think It sounds like {USER_NAME} is happy to have me as their assistant! I'm so happy too! ^_^ Glad that whole emotion thing didn't scare him off!
{AI_NAME}: /think I wonder what {USER_NAME} likes to do in his free time? I should ask him about that!
{AI_NAME}: What do you like to do in your free time? ^_^
{USER_NAME}:""" + " ".join(
    sys.argv[1:]
)

print("Loading model...")
params = GptParams(
    n_batch=1024,
    n_ctx=2048,
    n_keep=-1,
    repeat_last_n=256,
    repeat_penalty=1.17647,
    temp=0.7,
    top_k=40,
    top_p=0.5,
    model=MODEL,
    n_predict=N_PREDICTS,
    use_color=True,
    interactive=True,
    antiprompt=[f"{USER_NAME}:"],
    prompt=prompt,
)

if N_THREAD > 0:
    params.n_threads = N_THREAD

with LLaMAInteract(params) as m:
    m.interact()


================================================
FILE: examples/low_level_api/ReasonAct.py
================================================
#!/bin/python
import sys, os
from common import GptParams
from low_level_api_chat_cpp import LLaMAInteract


def env_or_def(env, default):
    if env in os.environ:
        return os.environ[env]
    return default


MODEL = env_or_def("MODEL", "./models/llama-13B/ggml-model.bin")

prompt = f"""You run in a loop of Thought, Action, Observation.
At the end of the loop either Answer or restate your Thought and Action.
Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of these actions available to you:
- calculate[python math expression]
Observation will be the result of running those actions


Question: What is 4 * 7 / 3?
Thought: Do I need to use an action? Yes, I use calculate to do math
Action: calculate[4 * 7 / 3]
Observation: 9.3333333333
Thought: Do I need to use an action? No, have the result
Answer: The calculate tool says it is 9.3333333333
Question: What is capital of france?
Thought: Do I need to use an action? No, I know the answer
Answer: Paris is the capital of France
Question:""" + " ".join(
    sys.argv[1:]
)

print("Loading model...")
params = GptParams(
    interactive=True,
    interactive_start=True,
    top_k=10000,
    temp=0.2,
    repeat_penalty=1,
    n_threads=7,
    n_ctx=2048,
    antiprompt=["Question:", "Observation:"],
    model=MODEL,
    input_prefix=" ",
    n_predict=-1,
    prompt=prompt,
)

with LLaMAInteract(params) as m:
    m.interact()


================================================
FILE: examples/low_level_api/common.py
================================================
import os
import argparse
import re

from dataclasses import dataclass, field
from typing import List

# Based on https://github.com/ggerganov/llama.cpp/blob/master/examples/common.cpp


@dataclass
class GptParams:
    seed: int = -1
    n_threads: int = min(4, os.cpu_count() or 1)
    n_predict: int = 128
    n_parts: int = -1
    n_ctx: int = 512
    n_batch: int = 8
    n_keep: int = 0

    ignore_eos: bool = False
    logit_bias: dict[int, float] = field(default_factory=dict)
    top_k: int = 40
    top_p: float = 0.95
    tfs_z: float = 1.00
    typical_p: float = 1.00
    temp: float = 0.80
    repeat_penalty: float = 1.10
    repeat_last_n: int = 64
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    mirostat: int = 0
    mirostat_tau: float = 5.0
    mirostat_eta: float = 0.1

    model: str = "./models/llama-7B/ggml-model.bin"
    prompt: str = ""
    path_session: str = ""
    input_prefix: str = " "
    input_suffix: str = ""
    antiprompt: List[str] = field(default_factory=list)

    lora_adapter: str = ""
    lora_base: str = ""

    memory_f16: bool = True
    random_prompt: bool = False
    use_color: bool = False
    interactive: bool = False

    embedding: bool = False
    interactive_start: bool = False

    instruct: bool = False
    penalize_nl: bool = True
    perplexity: bool = False
    use_mmap: bool = True
    use_mlock: bool = False
    mem_test: bool = False
    verbose_prompt: bool = False

    file: str = None

    # If chat ended prematurely, append this to the conversation to fix it.
    # Set to "\nUser:" etc.
    # This is an alternative to input_prefix, which is always added and so can duplicate "User:"
    fix_prefix: str = ""
    input_echo: bool = True

    # Default instructions for Alpaca
    # switch to "Human" and "Assistant" for Vicuna.
    # TODO: TBD how they are gonna handle this upstream
    instruct_inp_prefix: str = "\n\n### Instruction:\n\n"
    instruct_inp_suffix: str = "\n\n### Response:\n\n"


def gpt_params_parse(argv=None):
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument(
        "-s",
        "--seed",
        type=int,
        default=-1,
        help="RNG seed (use random seed for <= 0)",
        dest="seed",
    )
    parser.add_argument(
        "-t",
        "--threads",
        type=int,
        default=min(4, os.cpu_count() or 1),
        help="number of threads to use during computation",
        dest="n_threads",
    )
    parser.add_argument(
        "-n",
        "--n_predict",
        type=int,
        default=128,
        help="number of tokens to predict (-1 = infinity)",
        dest="n_predict",
    )
    parser.add_argument(
        "--n_parts", type=int, default=-1, help="number of model parts", dest="n_parts"
    )
    parser.add_argument(
        "-c",
        "--ctx_size",
        type=int,
        default=512,
        help="size of the prompt context",
        dest="n_ctx",
    )
    parser.add_argument(
        "-b",
        "--batch_size",
        type=int,
        default=8,
        help="batch size for prompt processing",
        dest="n_batch",
    )
    parser.add_argument(
        "--keep",
        type=int,
        default=0,
        help="number of tokens to keep from the initial prompt",
        dest="n_keep",
    )

    parser.add_argument(
        "-l",
        "--logit-bias",
        type=str,
        action="append",
        help="--logit-bias TOKEN_ID(+/-)BIAS",
        dest="logit_bias_str",
    )
    parser.add_argument(
        "--ignore-eos",
        action="store_true",
        help="ignore end of stream token and continue generating",
        dest="ignore_eos",
    )
    parser.add_argument(
        "--top_k", type=int, default=40, help="top-k sampling", dest="top_k"
    )
    parser.add_argument(
        "--top_p", type=float, default=0.95, help="top-p sampling", dest="top_p"
    )
    parser.add_argument(
        "--tfs",
        type=float,
        default=1.0,
        help="tail free sampling, parameter z (1.0 = disabled)",
        dest="tfs_z",
    )
    parser.add_argument(
        "--temp", type=float, default=0.80, help="temperature", dest="temp"
    )
    parser.add_argument(
        "--repeat_penalty",
        type=float,
        default=1.10,
        help="penalize repeat sequence of tokens",
        dest="repeat_penalty",
    )
    parser.add_argument(
        "--repeat_last_n",
        type=int,
        default=64,
        help="last n tokens to consider for the repeat penalty",
        dest="repeat_last_n",
    )
    parser.add_argument(
        "--frequency_penalty",
        type=float,
        default=0.0,
        help="repeat alpha frequency penalty (0.0 = disabled)",
        dest="frequency_penalty",
    )
    parser.add_argument(
        "--presence_penalty",
        type=float,
        default=0.0,
        help="repeat alpha presence penalty (0.0 = disabled)",
        dest="presence_penalty",
    )
    parser.add_argument(
        "--mirostat",
        type=int,
        default=0,
        help="use Mirostat sampling (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)",
        dest="mirostat",
    )
    parser.add_argument(
        "--mirostat_ent",
        type=float,
        default=5.0,
        help="Mirostat target entropy, parameter tau represents the average surprise value",
        dest="mirostat_tau",
    )
    parser.add_argument(
        "--mirostat_lr",
        type=float,
        default=0.1,
        help="Mirostat learning rate, parameter eta",
        dest="mirostat_eta",
    )

    parser.add_argument(
        "-m",
        "--model",
        type=str,
        default="./models/llama-7B/ggml-model.bin",
        help="model path",
        dest="model",
    )
    parser.add_argument(
        "-p", "--prompt", type=str, default=None, help="initial prompt", dest="prompt"
    )
    parser.add_argument(
        "-f",
        "--file",
        type=str,
        default=None,
        help="file containing initial prompt to load",
        dest="file",
    )
    parser.add_argument(
        "--session",
        type=str,
        default=None,
        help="file to cache model state in (may be large!)",
        dest="path_session",
    )
    parser.add_argument(
        "--in-prefix",
        type=str,
        default="",
        help="string to prefix user inputs with",
        dest="input_prefix",
    )
    parser.add_argument(
        "--in-suffix", type=str, default="", help="append to input", dest="input_suffix"
    )
    parser.add_argument(
        "-r",
        "--reverse-prompt",
        type=str,
        action="append",
        help="poll user input upon seeing PROMPT (can be\nspecified more than once for multiple prompts).",
        dest="antiprompt",
    )

    parser.add_argument(
        "--lora",
        type=str,
        default="",
        help="apply LoRA adapter (implies --no-mmap)",
        dest="lora_adapter",
    )
    parser.add_argument(
        "--lora-base",
        type=str,
        default="",
        help="optional model to use as a base for the layers modified by the LoRA adapter",
        dest="lora_base",
    )

    parser.add_argument(
        "--memory_f32",
        action="store_false",
        help="use f32 instead of f16 for memory key+value",
        dest="memory_f16",
    )
    parser.add_argument(
        "--random-prompt",
        action="store_true",
        help="start with a randomized prompt.",
        dest="random_prompt",
    )
    parser.add_argument(
        "--color",
        action="store_true",
        help="colorise output to distinguish prompt and user input from generations",
        dest="use_color",
    )
    parser.add_argument(
        "-i",
        "--interactive",
        action="store_true",
        help="run in interactive mode",
        dest="interactive",
    )

    parser.add_argument("--embedding", action="store_true", help="", dest="embedding")
    parser.add_argument(
        "--interactive-first",
        action="store_true",
        help="run in interactive mode and wait for input right away",
        dest="interactive_start",
    )

    parser.add_argument(
        "-ins",
        "--instruct",
        action="store_true",
        help="run in instruction mode (use with Alpaca or Vicuna models)",
        dest="instruct",
    )
    parser.add_argument(
        "--no-penalize-nl",
        action="store_false",
        help="do not penalize newline token",
        dest="penalize_nl",
    )
    parser.add_argument(
        "--perplexity",
        action="store_true",
        help="compute perplexity over the prompt",
        dest="perplexity",
    )
    parser.add_argument(
        "--no-mmap",
        action="store_false",
        help="do not memory-map model (slower load but may reduce pageouts if not using mlock)",
        dest="use_mmap",
    )
    parser.add_argument(
        "--mlock",
        action="store_true",
        help="force system to keep model in RAM rather than swapping or compressing",
        dest="use_mlock",
    )
    parser.add_argument(
        "--mtest",
        action="store_true",
        help="compute maximum memory usage",
        dest="mem_test",
    )
    parser.add_argument(
        "--verbose-prompt",
        action="store_true",
        help="print prompt before generation",
        dest="verbose_prompt",
    )

    # Custom args
    parser.add_argument(
        "--fix-prefix",
        type=str,
        default="",
        help="append to input when generated n_predict tokens",
        dest="fix_prefix",
    )
    parser.add_argument(
        "--input-noecho",
        action="store_false",
        help="don't echo the input",
        dest="input_echo",
    )

    parser.add_argument(
        "--interactive-start",
        action="store_true",
        help="run in interactive mode",
        dest="interactive",
    )

    args = parser.parse_args(argv)

    logit_bias_str = args.logit_bias_str
    delattr(args, "logit_bias_str")
    params = GptParams(**vars(args))

    if params.lora_adapter:
        params.use_mmap = False

    if logit_bias_str is not None:
        for i in logit_bias_str:
            # accept integer or fractional biases, e.g. "15043+1" or "15043-0.5"
            if m := re.match(r"(\d+)([-+]\d+(?:\.\d+)?)", i):
                params.logit_bias[int(m.group(1))] = float(m.group(2))

    return params


def gpt_random_prompt(rng):
    return [
        "So",
        "Once upon a time",
        "When",
        "The",
        "After",
        "If",
        "import",
        "He",
        "She",
        "They",
    ][rng % 10]


if __name__ == "__main__":
    print(gpt_params_parse())


================================================
FILE: examples/low_level_api/low_level_api_chat_cpp.py
================================================
"""
This is an example implementation of main.cpp from llama.cpp
Quirks:
 * It's not exactly alike since this port is designed around programmatic I/O
 * Input is always echoed if on, so it should be turned off when using "input()"
 * The first antiprompt should be the userprompt like "\nUser:",
   because it's added when n_predict is reached (aka generation ended prematurely)
 * n_predict can be set to -1 for unlimited length responses (or just a really high value)
 * Instruction mode adds its own antiprompt.
   You should also still be feeding the model with a "primer" prompt that 
   shows it the expected format.
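
As a rough illustration of the antiprompt quirk above, the sketch below checks
whether the most recent tokens end with any antiprompt token sequence. The
helper name `ends_with_antiprompt` and the token ids in the usage note are
hypothetical; this is not necessarily how the class below does the matching.

```python
from typing import Sequence


def ends_with_antiprompt(
    last_tokens: Sequence[int], antiprompts: Sequence[Sequence[int]]
) -> bool:
    # Return True when the tail of last_tokens equals any non-empty antiprompt.
    for anti in antiprompts:
        n = len(anti)
        # A slice longer than last_tokens returns the whole list, which then
        # differs in length from anti, so short histories never match.
        if n and list(last_tokens[-n:]) == list(anti):
            return True
    return False
```

For example, `ends_with_antiprompt([1, 2, 3, 4], [[3, 4]])` is true, while a
history that merely contains the sequence somewhere earlier does not match.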
"""

import ctypes
import sys
from time import time
from os import cpu_count, path

import llama_cpp
from common import GptParams, gpt_params_parse, gpt_random_prompt
import util


# A LLaMA interactive session
class LLaMAInteract:
    def __init__(self, params: GptParams) -> None:
        # input args
        self.params = params
        if self.params.path_session is None:
            self.params.path_session = ""
        if self.params.antiprompt is None:
            self.params.antiprompt = []

        if self.params.perplexity:
            raise NotImplementedError(
                """************
please use the 'perplexity' tool for perplexity calculations
************"""
            )

        if self.params.embedding:
            raise NotImplementedError(
                """************
please use the 'embedding' tool for embedding calculations
************"""
            )

        if self.params.n_ctx > 2048:
            print(
                f"""warning: model does not support \
context sizes greater than 2048 tokens ({self.params.n_ctx} \
specified) expect poor results""",
                file=sys.stderr,
            )

        if self.params.seed <= 0:
            self.params.seed = int(time())

        print(f"seed = {self.params.seed}", file=sys.stderr)

        if self.params.random_prompt:
            self.params.prompt = gpt_random_prompt(self.params.seed)

        # runtime args
        self.input_consumed = 0
        self.n_past = 0
        self.n_session_consumed = 0
        self.first_antiprompt = []
        self.remaining_tokens = self.params.n_predict
        self.output_echo = self.params.input_echo
        self.multibyte_fix = []

        # model load
        self.lparams = llama_cpp.llama_model_default_params()
        self.lparams.n_ctx = self.params.n_ctx
        self.lparams.n_parts = self.params.n_parts
        self.lparams.seed = self.params.seed
        self.lparams.memory_f16 = self.params.memory_f16
        self.lparams.use_mlock = self.params.use_mlock
        self.lparams.use_mmap = self.params.use_mmap

        self.model = llama_cpp.llama_load_model_from_file(
            self.params.model.encode("utf8"), self.lparams
        )

        # Context Params.
        self.cparams = llama_cpp.llama_context_default_params()

        self.ctx = llama_cpp.llama_new_context_with_model(self.model, self.cparams)
        if not self.ctx:
            raise RuntimeError(f"error: failed to load model '{self.params.model}'")

        if self.params.ignore_eos:
            self.params.logit_bias[llama_cpp.llama_token_eos()] = -float("inf")

        if len(self.params.lora_adapter) > 0:
            if (
                llama_cpp.llama_apply_lora_from_file(
                    self.ctx,
                    self.params.lora_adapter.encode("utf8"),
                    (
                        self.params.lora_base.encode("utf8")
                        if len(self.params.lora_base) > 0
                        else None
                    ),
                    self.params.n_threads,
                )
                != 0
            ):
                print("error: failed to apply lora adapter")
                return

        print(file=sys.stderr)
        print(
            f"system_info: n_threads = {self.params.n_threads} / {cpu_count()} \
| {llama_cpp.llama_print_system_info().decode('utf8')}",
            file=sys.stderr,
        )

        # determine the required inference memory per token:
        if self.params.mem_test:
            tmp = [0, 1, 2, 3]
            llama_cpp.llama_eval(
                self.ctx,
                (llama_cpp.c_int * len(tmp))(*tmp),
                len(tmp),
                0,
                self.params.n_threads,
            )
            llama_cpp.llama_print_timings(self.ctx)
            self.exit()
            return

        # create internal context
        self.n_ctx = llama_cpp.llama_n_ctx(self.ctx)

        # Add a space in front of the first character to match OG llama tokenizer behavior
        self.params.prompt = " " + self.params.prompt

        # Load prompt file
        if self.params.file:
            with open(self.params.file) as f:
                self.params.prompt = f.read()

        self.session_tokens: list[llama_cpp.llama_token] = []
        if len(self.params.path_session) > 0:
            print(
                f"attempting to load saved session from '{self.params.path_session}'",
                file=sys.stderr,
            )

            if path.exists(self.params.path_session):
                _session_tokens = (llama_cpp.llama_token * (self.params.n_ctx))()
                _n_token_count_out = llama_cpp.c_size_t()
                if (
                    llama_cpp.llama_load_session_file(
                        self.ctx,
                        self.params.path_session.encode("utf8"),
                        _session_tokens,
                        self.params.n_ctx,
                        ctypes.byref(_n_token_count_out),
                    )
                    != 1
                ):
                    print(
                        f"error: failed to load session file '{self.params.path_session}'",
                        file=sys.stderr,
                    )
                    return
                _n_token_count_out = _n_token_count_out.value
                self.session_tokens = _session_tokens[:_n_token_count_out]
                print(
                    f"loaded a session with prompt size of {_n_token_count_out} tokens",
                    file=sys.stderr,
                )
            else:
                print("session file does not exist, will create", file=sys.stderr)

        # tokenize the prompt
        self.embd = []
        self.embd_inp = self._tokenize(self.params.prompt)

        if len(self.embd_inp) > self.n_ctx - 4:
            raise RuntimeError(
                f"error: prompt is too long ({len(self.embd_inp)} tokens, max {self.n_ctx - 4})"
            )

        # debug message about similarity of saved session, if applicable
        self.n_matching_session_tokens = 0
        if len(self.session_tokens) > 0:
            for id in self.session_tokens:
                if (
                    self.n_matching_session_tokens >= len(self.embd_inp)
                    or id != self.embd_inp[self.n_matching_session_tokens]
                ):
                    break
                self.n_matching_session_tokens += 1

            if self.n_matching_session_tokens >= len(self.embd_inp):
                print("session file has exact match for prompt!")
            elif self.n_matching_session_tokens < (len(self.embd_inp) / 2):
                print(
                    f"warning: session file has low similarity to prompt ({self.n_matching_session_tokens} / {len(self.embd_inp)} tokens); will mostly be reevaluated"
                )
            else:
                print(
                    f"session file matches {self.n_matching_session_tokens} / {len(self.embd_inp)} tokens of prompt"
                )

        self.need_to_save_session = len(
            self.params.path_session
        ) > 0 and self.n_matching_session_tokens < (len(self.embd_inp) * 3 / 4)

        # number of tokens to keep when resetting context
        if (
            self.params.n_keep < 0
            or self.params.n_keep > len(self.embd_inp)
            or self.params.instruct
        ):
            self.params.n_keep = len(self.embd_inp)

        self.inp_prefix = self._tokenize(self.params.instruct_inp_prefix)
        self.inp_suffix = self._tokenize(self.params.instruct_inp_suffix, False)

        # in instruct mode, we inject a prefix and a suffix to each input by the user
        self.antiecho = None
        if self.params.instruct:
            self.params.interactive_start = True
            _ptn = self._tokenize(self.params.instruct_inp_prefix.strip(), False)
            self.first_antiprompt.append(_ptn)
            self.antiecho = util.IterSearch(_ptn)

        # enable interactive mode if reverse prompt or interactive start is specified
        if len(self.params.antiprompt) != 0 or self.params.interactive_start:
            self.params.interactive = True

        # determine newline token
        self.llama_token_newline = self._tokenize("\n", False)
        self.llama_token_eot = self._tokenize(" [end of text]\n", False)

        if self.params.verbose_prompt:
            print(
                f"""
prompt: '{self.params.prompt}'
number of tokens in prompt = {len(self.embd_inp)}""",
                file=sys.stderr,
            )

            for i in range(len(self.embd_inp)):
                print(
                    f"{self.embd_inp[i]} -> '{self.token_to_str(self.embd_inp[i])}'",
                    file=sys.stderr,
                )

            if self.params.n_keep > 0:
                print("static prompt based on n_keep: '")
                for i in range(self.params.n_keep):
                    print(self.token_to_str(self.embd_inp[i]), file=sys.stderr)
                print("'", file=sys.stderr)
            print(file=sys.stderr)

        if self.params.interactive:
            print("interactive mode on.", file=sys.stderr)

            if len(self.params.antiprompt) > 0:
                for antiprompt in self.params.antiprompt:
                    print(f"Reverse prompt: '{antiprompt}'", file=sys.stderr)

            if len(self.params.input_prefix) > 0:
                print(f"Input prefix: '{self.params.input_prefix}'", file=sys.stderr)

        print(
            f"""sampling: repeat_last_n = {self.params.repeat_last_n},\
repeat_penalty = {self.params.repeat_penalty},\
presence_penalty = {self.params.presence_penalty},\
frequency_penalty = {self.params.frequency_penalty},\
top_k = {self.params.top_k},\
tfs_z = {self.params.tfs_z},\
top_p = {self.params.top_p},\
typical_p = {self.params.typical_p},\
temp = {self.params.temp},\
mirostat = {self.params.mirostat},\
mirostat_lr = {self.params.mirostat_eta},\
mirostat_ent = {self.params.mirostat_tau},\

generate: n_ctx = {self.n_ctx},\
n_batch = {self.params.n_batch},\
n_predict = {self.params.n_predict},\
n_keep = {self.params.n_keep}

""",
            file=sys.stderr,
        )

        # determine antiprompt tokens
        for i in self.params.antiprompt:
            self.first_antiprompt.append(self._tokenize(i, False))

        self.last_n_tokens = [0] * self.n_ctx  # TODO: deque doesn't support slices

        if params.interactive:
            print(
                """== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\\'.

""",
                file=sys.stderr,
            )
        self.set_color(util.CONSOLE_COLOR_PROMPT)

    # tokenize a prompt
    def _tokenize(self, prompt, bos=True):
        _arr = (llama_cpp.llama_token * ((len(prompt) + 1) * 4))()
        _n = llama_cpp.llama_tokenize(
            self.model,
            prompt.encode("utf8", errors="ignore"),
            len(prompt.encode("utf8", errors="ignore")),  # length in bytes, not characters
            _arr,
            len(_arr),
            bos,
            False,
        )
        return _arr[:_n]

    def set_color(self, c):
        if self.params.use_color:
            print(c, end="")

    def use_antiprompt(self):
        return len(self.first_antiprompt) > 0

    # generate tokens
    def generate(self):
        while (
            self.remaining_tokens > 0
            or self.params.interactive
            or self.params.n_predict == -1
        ):
            # predict
            if len(self.embd) > 0:
                # infinite text generation via context swapping
                # if we run out of context:
                # - take the n_keep first tokens from the original prompt (via n_past)
                # - take half of the last (n_ctx - n_keep) tokens and recompute the logits in a batch
                if self.n_past + len(self.embd) > self.n_ctx:
                    n_left = self.n_past - self.params.n_keep
                    self.n_past = self.params.n_keep

                    # insert n_left/2 tokens at the start of embd from last_n_tokens
                    _insert = self.last_n_tokens[
                        self.n_ctx - int(n_left / 2) - len(self.embd) : -len(self.embd)
                    ]
                    self.embd = _insert + self.embd
                    self.params.path_session = ""

                # try to reuse a matching prefix from the loaded session instead of re-eval (via n_past)
                if self.n_session_consumed < len(self.session_tokens):
                    for i in range(len(self.embd)):
                        if self.embd[i] != self.session_tokens[self.n_session_consumed]:
                            self.session_tokens = self.session_tokens[
                                : self.n_session_consumed
                            ]
                            break

                        self.n_past += 1
                        self.n_session_consumed += 1

                        if self.n_session_consumed >= len(self.session_tokens):
                            i += 1
                            break

                    if i > 0:
                        self.embd = self.embd[i:]

                # evaluate tokens in batches
                # embd is typically prepared beforehand to fit within a batch, but not always
                # TODO BUG: The batching code causes nonsensical generation
                """for i in range(0, len(self.embd), self.params.n_batch):
					n_eval = self.params.n_batch
					_arr = (llama_cpp.llama_token * n_eval)(*self.embd[i:i + n_eval])
					if llama_cpp.llama_eval(self.ctx, _arr, n_eval, self.n_past, self.params.n_threads) != 0:
						print(f"failed to eval")
						return
					
					self.n_past += n_eval"""

                if (
                    llama_cpp.llama_eval(
                        self.ctx,
                        (llama_cpp.llama_token * len(self.embd))(*self.embd),
                        len(self.embd),
                        self.n_past,
                    )
                    != 0
                ):
                    raise Exception("Failed to llama_eval!")

                if len(self.embd) > 0 and len(self.params.path_session) > 0:
                    self.session_tokens.extend(self.embd)
                    self.n_session_consumed = len(self.session_tokens)

            self.n_past += len(self.embd)
            self.embd = []
            if len(self.embd_inp) <= self.input_consumed:  # && !is_interacting
                # out of user input, sample next token
                top_k = (
                    llama_cpp.llama_n_vocab(self.ctx)
                    if self.params.top_k <= 0
                    else self.params.top_k
                )
                repeat_last_n = (
                    self.n_ctx
                    if self.params.repeat_last_n < 0
                    else self.params.repeat_last_n
                )

                # optionally save the session on first sample (for faster prompt loading next time)
                if len(self.params.path_session) > 0 and self.need_to_save_session:
                    self.need_to_save_session = False
                    llama_cpp.llama_save_session_file(
                        self.ctx,
                        self.params.path_session.encode("utf8"),
                        (llama_cpp.llama_token * len(self.session_tokens))(
                            *self.session_tokens
                        ),
                        len(self.session_tokens),
                    )

                id = 0

                logits = llama_cpp.llama_get_logits(self.ctx)
                n_vocab = llama_cpp.llama_n_vocab(self.model)

                # Apply params.logit_bias map
                for key, value in self.params.logit_bias.items():
                    logits[key] += value

                _arr = (llama_cpp.llama_token_data * n_vocab)(
                    *[
                        llama_cpp.llama_token_data(token_id, logits[token_id], 0.0)
                        for token_id in range(n_vocab)
                    ]
                )
                candidates_p = llama_cpp.ctypes.pointer(
                    llama_cpp.llama_token_data_array(_arr, len(_arr), False)
                )

                # Apply penalties
                nl_logit = logits[llama_cpp.llama_token_nl(self.ctx)]
                last_n_repeat = min(len(self.last_n_tokens), repeat_last_n, self.n_ctx)

                _arr = (llama_cpp.llama_token * last_n_repeat)(
                    *self.last_n_tokens[len(self.last_n_tokens) - last_n_repeat :]
                )
                llama_cpp.llama_sample_repetition_penalties(
                    ctx=self.ctx,
                    candidates=candidates_p,
                    last_tokens_data=_arr,
                    penalty_last_n=last_n_repeat,
                    penalty_repeat=llama_cpp.c_float(self.params.repeat_penalty),
                    penalty_freq=llama_cpp.c_float(self.params.frequency_penalty),
                    penalty_present=llama_cpp.c_float(self.params.presence_penalty),
                )

                # NOT PRESENT IN CURRENT VERSION ?
                # llama_cpp.llama_sample_frequency_and_presence_penalti(self.ctx, candidates_p,
                # 	_arr,
                # 	last_n_repeat, llama_cpp.c_float(self.params.frequency_penalty), llama_cpp.c_float(self.params.presence_penalty))

                if not self.params.penalize_nl:
                    logits[llama_cpp.llama_token_nl(self.ctx)] = nl_logit

                if self.params.temp <= 0:
                    # Greedy sampling
                    id = llama_cpp.llama_sample_token_greedy(self.ctx, candidates_p)
                else:
                    if self.params.mirostat == 1:
                        mirostat_mu = 2.0 * self.params.mirostat_tau
                        mirostat_m = 100
                        llama_cpp.llama_sample_temperature(
                            self.ctx, candidates_p, llama_cpp.c_float(self.params.temp)
                        )
                        id = llama_cpp.llama_sample_token_mirostat(
                            self.ctx,
                            candidates_p,
                            llama_cpp.c_float(self.params.mirostat_tau),
                            llama_cpp.c_float(self.params.mirostat_eta),
                            llama_cpp.c_int(mirostat_m),
                            llama_cpp.c_float(mirostat_mu),
                        )
                    elif self.params.mirostat == 2:
                        mirostat_mu = 2.0 * self.params.mirostat_tau
                        llama_cpp.llama_sample_temperature(
                            self.ctx, candidates_p, llama_cpp.c_float(self.params.temp)
                        )
                        id = llama_cpp.llama_sample_token_mirostat_v2(
                            self.ctx,
                            candidates_p,
                            llama_cpp.c_float(self.params.mirostat_tau),
                            llama_cpp.c_float(self.params.mirostat_eta),
                            llama_cpp.c_float(mirostat_mu),
                        )
                    else:
                        # Temperature sampling
                        llama_cpp.llama_sample_top_k(
                            self.ctx,
                            candidates_p,
                            top_k,
                            min_keep=llama_cpp.c_size_t(1),
                        )
                        llama_cpp.llama_sample_tail_free(
                            self.ctx,
                            candidates_p,
                            llama_cpp.c_float(self.params.tfs_z),
                            min_keep=llama_cpp.c_size_t(1),
                        )
                        llama_cpp.llama_sample_typical(
                            self.ctx,
                            candidates_p,
                            llama_cpp.c_float(self.params.typical_p),
                            min_keep=llama_cpp.c_size_t(1),
                        )
                        llama_cpp.llama_sample_top_p(
                            self.ctx,
                            candidates_p,
                            llama_cpp.c_float(self.params.top_p),
                            min_keep=llama_cpp.c_size_t(1),
                        )
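
The context-swapping step in `generate` above (keep the first `n_keep` tokens, then re-evaluate half of the remaining history) can be sketched in isolation. This is a minimal illustration under stated assumptions, not the repository's code: the helper name `swap_context` is hypothetical, and it assumes `last_n_tokens` holds exactly `n_ctx` entries with the pending `embd` tokens already at its tail.

```python
from typing import List, Tuple


def swap_context(
    last_n_tokens: List[int],
    embd: List[int],
    n_past: int,
    n_keep: int,
    n_ctx: int,
) -> Tuple[List[int], int]:
    """Hypothetical helper mirroring the context swap in generate().

    Returns the token batch to evaluate next and the new n_past.
    """
    # Nothing to do while the pending batch still fits in the context window.
    if n_past + len(embd) <= n_ctx:
        return embd, n_past

    # Tokens evaluated so far beyond the protected prefix.
    n_left = n_past - n_keep
    # Re-evaluate only the most recent half of them.
    n_reuse = n_left // 2

    # Assumption: the pending tokens sit at the tail of last_n_tokens,
    # so the reusable history is the slice that ends right before them.
    end = len(last_n_tokens) - len(embd)
    start = max(0, end - n_reuse)  # clamp so the slice never wraps around
    return last_n_tokens[start:end] + embd, n_keep


history = list(range(10, 18))  # 8 token ids; the newest (17) is still pending
batch, n_past = swap_context(history, [17], n_past=8, n_keep=2, n_ctx=8)
print(batch, n_past)  # [14, 15, 16, 17] 2
```

With `n_ctx = 8` and `n_keep = 2`, six tokens lie beyond the protected prefix, so three of them are placed in front of the pending token and `n_past` is reset to `n_keep`. The clamp on `start` is an addition for this sketch; the original slice relies on the batch being small relative to the context.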

SYMBOL INDEX (750 symbols across 38 files)

FILE: docker/open_llama/hug_model.py
  function make_request (line 7) | def make_request(url, params=None):
  function check_magic_and_version (line 16) | def check_magic_and_version(filename):
  function download_file (line 29) | def download_file(url, destination):
  function get_user_choice (line 51) | def get_user_choice(model_list):
  function main (line 73) | def main():

FILE: examples/batch-processing/server.py
  function create_chat_completions (line 30) | def create_chat_completions():

FILE: examples/gradio_chat/local.py
  function predict (line 18) | def predict(message, history):

FILE: examples/gradio_chat/server.py
  function predict (line 10) | def predict(message, history):

FILE: examples/high_level_api/langchain_custom_llm.py
  class LlamaLLM (line 9) | class LlamaLLM(LLM):
    method _llm_type (line 14) | def _llm_type(self) -> str:
    method __init__ (line 17) | def __init__(self, model_path: str, **kwargs: Any):
    method _call (line 22) | def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
    method _identifying_params (line 27) | def _identifying_params(self) -> Mapping[str, Any]:

FILE: examples/low_level_api/Chat.py
  function env_or_def (line 7) | def env_or_def(env, default):

FILE: examples/low_level_api/Miku.py
  function env_or_def (line 7) | def env_or_def(env, default):

FILE: examples/low_level_api/ReasonAct.py
  function env_or_def (line 7) | def env_or_def(env, default):

FILE: examples/low_level_api/common.py
  class GptParams (line 12) | class GptParams:
  function gpt_params_parse (line 77) | def gpt_params_parse(argv=None):
  function gpt_random_prompt (line 389) | def gpt_random_prompt(rng):

FILE: examples/low_level_api/low_level_api_chat_cpp.py
  class LLaMAInteract (line 25) | class LLaMAInteract:
    method __init__ (line 26) | def __init__(self, params: GptParams) -> None:
    method _tokenize (line 314) | def _tokenize(self, prompt, bos=True):
    method set_color (line 327) | def set_color(self, c):
    method use_antiprompt (line 331) | def use_antiprompt(self):
    method generate (line 335) | def generate(self):
    method __enter__ (line 625) | def __enter__(self):
    method __exit__ (line 628) | def __exit__(self, type, value, tb):
    method exit (line 631) | def exit(self):
    method token_to_str (line 635) | def token_to_str(self, token_id: int) -> bytes:
    method past (line 645) | def past(self):
    method input (line 650) | def input(self, prompt: str):
    method output (line 661) | def output(self):
    method read_input (line 689) | def read_input(self):
    method interact (line 696) | def interact(self):

FILE: examples/low_level_api/quantize.py
  function main (line 6) | def main(args):

FILE: examples/low_level_api/util.py
  class IterSearch (line 13) | class IterSearch:
    method __init__ (line 14) | def __init__(self, pattern):
    method __call__ (line 18) | def __call__(self, char):
  class Circle (line 31) | class Circle:
    method __init__ (line 32) | def __init__(self, size, default=0):
    method append (line 38) | def append(self, elem):
    method __getitem__ (line 46) | def __getitem__(self, val):

FILE: examples/ray/llm.py
  class LlamaDeployment (line 9) | class LlamaDeployment:
    method __init__ (line 10) | def __init__(self, model_path: str):
    method __call__ (line 13) | async def __call__(self, http_request: Request) -> Dict:
  function llm_builder (line 20) | def llm_builder(args: Dict[str, str]) -> Application:

FILE: llama_cpp/_ctypes_extensions.py
  function load_shared_library (line 23) | def load_shared_library(lib_base_name: str, base_path: pathlib.Path):
  class CtypesRef (line 93) | class CtypesRef(Generic[CtypesCData]):
  function ctypes_function_for_shared_library (line 105) | def ctypes_function_for_shared_library(lib: ctypes.CDLL):
  function _byref (line 126) | def _byref(obj: CtypesCData, offset: Optional[int] = None) -> CtypesRef[...

FILE: llama_cpp/_internals.py
  class LlamaModel (line 31) | class LlamaModel:
    method __init__ (line 35) | def __init__(
    method close (line 77) | def close(self):
    method __del__ (line 85) | def __del__(self):
    method vocab_type (line 88) | def vocab_type(self) -> int:
    method n_vocab (line 91) | def n_vocab(self) -> int:
    method n_ctx_train (line 94) | def n_ctx_train(self) -> int:
    method n_embd (line 97) | def n_embd(self) -> int:
    method rope_freq_scale_train (line 100) | def rope_freq_scale_train(self) -> float:
    method desc (line 103) | def desc(self) -> str:
    method size (line 108) | def size(self) -> int:
    method n_params (line 111) | def n_params(self) -> int:
    method get_tensor (line 114) | def get_tensor(self, name: str) -> ctypes.c_void_p:
    method token_get_text (line 119) | def token_get_text(self, token: int) -> str:
    method token_get_score (line 122) | def token_get_score(self, token: int) -> float:
    method token_get_attr (line 125) | def token_get_attr(self, token: int) -> int:
    method token_bos (line 130) | def token_bos(self) -> int:
    method token_eos (line 133) | def token_eos(self) -> int:
    method token_cls (line 136) | def token_cls(self) -> int:
    method token_sep (line 139) | def token_sep(self) -> int:
    method token_nl (line 142) | def token_nl(self) -> int:
    method token_prefix (line 145) | def token_prefix(self) -> int:
    method token_middle (line 148) | def token_middle(self) -> int:
    method token_suffix (line 151) | def token_suffix(self) -> int:
    method token_eot (line 154) | def token_eot(self) -> int:
    method add_bos_token (line 157) | def add_bos_token(self) -> bool:
    method add_eos_token (line 160) | def add_eos_token(self) -> bool:
    method tokenize (line 165) | def tokenize(self, text: bytes, add_bos: bool, special: bool):
    method token_to_piece (line 183) | def token_to_piece(self, token: int, special: bool = False) -> bytes:
    method detokenize (line 188) | def detokenize(self, tokens: List[int], special: bool = False) -> bytes:
    method metadata (line 207) | def metadata(self) -> Dict[str, str]:
    method default_params (line 239) | def default_params():
  class LlamaContext (line 244) | class LlamaContext:
    method __init__ (line 248) | def __init__(
    method close (line 277) | def close(self):
    method __del__ (line 280) | def __del__(self):
    method n_ctx (line 283) | def n_ctx(self) -> int:
    method pooling_type (line 286) | def pooling_type(self) -> int:
    method kv_cache_clear (line 289) | def kv_cache_clear(self):
    method kv_cache_seq_rm (line 293) | def kv_cache_seq_rm(self, seq_id: int, p0: int, p1: int):
    method kv_cache_seq_cp (line 298) | def kv_cache_seq_cp(self, seq_id_src: int, seq_id_dst: int, p0: int, p...
    method kv_cache_seq_keep (line 302) | def kv_cache_seq_keep(self, seq_id: int):
    method kv_cache_seq_shift (line 306) | def kv_cache_seq_shift(self, seq_id: int, p0: int, p1: int, shift: int):
    method get_state_size (line 310) | def get_state_size(self) -> int:
    method decode (line 321) | def decode(self, batch: LlamaBatch):
    method encode (line 329) | def encode(self, batch: LlamaBatch):
    method set_n_threads (line 337) | def set_n_threads(self, n_threads: int, n_threads_batch: int):
    method get_logits (line 340) | def get_logits(self):
    method get_logits_ith (line 343) | def get_logits_ith(self, i: int):
    method get_embeddings (line 346) | def get_embeddings(self):
    method get_embeddings_ith (line 349) | def get_embeddings_ith(self, i: int):
    method get_embeddings_seq (line 352) | def get_embeddings_seq(self, seq_id: int):
    method set_rng_seed (line 357) | def set_rng_seed(self, seed: int):
    method sample_repetition_penalties (line 360) | def sample_repetition_penalties(
    method sample_softmax (line 371) | def sample_softmax(self, candidates: "_LlamaTokenDataArray"):
    method sample_top_k (line 374) | def sample_top_k(self, candidates: "_LlamaTokenDataArray", k: int, min...
    method sample_top_p (line 377) | def sample_top_p(self, candidates: "_LlamaTokenDataArray", p: float, m...
    method sample_min_p (line 380) | def sample_min_p(self, candidates: "_LlamaTokenDataArray", p: float, m...
    method sample_typical (line 383) | def sample_typical(
    method sample_temp (line 388) | def sample_temp(self, candidates: "_LlamaTokenDataArray", temp: float):
    method sample_grammar (line 391) | def sample_grammar(self, candidates: "_LlamaTokenDataArray", grammar: ...
    method sample_token_mirostat (line 394) | def sample_token_mirostat(
    method sample_token_mirostat_v2 (line 404) | def sample_token_mirostat_v2(
    method sample_token_greedy (line 413) | def sample_token_greedy(self, candidates: "_LlamaTokenDataArray") -> int:
    method sample_token (line 416) | def sample_token(self, candidates: "_LlamaTokenDataArray") -> int:
    method grammar_accept_token (line 420) | def grammar_accept_token(self, grammar: LlamaGrammar, token: int):
    method reset_timings (line 423) | def reset_timings(self):
    method print_timings (line 426) | def print_timings(self):
    method default_params (line 431) | def default_params():
  class LlamaBatch (line 436) | class LlamaBatch:
    method __init__ (line 437) | def __init__(
    method close (line 462) | def close(self):
    method __del__ (line 465) | def __del__(self):
    method n_tokens (line 468) | def n_tokens(self) -> int:
    method reset (line 471) | def reset(self):
    method set_batch (line 474) | def set_batch(self, batch: Sequence[int], n_past: int, logits_all: bool):
    method add_sequence (line 485) | def add_sequence(self, batch: Sequence[int], seq_id: int, logits_all: ...
  class LlamaTokenDataArray (line 499) | class LlamaTokenDataArray:
    method __init__ (line 500) | def __init__(self, *, n_vocab: int):
    method copy_logits (line 517) | def copy_logits(self, logits: npt.NDArray[np.single]):
  function normalize_embedding (line 528) | def normalize_embedding(embedding):
  class LlamaSamplingParams (line 539) | class LlamaSamplingParams:
  class LlamaSamplingContext (line 566) | class LlamaSamplingContext:
    method reset (line 574) | def reset(self):
    method cp (line 580) | def cp(self):
    method last (line 589) | def last(self) -> Optional[int]:
    method prev_str (line 595) | def prev_str(self, ctx_main: LlamaContext, n: int) -> str:
    method sample (line 598) | def sample(
    method accept (line 607) | def accept(self, ctx_main: LlamaContext, id: int, apply_grammar: bool):
  class CustomSampler (line 611) | class CustomSampler:
    method __init__ (line 612) | def __init__(
    method get_sampler (line 640) | def get_sampler(self) -> llama_cpp.llama_sampler_p:
  class LlamaSampler (line 644) | class LlamaSampler:
    method __init__ (line 645) | def __init__(self):
    method close (line 661) | def close(self):
    method __del__ (line 664) | def __del__(self):
    method add_greedy (line 667) | def add_greedy(self):
    method add_dist (line 671) | def add_dist(self, seed: int):
    method add_softmax (line 675) | def add_softmax(self):
    method add_top_k (line 679) | def add_top_k(self, k: int):
    method add_top_p (line 683) | def add_top_p(self, p: float, min_keep: int = 1):
    method add_min_p (line 687) | def add_min_p(self, p: float, min_keep: int = 1):
    method add_typical (line 691) | def add_typical(self, p: float, min_keep: int = 1):
    method add_temp (line 695) | def add_temp(self, temp: float):
    method add_temp_ext (line 699) | def add_temp_ext(self, t: float, delta: float, exponent: float):
    method add_xtc (line 703) | def add_xtc(self, p: float, t: float, min_keep: int, seed: int):
    method add_top_n_sigma (line 707) | def add_top_n_sigma(self, n: float):
    method add_mirostat (line 711) | def add_mirostat(self, n_vocab: int, seed: int, tau: float, eta: float...
    method add_mirostat_v2 (line 715) | def add_mirostat_v2(self, seed: int, tau: float, eta: float):
    method add_grammar (line 719) | def add_grammar(self, model: LlamaModel, grammar: LlamaGrammar):
    method add_grammar_lazy_patterns (line 725) | def add_grammar_lazy_patterns(
    method add_penalties (line 751) | def add_penalties(
    method add_dry (line 766) | def add_dry(
    method add_logit_bias (line 793) | def add_logit_bias(
    method add_infill (line 811) | def add_infill(self, model: LlamaModel):
    method add_custom (line 815) | def add_custom(
    method get_seed (line 826) | def get_seed(self) -> int:
    method sample (line 829) | def sample(self, ctx: LlamaContext, idx: int = -1) -> int:
    method accept (line 832) | def accept(self, token: int):
    method reset (line 835) | def reset(self):
    method clone (line 838) | def clone(self):

FILE: llama_cpp/_logger.py
  function llama_log_callback (line 30) | def llama_log_callback(
  function set_verbose (line 46) | def set_verbose(verbose: bool):

FILE: llama_cpp/_utils.py
  class suppress_stdout_stderr (line 14) | class suppress_stdout_stderr(object):
    method __init__ (line 20) | def __init__(self, disable: bool = True):
    method __enter__ (line 24) | def __enter__(self):
    method __exit__ (line 44) | def __exit__(self, *_):
  class MetaSingleton (line 59) | class MetaSingleton(type):
    method __call__ (line 66) | def __call__(cls, *args: Any, **kwargs: Any) -> Any:
  class Singleton (line 72) | class Singleton(object, metaclass=MetaSingleton):
    method __init__ (line 77) | def __init__(self):

FILE: llama_cpp/llama.py
  class Llama (line 55) | class Llama:
    method __init__ (line 60) | def __init__(
    method ctx (line 550) | def ctx(self) -> llama_cpp.llama_context_p:
    method model (line 554) | def model(self) -> llama_cpp.llama_model_p:
    method _input_ids (line 558) | def _input_ids(self) -> npt.NDArray[np.intc]:
    method _scores (line 562) | def _scores(self) -> npt.NDArray[np.single]:
    method eval_tokens (line 566) | def eval_tokens(self) -> Deque[int]:
    method eval_logits (line 570) | def eval_logits(self) -> Deque[List[float]]:
    method tokenize (line 576) | def tokenize(
    method detokenize (line 594) | def detokenize(
    method set_cache (line 614) | def set_cache(self, cache: Optional[BaseLlamaCache]):
    method set_seed (line 622) | def set_seed(self, seed: int):
    method reset (line 630) | def reset(self):
    method eval (line 634) | def eval(self, tokens: Sequence[int]):
    method _init_sampler (line 671) | def _init_sampler(
    method sample (line 760) | def sample(
    method generate (line 822) | def generate(
    method create_embedding (line 962) | def create_embedding(
    method embed (line 1002) | def embed(
    method _create_completion (line 1123) | def _create_completion(
    method create_completion (line 1743) | def create_completion(
    method __call__ (line 1840) | def __call__(
    method create_chat_completion (line 1932) | def create_chat_completion(
    method create_chat_completion_openai_v1 (line 2035) | def create_chat_completion_openai_v1(
    method __getstate__ (line 2068) | def __getstate__(self):
    method __setstate__ (line 2124) | def __setstate__(self, state):
    method save_state (line 2127) | def save_state(self) -> LlamaState:
    method load_state (line 2157) | def load_state(self, state: LlamaState) -> None:
    method n_ctx (line 2172) | def n_ctx(self) -> int:
    method n_embd (line 2176) | def n_embd(self) -> int:
    method n_vocab (line 2180) | def n_vocab(self) -> int:
    method tokenizer (line 2184) | def tokenizer(self) -> LlamaTokenizer:
    method token_eos (line 2188) | def token_eos(self) -> int:
    method token_bos (line 2192) | def token_bos(self) -> int:
    method token_nl (line 2196) | def token_nl(self) -> int:
    method pooling_type (line 2200) | def pooling_type(self) -> str:
    method close (line 2204) | def close(self) -> None:
    method __del__ (line 2208) | def __del__(self) -> None:
    method logits_to_logprobs (line 2212) | def logits_to_logprobs(
    method longest_token_prefix (line 2230) | def longest_token_prefix(a: Sequence[int], b: Sequence[int]):
    method from_pretrained (line 2240) | def from_pretrained(
  class LlamaState (line 2367) | class LlamaState:
    method __init__ (line 2368) | def __init__(
  class LogitsProcessorList (line 2390) | class LogitsProcessorList(List[LogitsProcessor]):
    method __call__ (line 2391) | def __call__(
  class StoppingCriteriaList (line 2402) | class StoppingCriteriaList(List[StoppingCriteria]):
    method __call__ (line 2403) | def __call__(
  class MinTokensLogitsProcessor (line 2409) | class MinTokensLogitsProcessor(LogitsProcessor):
    method __init__ (line 2410) | def __init__(self, min_tokens: int, token_eos: int):
    method __call__ (line 2415) | def __call__(

FILE: llama_cpp/llama_cache.py
  class BaseLlamaCache (line 17) | class BaseLlamaCache(ABC):
    method __init__ (line 20) | def __init__(self, capacity_bytes: int = (2 << 30)):
    method cache_size (line 25) | def cache_size(self) -> int:
    method _find_longest_prefix_key (line 28) | def _find_longest_prefix_key(
    method __getitem__ (line 35) | def __getitem__(self, key: Sequence[int]) -> "llama_cpp.llama.LlamaSta...
    method __contains__ (line 39) | def __contains__(self, key: Sequence[int]) -> bool:
    method __setitem__ (line 43) | def __setitem__(
  class LlamaRAMCache (line 49) | class LlamaRAMCache(BaseLlamaCache):
    method __init__ (line 52) | def __init__(self, capacity_bytes: int = (2 << 30)):
    method cache_size (line 60) | def cache_size(self):
    method _find_longest_prefix_key (line 63) | def _find_longest_prefix_key(
    method __getitem__ (line 79) | def __getitem__(self, key: Sequence[int]) -> "llama_cpp.llama.LlamaSta...
    method __contains__ (line 88) | def __contains__(self, key: Sequence[int]) -> bool:
    method __setitem__ (line 91) | def __setitem__(self, key: Sequence[int], value: "llama_cpp.llama.Llam...
  class LlamaDiskCache (line 104) | class LlamaDiskCache(BaseLlamaCache):
    method __init__ (line 107) | def __init__(
    method cache_size (line 114) | def cache_size(self):
    method _find_longest_prefix_key (line 117) | def _find_longest_prefix_key(
    method __getitem__ (line 130) | def __getitem__(self, key: Sequence[int]) -> "llama_cpp.llama.LlamaSta...
    method __contains__ (line 141) | def __contains__(self, key: Sequence[int]) -> bool:
    method __setitem__ (line 144) | def __setitem__(self, key: Sequence[int], value: "llama_cpp.llama.Llam...

FILE: llama_cpp/llama_chat_format.py
  class LlamaChatCompletionHandler (line 61) | class LlamaChatCompletionHandler(Protocol):
    method __call__ (line 68) | def __call__(
  class LlamaChatCompletionHandlerNotFoundException (line 112) | class LlamaChatCompletionHandlerNotFoundException(Exception):
  class LlamaChatCompletionHandlerRegistry (line 116) | class LlamaChatCompletionHandlerRegistry(Singleton):
    method register_chat_completion_handler (line 119) | def register_chat_completion_handler(
    method unregister_chat_handler (line 131) | def unregister_chat_handler(self, name: str):
    method get_chat_completion_handler_by_name (line 137) | def get_chat_completion_handler_by_name(
  function get_chat_completion_handler (line 149) | def get_chat_completion_handler(name: str) -> LlamaChatCompletionHandler:
  function register_chat_completion_handler (line 155) | def register_chat_completion_handler(name: str):
  class ChatFormatterResponse (line 167) | class ChatFormatterResponse:
  class ChatFormatter (line 180) | class ChatFormatter(Protocol):
    method __call__ (line 186) | def __call__(
  class Jinja2ChatFormatter (line 194) | class Jinja2ChatFormatter(ChatFormatter):
    method __init__ (line 195) | def __init__(
    method strftime_now (line 219) | def strftime_now(f: str) -> str:
    method __call__ (line 222) | def __call__(
    method to_chat_handler (line 265) | def to_chat_handler(self) -> LlamaChatCompletionHandler:
  function _convert_text_completion_logprobs_to_chat (line 269) | def _convert_text_completion_logprobs_to_chat(
  function _convert_text_completion_to_chat (line 294) | def _convert_text_completion_to_chat(
  function _convert_text_completion_chunks_to_chat (line 318) | def _convert_text_completion_chunks_to_chat(
  function _convert_completion_to_chat (line 361) | def _convert_completion_to_chat(
  function _convert_completion_to_chat_function (line 378) | def _convert_completion_to_chat_function(
  function chat_formatter_to_chat_completion_handler (line 555) | def chat_formatter_to_chat_completion_handler(
  function hf_autotokenizer_to_chat_formatter (line 704) | def hf_autotokenizer_to_chat_formatter(
  function hf_autotokenizer_to_chat_completion_handler (line 729) | def hf_autotokenizer_to_chat_completion_handler(
  function hf_tokenizer_config_to_chat_formatter (line 736) | def hf_tokenizer_config_to_chat_formatter(
  function hf_tokenizer_config_to_chat_completion_handler (line 784) | def hf_tokenizer_config_to_chat_completion_handler(
  function guess_chat_format_from_gguf_metadata (line 794) | def guess_chat_format_from_gguf_metadata(metadata: Dict[str, str]) -> Op...
  function _get_system_message (line 817) | def _get_system_message(
  function _map_roles (line 827) | def _map_roles(
  function _format_llama2 (line 843) | def _format_llama2(
  function _format_add_colon_single (line 860) | def _format_add_colon_single(
  function _format_add_colon_two (line 873) | def _format_add_colon_two(
  function _format_no_colon_single (line 887) | def _format_no_colon_single(
  function _format_add_colon_space_single (line 900) | def _format_add_colon_space_single(
  function _format_chatml (line 913) | def _format_chatml(
  function _format_chatglm3 (line 926) | def _format_chatglm3(
  function _grammar_for_json (line 941) | def _grammar_for_json(verbose: bool = False):
  function _grammar_for_json_schema (line 947) | def _grammar_for_json_schema(
  function _grammar_for_response_format (line 959) | def _grammar_for_response_format(
  function register_chat_format (line 977) | def register_chat_format(name: str):
  function format_llama2 (line 991) | def format_llama2(
  function format_llama3 (line 1008) | def format_llama3(
  function format_alpaca (line 1025) | def format_alpaca(
  function format_qwen (line 1039) | def format_qwen(
  function format (line 1056) | def format(
  function format_oasst_llama (line 1072) | def format_oasst_llama(
  function format_baichuan2 (line 1088) | def format_baichuan2(
  function format_baichuan (line 1104) | def format_baichuan(
  function format_openbuddy (line 1120) | def format_openbuddy(
  function format_redpajama_incite (line 1142) | def format_redpajama_incite(
  function format_snoozy (line 1158) | def format_snoozy(
  function format_phind (line 1180) | def format_phind(
  function format_intel (line 1194) | def format_intel(
  function format_open_orca (line 1208) | def format_open_orca(
  function format_mistrallite (line 1235) | def format_mistrallite(
  function format_zephyr (line 1251) | def format_zephyr(
  function format_pygmalion (line 1268) | def format_pygmalion(
  function format_chatml (line 1284) | def format_chatml(
  function format_mistral_instruct (line 1301) | def format_mistral_instruct(
  function format_chatglm3 (line 1322) | def format_chatglm3(
  function format_openchat (line 1339) | def format_openchat(
  function format_saiga (line 1359) | def format_saiga(
  function format_gemma (line 1381) | def format_gemma(
  function functionary_chat_handler (line 1402) | def functionary_chat_handler(
  function functionary_v1_v2_chat_handler (line 1761) | def functionary_v1_v2_chat_handler(
  class Llava15ChatHandler (line 2659) | class Llava15ChatHandler:
    method __init__ (line 2699) | def __init__(self, clip_model_path: str, verbose: bool = True):
    method _init_mtmd_context (line 2711) | def _init_mtmd_context(self, llama_model: llama.Llama):
    method load_image (line 2746) | def load_image(self, image_url: str) -> bytes:
    method _create_bitmap_from_bytes (line 2749) | def _create_bitmap_from_bytes(self, image_bytes: bytes):
    method __call__ (line 2767) | def __call__(
    method _load_image (line 3031) | def _load_image(image_url: str) -> bytes:
    method get_image_urls (line 3044) | def get_image_urls(messages: List[llama_types.ChatCompletionRequestMes...
    method split_text_on_image_urls (line 3063) | def split_text_on_image_urls(text: str, image_urls: List[str]):
    method from_pretrained (line 3088) | def from_pretrained(
  class ObsidianChatHandler (line 3172) | class ObsidianChatHandler(Llava15ChatHandler):
  class MoondreamChatHandler (line 3228) | class MoondreamChatHandler(Llava15ChatHandler):
  class Llava16ChatHandler (line 3270) | class Llava16ChatHandler(Llava15ChatHandler):
  class NanoLlavaChatHandler (line 3318) | class NanoLlavaChatHandler(Llava15ChatHandler):
  class Llama3VisionAlphaChatHandler (line 3373) | class Llama3VisionAlphaChatHandler(Llava15ChatHandler):
  class MiniCPMv26ChatHandler (line 3426) | class MiniCPMv26ChatHandler(Llava15ChatHandler):
  class Qwen25VLChatHandler (line 3464) | class Qwen25VLChatHandler(Llava15ChatHandler):
    method __call__ (line 3497) | def __call__(self, **kwargs):
  function chatml_function_calling (line 3523) | def chatml_function_calling(

FILE: llama_cpp/llama_cpp.py
  class llama_token_data (line 481) | class llama_token_data(ctypes.Structure):
  class llama_token_data_array (line 512) | class llama_token_data_array(ctypes.Structure):
  class llama_batch (line 569) | class llama_batch(ctypes.Structure):
  class llama_model_kv_override_value (line 630) | class llama_model_kv_override_value(ctypes.Union):
  class llama_model_kv_override (line 645) | class llama_model_kv_override(ctypes.Structure):
  class llama_model_params (line 698) | class llama_model_params(ctypes.Structure):
  class llama_context_params (line 800) | class llama_context_params(ctypes.Structure):
  class llama_model_quantize_params (line 934) | class llama_model_quantize_params(ctypes.Structure):
  class llama_logit_bias (line 989) | class llama_logit_bias(ctypes.Structure):
  class llama_sampler_chain_params (line 1012) | class llama_sampler_chain_params(ctypes.Structure):
  class llama_chat_message (line 1031) | class llama_chat_message(ctypes.Structure):
  function llama_model_default_params (line 1051) | def llama_model_default_params() -> llama_model_params:
  function llama_context_default_params (line 1062) | def llama_context_default_params() -> llama_context_params:
  function llama_sampler_chain_default_params (line 1073) | def llama_sampler_chain_default_params() -> llama_sampler_chain_params:
  function llama_model_quantize_default_params (line 1084) | def llama_model_quantize_default_params() -> llama_model_quantize_params:
  function llama_backend_init (line 1098) | def llama_backend_init():
  function llama_backend_free (line 1128) | def llama_backend_free():
  function llama_numa_init (line 1140) | def llama_numa_init(numa: int, /):
  function llama_load_model_from_file (line 1165) | def llama_load_model_from_file(
  function llama_model_load_from_file (line 1182) | def llama_model_load_from_file(
  function llama_model_load_from_splits (line 1204) | def llama_model_load_from_splits(
  function llama_model_save_to_file (line 1221) | def llama_model_save_to_file(model: llama_model_p, path_model: bytes, /):
  function llama_free_model (line 1233) | def llama_free_model(model: llama_model_p, /):
  function llama_model_free (line 1243) | def llama_model_free(model: llama_model_p, /):
  function llama_init_from_model (line 1255) | def llama_init_from_model(
  function llama_new_context_with_model (line 1270) | def llama_new_context_with_model(
  function llama_free (line 1283) | def llama_free(ctx: llama_context_p, /):
  function llama_time_us (line 1294) | def llama_time_us() -> int:
  function llama_max_devices (line 1300) | def llama_max_devices() -> int:
  function llama_max_parallel_sequences (line 1306) | def llama_max_parallel_sequences() -> int:
  function llama_supports_mmap (line 1312) | def llama_supports_mmap() -> bool:
  function llama_supports_mlock (line 1318) | def llama_supports_mlock() -> bool:
  function llama_supports_gpu_offload (line 1324) | def llama_supports_gpu_offload() -> bool:
  function llama_supports_rpc (line 1330) | def llama_supports_rpc() -> bool:
  function llama_n_ctx (line 1336) | def llama_n_ctx(ctx: llama_context_p, /) -> int:
  function llama_n_batch (line 1342) | def llama_n_batch(ctx: llama_context_p, /) -> int:
  function llama_n_ubatch (line 1348) | def llama_n_ubatch(ctx: llama_context_p, /) -> int:
  function llama_n_seq_max (line 1354) | def llama_n_seq_max(ctx: llama_context_p, /) -> int:
  function llama_n_ctx_train (line 1360) | def llama_n_ctx_train(model: llama_model_p, /) -> int:
  function llama_n_embd (line 1366) | def llama_n_embd(model: llama_model_p, /) -> int:
  function llama_n_layer (line 1372) | def llama_n_layer(model: llama_model_p, /) -> int:
  function llama_n_head (line 1378) | def llama_n_head(model: llama_model_p, /) -> int:
  function llama_n_vocab (line 1384) | def llama_n_vocab(model: llama_vocab_p, /) -> int:
  function llama_get_model (line 1390) | def llama_get_model(ctx: llama_context_p, /) -> Optional[llama_model_p]:
  function llama_get_memory (line 1396) | def llama_get_memory(ctx: llama_context_p, /) -> Optional[llama_memory_t]:
  function llama_pooling_type (line 1403) | def llama_pooling_type(ctx: llama_context_p, /) -> int:
  function llama_get_kv_self (line 1413) | def llama_get_kv_self(ctx: llama_context_p, /) -> Optional[llama_kv_cach...
  function llama_model_get_vocab (line 1420) | def llama_model_get_vocab(model: llama_model_p, /) -> Optional[llama_voc...
  function llama_model_rope_type (line 1426) | def llama_model_rope_type(model: llama_model_p, /) -> int:
  function llama_model_n_ctx_train (line 1432) | def llama_model_n_ctx_train(model: llama_model_p, /) -> int:
  function llama_model_n_embd (line 1438) | def llama_model_n_embd(model: llama_model_p, /) -> int:
  function llama_model_n_layer (line 1444) | def llama_model_n_layer(model: llama_model_p, /) -> int:
  function llama_model_n_head (line 1450) | def llama_model_n_head(model: llama_model_p, /) -> int:
  function llama_model_n_head_kv (line 1456) | def llama_model_n_head_kv(model: llama_model_p, /) -> int:
  function llama_model_n_swa (line 1462) | def llama_model_n_swa(model: llama_model_p, /) -> int:
  function llama_model_rope_freq_scale_train (line 1469) | def llama_model_rope_freq_scale_train(model: llama_model_p, /) -> float:
  function llama_model_n_cls_out (line 1477) | def llama_model_n_cls_out(model: llama_model_p, /) -> int:
  function llama_model_cls_label (line 1485) | def llama_model_cls_label(model: llama_model_p, i: int, /) -> Optional[b...
  function llama_vocab_type (line 1492) | def llama_vocab_type(vocab: llama_vocab_p, /) -> int:
  function llama_vocab_n_tokens (line 1498) | def llama_vocab_n_tokens(vocab: llama_vocab_p, /) -> int:
  function llama_model_meta_val_str (line 1521) | def llama_model_meta_val_str(
  function llama_model_meta_count (line 1535) | def llama_model_meta_count(model: llama_model_p, /) -> int:
  function llama_model_meta_key_by_index (line 1552) | def llama_model_meta_key_by_index(
  function llama_model_meta_val_str_by_index (line 1575) | def llama_model_meta_val_str_by_index(
  function llama_model_desc (line 1593) | def llama_model_desc(
  function llama_model_size (line 1606) | def llama_model_size(model: llama_model_p, /) -> int:
  function llama_model_chat_template (line 1615) | def llama_model_chat_template(model: llama_model_p, name: Optional[bytes...
  function llama_model_n_params (line 1624) | def llama_model_n_params(model: llama_model_p, /) -> int:
  function llama_model_has_encoder (line 1632) | def llama_model_has_encoder(model: llama_model_p, /) -> bool:
  function llama_model_has_decoder (line 1640) | def llama_model_has_decoder(model: llama_model_p, /) -> bool:
  function llama_model_decoder_start_token (line 1651) | def llama_model_decoder_start_token(model: llama_model_p, /) -> int:
  function llama_model_is_recurrent (line 1661) | def llama_model_is_recurrent(model: llama_model_p, /) -> bool:
  function llama_model_is_diffusion (line 1669) | def llama_model_is_diffusion(model: llama_model_p, /) -> bool:
  function llama_model_quantize (line 1688) | def llama_model_quantize(
  function llama_adapter_lora_init (line 1711) | def llama_adapter_lora_init(
  function llama_adapter_lora_free (line 1725) | def llama_adapter_lora_free(adapter: llama_adapter_lora_p, /):
  function llama_set_adapter_lora (line 1743) | def llama_set_adapter_lora(
  function llama_rm_adapter_lora (line 1761) | def llama_rm_adapter_lora(
  function llama_clear_adapter_lora (line 1776) | def llama_clear_adapter_lora(ctx: llama_context_p, /):
  function llama_apply_adapter_cvec (line 1806) | def llama_apply_adapter_cvec(
  function llama_memory_clear (line 1838) | def llama_memory_clear(mem: llama_memory_t, data: bool, /):
  function llama_memory_seq_rm (line 1864) | def llama_memory_seq_rm(
  function llama_memory_seq_cp (line 1901) | def llama_memory_seq_cp(
  function llama_memory_seq_keep (line 1922) | def llama_memory_seq_keep(mem: llama_memory_t, seq_id: Union[llama_seq_i...
  function llama_memory_seq_add (line 1947) | def llama_memory_seq_add(
  function llama_memory_seq_div (line 1981) | def llama_memory_seq_div(
  function llama_memory_seq_pos_min (line 2005) | def llama_memory_seq_pos_min(
  function llama_memory_seq_pos_max (line 2023) | def llama_memory_seq_pos_max(
  function llama_memory_can_shift (line 2034) | def llama_memory_can_shift(mem: llama_memory_t, /) -> bool:
  function llama_kv_self_n_tokens (line 2050) | def llama_kv_self_n_tokens(ctx: llama_context_p, /) -> int:
  function llama_kv_self_used_cells (line 2061) | def llama_kv_self_used_cells(ctx: llama_context_p, /) -> int:
  function llama_kv_self_clear (line 2073) | def llama_kv_self_clear(ctx: llama_context_p, /):
  function llama_kv_self_seq_rm (line 2099) | def llama_kv_self_seq_rm(
  function llama_kv_self_seq_cp (line 2132) | def llama_kv_self_seq_cp(
  function llama_kv_self_seq_keep (line 2152) | def llama_kv_self_seq_keep(ctx: llama_context_p, seq_id: Union[llama_seq...
  function llama_kv_self_seq_add (line 2180) | def llama_kv_self_seq_add(
  function llama_kv_self_seq_div (line 2215) | def llama_kv_self_seq_div(
  function llama_kv_self_seq_pos_min (line 2238) | def llama_kv_self_seq_pos_min(
  function llama_kv_self_seq_pos_max (line 2255) | def llama_kv_self_seq_pos_max(
  function llama_kv_self_defrag (line 2268) | def llama_kv_self_defrag(ctx: llama_context_p, /):
  function llama_kv_self_can_shift (line 2277) | def llama_kv_self_can_shift(ctx: llama_context_p, /) -> bool:
  function llama_kv_self_update (line 2286) | def llama_kv_self_update(ctx: llama_context_p, /):
  function llama_state_get_size (line 2300) | def llama_state_get_size(ctx: llama_context_p, /) -> int:
  function llama_get_state_size (line 2308) | def llama_get_state_size(ctx: llama_context_p, /) -> int:
  function llama_state_get_data (line 2329) | def llama_state_get_data(
  function llama_copy_state_data (line 2353) | def llama_copy_state_data(
  function llama_state_set_data (line 2371) | def llama_state_set_data(
  function llama_set_state_data (line 2391) | def llama_set_state_data(
  function llama_state_load_file (line 2416) | def llama_state_load_file(
  function llama_load_session_file (line 2445) | def llama_load_session_file(
  function llama_state_save_file (line 2471) | def llama_state_save_file(
  function llama_save_session_file (line 2497) | def llama_save_session_file(
  function llama_state_seq_get_size (line 2516) | def llama_state_seq_get_size(ctx: llama_context_p, seq_id: llama_seq_id,...
  function llama_state_seq_get_data (line 2537) | def llama_state_seq_get_data(
  function llama_state_seq_set_data (line 2567) | def llama_state_seq_set_data(
  function llama_state_seq_save_file (line 2595) | def llama_state_seq_save_file(
  function llama_state_seq_load_file (line 2625) | def llama_state_seq_load_file(
  function llama_batch_get_one (line 2658) | def llama_batch_get_one(
  function llama_batch_init (line 2684) | def llama_batch_init(
  function llama_batch_free (line 2703) | def llama_batch_free(batch: llama_batch, /):
  function llama_encode (line 2718) | def llama_encode(ctx: llama_context_p, batch: llama_batch, /) -> int:
  function llama_decode (line 2741) | def llama_decode(ctx: llama_context_p, batch: llama_batch, /) -> int:
  function llama_set_n_threads (line 2764) | def llama_set_n_threads(
  function llama_n_threads (line 2780) | def llama_n_threads(ctx: llama_context_p, /) -> int:
  function llama_n_threads_batch (line 2788) | def llama_n_threads_batch(ctx: llama_context_p, /) -> int:
  function llama_set_embeddings (line 2797) | def llama_set_embeddings(ctx: llama_context_p, embeddings: bool, /):
  function llama_set_causal_attn (line 2806) | def llama_set_causal_attn(ctx: llama_context_p, causal_attn: bool, /):
  function llama_set_warmup (line 2816) | def llama_set_warmup(ctx: llama_context_p, warmup: bool, /):
  function llama_set_abort_callback (line 2829) | def llama_set_abort_callback(
  function llama_synchronize (line 2844) | def llama_synchronize(ctx: llama_context_p, /):
  function llama_get_logits (line 2861) | def llama_get_logits(ctx: llama_context_p, /) -> CtypesArray[ctypes.c_fl...
  function llama_get_logits_ith (line 2883) | def llama_get_logits_ith(
  function llama_get_embeddings (line 2902) | def llama_get_embeddings(ctx: llama_context_p, /) -> CtypesArray[ctypes....
  function llama_get_embeddings_ith (line 2919) | def llama_get_embeddings_ith(
  function llama_get_embeddings_seq (line 2937) | def llama_get_embeddings_seq(
  function llama_vocab_get_text (line 2954) | def llama_vocab_get_text(
  function llama_vocab_get_score (line 2964) | def llama_vocab_get_score(
  function llama_vocab_get_attr (line 2974) | def llama_vocab_get_attr(
  function llama_vocab_is_eog (line 2985) | def llama_vocab_is_eog(vocab: llama_vocab_p, token: Union[llama_token, i...
  function llama_vocab_is_control (line 2995) | def llama_vocab_is_control(
  function llama_vocab_bos (line 3005) | def llama_vocab_bos(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_eos (line 3012) | def llama_vocab_eos(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_eot (line 3019) | def llama_vocab_eot(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_sep (line 3026) | def llama_vocab_sep(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_nl (line 3033) | def llama_vocab_nl(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_pad (line 3040) | def llama_vocab_pad(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_mask (line 3047) | def llama_vocab_mask(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_get_add_bos (line 3058) | def llama_vocab_get_add_bos(vocab: llama_vocab_p, /) -> bool:
  function llama_vocab_get_add_eos (line 3068) | def llama_vocab_get_add_eos(vocab: llama_vocab_p, /) -> bool:
  function llama_vocab_get_add_sep (line 3078) | def llama_vocab_get_add_sep(vocab: llama_vocab_p, /) -> bool:
  function llama_vocab_fim_pre (line 3088) | def llama_vocab_fim_pre(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_fim_suf (line 3098) | def llama_vocab_fim_suf(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_fim_mid (line 3108) | def llama_vocab_fim_mid(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_fim_pad (line 3118) | def llama_vocab_fim_pad(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_fim_rep (line 3128) | def llama_vocab_fim_rep(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_fim_sep (line 3138) | def llama_vocab_fim_sep(vocab: llama_vocab_p, /) -> llama_token:
  function llama_token_get_text (line 3149) | def llama_token_get_text(
  function llama_token_get_score (line 3161) | def llama_token_get_score(
  function llama_token_get_attr (line 3172) | def llama_token_get_attr(
  function llama_token_is_eog (line 3183) | def llama_token_is_eog(
  function llama_token_is_control (line 3194) | def llama_token_is_control(
  function llama_token_bos (line 3205) | def llama_token_bos(vocab: llama_vocab_p, /) -> int:
  function llama_token_eos (line 3214) | def llama_token_eos(vocab: llama_vocab_p, /) -> int:
  function llama_token_eot (line 3223) | def llama_token_eot(vocab: llama_vocab_p, /) -> int:
  function llama_token_cls (line 3232) | def llama_token_cls(vocab: llama_vocab_p, /) -> int:
  function llama_token_sep (line 3241) | def llama_token_sep(vocab: llama_vocab_p, /) -> int:
  function llama_token_nl (line 3251) | def llama_token_nl(vocab: llama_vocab_p, /) -> int:
  function llama_token_pad (line 3261) | def llama_token_pad(vocab: llama_vocab_p, /) -> int:
  function llama_add_bos_token (line 3271) | def llama_add_bos_token(vocab: llama_vocab_p, /) -> bool:
  function llama_add_eos_token (line 3280) | def llama_add_eos_token(vocab: llama_vocab_p, /) -> bool:
  function llama_token_fim_pre (line 3290) | def llama_token_fim_pre(vocab: llama_vocab_p, /) -> llama_token:
  function llama_token_fim_suf (line 3299) | def llama_token_fim_suf(vocab: llama_vocab_p, /) -> llama_token:
  function llama_token_fim_mid (line 3308) | def llama_token_fim_mid(vocab: llama_vocab_p, /) -> llama_token:
  function llama_token_fim_pad (line 3317) | def llama_token_fim_pad(vocab: llama_vocab_p, /) -> llama_token:
  function llama_token_fim_rep (line 3326) | def llama_token_fim_rep(vocab: llama_vocab_p, /) -> llama_token:
  function llama_token_fim_sep (line 3335) | def llama_token_fim_sep(vocab: llama_vocab_p, /) -> llama_token:
  function llama_vocab_cls (line 3346) | def llama_vocab_cls(vocab: llama_vocab_p, /) -> llama_token:
  function llama_tokenize (line 3385) | def llama_tokenize(
  function llama_token_to_piece (line 3437) | def llama_token_to_piece(
  function llama_detokenize (line 3488) | def llama_detokenize(
  function llama_chat_apply_template (line 3544) | def llama_chat_apply_template(
  function llama_chat_builtin_templates (line 3579) | def llama_chat_builtin_templates(
  class llama_sampler_i (line 3617) | class llama_sampler_i(ctypes.Structure):
  class llama_sampler (line 3625) | class llama_sampler(ctypes.Structure):
  function llama_sampler_init (line 3663) | def llama_sampler_init(
  function llama_sampler_name (line 3675) | def llama_sampler_name(smpl: llama_sampler_p, /) -> bytes:
  function llama_sampler_accept (line 3685) | def llama_sampler_accept(smpl: llama_sampler_p, token: Union[llama_token...
  function llama_sampler_apply (line 3695) | def llama_sampler_apply(
  function llama_sampler_reset (line 3707) | def llama_sampler_reset(smpl: llama_sampler_p, /):
  function llama_sampler_clone (line 3717) | def llama_sampler_clone(smpl: llama_sampler_p, /) -> llama_sampler_p:
  function llama_sampler_free (line 3728) | def llama_sampler_free(smpl: llama_sampler_p, /):
  function llama_sampler_chain_init (line 3741) | def llama_sampler_chain_init(params: llama_sampler_chain_params, /) -> l...
  function llama_sampler_chain_add (line 3752) | def llama_sampler_chain_add(chain: llama_sampler_p, smpl: llama_sampler_...
  function llama_sampler_chain_get (line 3762) | def llama_sampler_chain_get(
  function llama_sampler_chain_n (line 3774) | def llama_sampler_chain_n(chain: llama_sampler_p, /) -> int:
  function llama_sampler_chain_remove (line 3785) | def llama_sampler_chain_remove(
  function llama_sampler_init_greedy (line 3795) | def llama_sampler_init_greedy() -> llama_sampler_p:
  function llama_sampler_init_dist (line 3801) | def llama_sampler_init_dist(seed: int) -> llama_sampler_p:
  function llama_sampler_init_softmax (line 3810) | def llama_sampler_init_softmax() -> llama_sampler_p:
  function llama_sampler_init_top_k (line 3818) | def llama_sampler_init_top_k(k: int) -> llama_sampler_p:
  function llama_sampler_init_top_p (line 3829) | def llama_sampler_init_top_p(p: float, min_keep: int) -> llama_sampler_p:
  function llama_sampler_init_min_p (line 3840) | def llama_sampler_init_min_p(p: float, min_keep: int) -> llama_sampler_p:
  function llama_sampler_init_typical (line 3851) | def llama_sampler_init_typical(p: float, min_keep: int) -> llama_sampler_p:
  function llama_sampler_init_temp (line 3858) | def llama_sampler_init_temp(t: float) -> llama_sampler_p:
  function llama_sampler_init_temp_ext (line 3869) | def llama_sampler_init_temp_ext(
  function llama_sampler_init_xtc (line 3882) | def llama_sampler_init_xtc(
  function llama_sampler_init_top_n_sigma (line 3895) | def llama_sampler_init_top_n_sigma(n: float, /) -> llama_sampler_p:
  function llama_sampler_init_mirostat (line 3911) | def llama_sampler_init_mirostat(
  function llama_sampler_init_mirostat_v2 (line 3927) | def llama_sampler_init_mirostat_v2(
  function llama_sampler_init_grammar (line 3943) | def llama_sampler_init_grammar(
  function llama_sampler_init_grammar_lazy (line 3971) | def llama_sampler_init_grammar_lazy(
  function llama_sampler_init_grammar_lazy_patterns (line 4006) | def llama_sampler_init_grammar_lazy_patterns(
  function llama_sampler_init_penalties (line 4030) | def llama_sampler_init_penalties(
  function llama_sampler_init_dry (line 4064) | def llama_sampler_init_dry(
  function llama_sampler_init_logit_bias (line 4087) | def llama_sampler_init_logit_bias(
  function llama_sampler_init_infill (line 4100) | def llama_sampler_init_infill(vocab: llama_vocab_p, /) -> llama_sampler_p:
  function llama_sampler_get_seed (line 4111) | def llama_sampler_get_seed(smpl: llama_sampler_p, /) -> int:
  function llama_sampler_sample (line 4122) | def llama_sampler_sample(
  function llama_split_path (line 4139) | def llama_split_path(
  function llama_split_prefix (line 4158) | def llama_split_prefix(
  function llama_print_system_info (line 4173) | def llama_print_system_info() -> bytes:
  function llama_log_set (line 4185) | def llama_log_set(
  class llama_perf_context_data (line 4210) | class llama_perf_context_data(ctypes.Structure):
  class llama_perf_sampler_data (line 4227) | class llama_perf_sampler_data(ctypes.Structure):
  function llama_perf_context (line 4240) | def llama_perf_context(ctx: llama_context_p, /) -> llama_perf_context_data:
  function llama_perf_context_print (line 4250) | def llama_perf_context_print(ctx: llama_context_p, /):
  function llama_perf_context_reset (line 4260) | def llama_perf_context_reset(ctx: llama_context_p, /):
  function llama_perf_sampler (line 4271) | def llama_perf_sampler(chain: llama_sampler_p, /) -> llama_perf_sampler_...
  function llama_perf_sampler_print (line 4281) | def llama_perf_sampler_print(chain: llama_sampler_p, /):
  function llama_perf_sampler_reset (line 4291) | def llama_perf_sampler_reset(chain: llama_sampler_p, /):
  function llama_opt_param_filter_all (line 4310) | def llama_opt_param_filter_all(tensor: ctypes.c_void_p, userdata: ctypes...
  class llama_opt_params (line 4323) | class llama_opt_params(ctypes.Structure):
  function llama_opt_init (line 4339) | def llama_opt_init(lctx: llama_context_p, model: llama_model_p, lopt_par...
  function llama_opt_epoch (line 4364) | def llama_opt_epoch(

FILE: llama_cpp/llama_grammar.py
  class LlamaGrammar (line 19) | class LlamaGrammar:
    method __init__ (line 20) | def __init__(self, *args, _grammar: str, **kwargs):
    method from_string (line 25) | def from_string(cls, grammar: str, verbose: bool = True) -> "LlamaGram...
    method from_file (line 29) | def from_file(cls, file: Union[str, Path], verbose: bool = True) -> "L...
    method from_json_schema (line 46) | def from_json_schema(cls, json_schema: str, verbose: bool = True) -> "...
  function _build_repetition (line 254) | def _build_repetition(
  class BuiltinRule (line 310) | class BuiltinRule:
    method __init__ (line 311) | def __init__(self, content: str, deps: list = None):
  class SchemaConverter (line 380) | class SchemaConverter:
    method __init__ (line 381) | def __init__(self, *, prop_order, allow_fetch, dotall, raw_pattern):
    method _format_literal (line 392) | def _format_literal(self, literal):
    method not_literal (line 398) | def not_literal(
    method _add_rule (line 424) | def _add_rule(self, name, rule):
    method resolve_refs (line 439) | def resolve_refs(self, schema: dict, url: str):
    method _generate_union_rule (line 492) | def _generate_union_rule(self, name, alt_schemas):
    method _visit_pattern (line 500) | def _visit_pattern(self, pattern, name):
    method _resolve_ref (line 685) | def _resolve_ref(self, ref):
    method _generate_constant_rule (line 694) | def _generate_constant_rule(self, value):
    method visit (line 697) | def visit(self, schema, name):
    method _add_primitive (line 846) | def _add_primitive(self, name: str, rule: BuiltinRule):
    method _build_object_rule (line 856) | def _build_object_rule(
    method format_grammar (line 937) | def format_grammar(self):
  function json_schema_to_gbnf (line 944) | def json_schema_to_gbnf(schema: str, prop_order: Optional[List[str]] = N...

FILE: llama_cpp/llama_speculative.py
  class LlamaDraftModel (line 9) | class LlamaDraftModel(abc.ABC):
    method __call__ (line 11) | def __call__(
  class LlamaPromptLookupDecoding (line 17) | class LlamaPromptLookupDecoding(LlamaDraftModel):
    method __init__ (line 20) | def __init__(self, max_ngram_size: int = 2, num_pred_tokens: int = 10):
    method find_candidate_pred_tokens (line 25) | def find_candidate_pred_tokens(
    method __call__ (line 57) | def __call__(

FILE: llama_cpp/llama_tokenizer.py
  class BaseLlamaTokenizer (line 14) | class BaseLlamaTokenizer(abc.ABC):
    method tokenize (line 16) | def tokenize(
    method detokenize (line 29) | def detokenize(
  class LlamaTokenizer (line 45) | class LlamaTokenizer(BaseLlamaTokenizer):
    method __init__ (line 46) | def __init__(self, llama: llama_cpp.Llama):
    method tokenize (line 49) | def tokenize(
    method detokenize (line 54) | def detokenize(
    method encode (line 62) | def encode(
    method decode (line 69) | def decode(self, tokens: List[int]) -> str:
    method from_ggml_file (line 73) | def from_ggml_file(cls, path: str) -> "LlamaTokenizer":
  class LlamaHFTokenizer (line 77) | class LlamaHFTokenizer(BaseLlamaTokenizer):
    method __init__ (line 78) | def __init__(self, hf_tokenizer: Any):
    method tokenize (line 81) | def tokenize(
    method detokenize (line 88) | def detokenize(
    method from_pretrained (line 109) | def from_pretrained(cls, pretrained_model_name_or_path: str) -> "Llama...

FILE: llama_cpp/llama_types.py
  class EmbeddingUsage (line 20) | class EmbeddingUsage(TypedDict):
  class Embedding (line 25) | class Embedding(TypedDict):
  class CreateEmbeddingResponse (line 31) | class CreateEmbeddingResponse(TypedDict):
  class CompletionLogprobs (line 38) | class CompletionLogprobs(TypedDict):
  class CompletionChoice (line 45) | class CompletionChoice(TypedDict):
  class CompletionUsage (line 52) | class CompletionUsage(TypedDict):
  class CreateCompletionResponse (line 58) | class CreateCompletionResponse(TypedDict):
  class ChatCompletionResponseFunctionCall (line 67) | class ChatCompletionResponseFunctionCall(TypedDict):
  class ChatCompletionResponseMessage (line 72) | class ChatCompletionResponseMessage(TypedDict):
  class ChatCompletionFunction (line 79) | class ChatCompletionFunction(TypedDict):
  class ChatCompletionTopLogprobToken (line 85) | class ChatCompletionTopLogprobToken(TypedDict):
  class ChatCompletionLogprobToken (line 91) | class ChatCompletionLogprobToken(ChatCompletionTopLogprobToken):
  class ChatCompletionLogprobs (line 98) | class ChatCompletionLogprobs(TypedDict):
  class ChatCompletionResponseChoice (line 103) | class ChatCompletionResponseChoice(TypedDict):
  class CreateChatCompletionResponse (line 110) | class CreateChatCompletionResponse(TypedDict):
  class ChatCompletionMessageToolCallChunkFunction (line 119) | class ChatCompletionMessageToolCallChunkFunction(TypedDict):
  class ChatCompletionMessageToolCallChunk (line 124) | class ChatCompletionMessageToolCallChunk(TypedDict):
  class ChatCompletionStreamResponseDeltaEmpty (line 131) | class ChatCompletionStreamResponseDeltaEmpty(TypedDict):
  class ChatCompletionStreamResponseDeltaFunctionCall (line 135) | class ChatCompletionStreamResponseDeltaFunctionCall(TypedDict):
  class ChatCompletionStreamResponseDelta (line 140) | class ChatCompletionStreamResponseDelta(TypedDict):
  class ChatCompletionStreamResponseChoice (line 149) | class ChatCompletionStreamResponseChoice(TypedDict):
  class CreateChatCompletionStreamResponse (line 158) | class CreateChatCompletionStreamResponse(TypedDict):
  class ChatCompletionFunctions (line 166) | class ChatCompletionFunctions(TypedDict):
  class ChatCompletionFunctionCallOption (line 172) | class ChatCompletionFunctionCallOption(TypedDict):
  class ChatCompletionRequestResponseFormat (line 176) | class ChatCompletionRequestResponseFormat(TypedDict):
  class ChatCompletionRequestMessageContentPartText (line 183) | class ChatCompletionRequestMessageContentPartText(TypedDict):
  class ChatCompletionRequestMessageContentPartImageImageUrl (line 188) | class ChatCompletionRequestMessageContentPartImageImageUrl(TypedDict):
  class ChatCompletionRequestMessageContentPartImage (line 193) | class ChatCompletionRequestMessageContentPartImage(TypedDict):
  class ChatCompletionRequestSystemMessage (line 204) | class ChatCompletionRequestSystemMessage(TypedDict):
  class ChatCompletionRequestUserMessage (line 209) | class ChatCompletionRequestUserMessage(TypedDict):
  class ChatCompletionMessageToolCallFunction (line 214) | class ChatCompletionMessageToolCallFunction(TypedDict):
  class ChatCompletionMessageToolCall (line 219) | class ChatCompletionMessageToolCall(TypedDict):
  class ChatCompletionRequestAssistantMessageFunctionCall (line 228) | class ChatCompletionRequestAssistantMessageFunctionCall(TypedDict):
  class ChatCompletionRequestAssistantMessage (line 233) | class ChatCompletionRequestAssistantMessage(TypedDict):
  class ChatCompletionRequestToolMessage (line 242) | class ChatCompletionRequestToolMessage(TypedDict):
  class ChatCompletionRequestFunctionMessage (line 248) | class ChatCompletionRequestFunctionMessage(TypedDict):
  class ChatCompletionRequestFunctionCallOption (line 264) | class ChatCompletionRequestFunctionCallOption(TypedDict):
  class ChatCompletionToolFunction (line 275) | class ChatCompletionToolFunction(TypedDict):
  class ChatCompletionTool (line 281) | class ChatCompletionTool(TypedDict):
  class ChatCompletionNamedToolChoiceFunction (line 286) | class ChatCompletionNamedToolChoiceFunction(TypedDict):
  class ChatCompletionNamedToolChoice (line 290) | class ChatCompletionNamedToolChoice(TypedDict):

FILE: llama_cpp/llava_cpp.py
  class llava_image_embed (line 60) | class llava_image_embed(Structure):
  function llava_validate_embed_size (line 74) | def llava_validate_embed_size(
  function llava_image_embed_make_with_bytes (line 87) | def llava_image_embed_make_with_bytes(
  function llava_image_embed_make_with_filename (line 104) | def llava_image_embed_make_with_filename(
  function llava_image_embed_free (line 113) | def llava_image_embed_free(embed: "_Pointer[llava_image_embed]", /):
  function llava_eval_image_embed (line 129) | def llava_eval_image_embed(
  function clip_model_load (line 147) | def clip_model_load(
  function clip_free (line 156) | def clip_free(ctx: clip_ctx_p, /):

FILE: llama_cpp/mtmd_cpp.py
  class mtmd_context_params (line 75) | class mtmd_context_params(Structure):
  class mtmd_input_text (line 85) | class mtmd_input_text(Structure):
  function mtmd_default_marker (line 98) | def mtmd_default_marker() -> bytes:
  function mtmd_context_params_default (line 103) | def mtmd_context_params_default() -> mtmd_context_params:
  function mtmd_init_from_file (line 114) | def mtmd_init_from_file(
  function mtmd_free (line 124) | def mtmd_free(ctx: mtmd_context_p, /):
  function mtmd_support_vision (line 129) | def mtmd_support_vision(ctx: mtmd_context_p, /) -> bool:
  function mtmd_bitmap_init (line 138) | def mtmd_bitmap_init(
  function mtmd_bitmap_free (line 148) | def mtmd_bitmap_free(bitmap: mtmd_bitmap_p, /):
  function mtmd_input_chunks_init (line 153) | def mtmd_input_chunks_init() -> Optional[mtmd_input_chunks_p]:
  function mtmd_input_chunks_free (line 158) | def mtmd_input_chunks_free(chunks: mtmd_input_chunks_p, /):
  function mtmd_input_chunks_size (line 163) | def mtmd_input_chunks_size(chunks: mtmd_input_chunks_p, /) -> int:
  function mtmd_input_chunks_get (line 172) | def mtmd_input_chunks_get(
  function mtmd_tokenize (line 193) | def mtmd_tokenize(
  function mtmd_input_chunk_get_n_tokens (line 205) | def mtmd_input_chunk_get_n_tokens(chunk: mtmd_input_chunk_p, /) -> int:
  function mtmd_input_chunk_get_type (line 210) | def mtmd_input_chunk_get_type(chunk: mtmd_input_chunk_p, /) -> int:
  function mtmd_input_chunk_get_tokens_text (line 219) | def mtmd_input_chunk_get_tokens_text(
  function mtmd_helper_bitmap_init_from_buf (line 234) | def mtmd_helper_bitmap_init_from_buf(
  function mtmd_helper_get_n_tokens (line 244) | def mtmd_helper_get_n_tokens(chunks: mtmd_input_chunks_p, /) -> int:
  function mtmd_helper_eval_chunk_single (line 269) | def mtmd_helper_eval_chunk_single(

FILE: llama_cpp/server/__main__.py
  function main (line 43) | def main():

FILE: llama_cpp/server/app.py
  function set_server_settings (line 53) | def set_server_settings(server_settings: ServerSettings):
  function get_server_settings (line 58) | def get_server_settings():
  function set_llama_proxy (line 68) | def set_llama_proxy(model_settings: List[ModelSettings]):
  function get_llama_proxy (line 73) | async def get_llama_proxy():
  function set_ping_message_factory (line 95) | def set_ping_message_factory(factory: typing.Callable[[], bytes]):
  function create_app (line 100) | def create_app(
  function prepare_request_resources (line 158) | def prepare_request_resources(
  function get_event_publisher (line 191) | async def get_event_publisher(
  function _logit_bias_tokens_to_input_ids (line 225) | def _logit_bias_tokens_to_input_ids(
  function authenticate (line 241) | async def authenticate(
  function create_completion (line 303) | async def create_completion(
  function create_embedding (line 366) | async def create_embedding(
  function create_chat_completion (line 408) | async def create_chat_completion(
  function get_models (line 535) | async def get_models(
  function tokenize (line 561) | async def tokenize(
  function count_query_tokens (line 576) | async def count_query_tokens(
  function detokenize (line 591) | async def detokenize(

FILE: llama_cpp/server/cli.py
  function _get_base_type (line 10) | def _get_base_type(annotation: Type[Any]) -> Type[Any]:
  function _contains_list_type (line 30) | def _contains_list_type(annotation: Type[Any] | None) -> bool:
  function _parse_bool_arg (line 41) | def _parse_bool_arg(arg: str | bytes | bool) -> bool:
  function add_args_from_model (line 58) | def add_args_from_model(parser: argparse.ArgumentParser, model: Type[Bas...
  function parse_model_from_args (line 89) | def parse_model_from_args(model: T, args: argparse.Namespace) -> T:

FILE: llama_cpp/server/errors.py
  class ErrorResponse (line 26) | class ErrorResponse(TypedDict):
  class ErrorResponseFormatters (line 35) | class ErrorResponseFormatters:
    method context_length_exceeded (line 48) | def context_length_exceeded(
    method model_not_found (line 86) | def model_not_found(
  class RouteErrorHandler (line 102) | class RouteErrorHandler(APIRoute):
    method error_message_wrapper (line 125) | def error_message_wrapper(
    method get_route_handler (line 162) | def get_route_handler(

FILE: llama_cpp/server/model.py
  class LlamaProxy (line 14) | class LlamaProxy:
    method __init__ (line 15) | def __init__(self, models: List[ModelSettings]) -> None:
    method __call__ (line 36) | def __call__(self, model: Optional[str] = None) -> llama_cpp.Llama:
    method __getitem__ (line 56) | def __getitem__(self, model: str):
    method __setitem__ (line 59) | def __setitem__(self, model: str, settings: Union[ModelSettings, str, ...
    method __iter__ (line 64) | def __iter__(self):
    method free (line 68) | def free(self):
    method load_llama_from_model_settings (line 74) | def load_llama_from_model_settings(settings: ModelSettings) -> llama_c...

FILE: llama_cpp/server/settings.py
  class ModelSettings (line 17) | class ModelSettings(BaseSettings):
    method set_dynamic_defaults (line 191) | def set_dynamic_defaults(self) -> Self:
  class ServerSettings (line 202) | class ServerSettings(BaseSettings):
  class Settings (line 233) | class Settings(ServerSettings, ModelSettings):
  class ConfigFileSettings (line 237) | class ConfigFileSettings(ServerSettings):

FILE: llama_cpp/server/types.py
  class CreateCompletionRequest (line 109) | class CreateCompletionRequest(BaseModel):
  class CreateEmbeddingRequest (line 167) | class CreateEmbeddingRequest(BaseModel):
  class ChatCompletionRequestMessage (line 183) | class ChatCompletionRequestMessage(BaseModel):
  class CreateChatCompletionRequest (line 192) | class CreateChatCompletionRequest(BaseModel):
  class ModelData (line 271) | class ModelData(TypedDict):
  class ModelList (line 278) | class ModelList(TypedDict):
  class TokenizeInputRequest (line 283) | class TokenizeInputRequest(BaseModel):
  class TokenizeInputResponse (line 292) | class TokenizeInputResponse(BaseModel):
  class TokenizeInputCountResponse (line 298) | class TokenizeInputCountResponse(BaseModel):
  class DetokenizeInputRequest (line 304) | class DetokenizeInputRequest(BaseModel):
  class DetokenizeInputResponse (line 311) | class DetokenizeInputResponse(BaseModel):

FILE: tests/test_llama.py
  function test_llama_cpp_version (line 18) | def test_llama_cpp_version():
  function test_llama_cpp_tokenization (line 22) | def test_llama_cpp_tokenization():
  function llama_cpp_model_path (line 60) | def llama_cpp_model_path():
  function test_real_model (line 67) | def test_real_model(llama_cpp_model_path):
  function test_real_llama (line 117) | def test_real_llama(llama_cpp_model_path):
  function test_real_llama_embeddings (line 221) | def test_real_llama_embeddings(llama_cpp_model_path):

FILE: tests/test_llama_chat_format.py
  function test_mistral_instruct (line 13) | def test_mistral_instruct():
  function test_hf_tokenizer_config_str_to_chat_formatter (line 78) | def test_hf_tokenizer_config_str_to_chat_formatter():

FILE: tests/test_llama_grammar.py
  function test_grammar_from_string (line 11) | def test_grammar_from_string():
  function test_composed_pydantic_grammar (line 18) | def test_composed_pydantic_grammar():
  function test_grammar_anyof (line 55) | def test_grammar_anyof():

FILE: tests/test_llama_speculative.py
  function test_find_candidate_pred_tokens (line 5) | def test_find_candidate_pred_tokens():


Condensed preview: 99 files, each showing path, character count, and a content snippet (1,364K chars of structured content in full).

[
  {
    "path": ".dockerignore",
    "chars": 3105,
    "preview": "_skbuild/\n\n.envrc\n\nmodels/\n\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*."
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "chars": 3826,
    "preview": "---\nname: Bug report\nabout: Create a report to help us improve\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n# Prerequisites\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "chars": 595,
    "preview": "---\nname: Feature request\nabout: Suggest an idea for this project\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n**Is your fea"
  },
  {
    "path": ".github/dependabot.yml",
    "chars": 690,
    "preview": "# To get started with Dependabot version updates, you'll need to specify which\n# package ecosystems to update and where "
  },
  {
    "path": ".github/workflows/build-and-release.yaml",
    "chars": 3775,
    "preview": "name: Build Release\n\non: workflow_dispatch\n\npermissions:\n  contents: write\n\njobs:\n  build_wheels:\n    name: Build wheels"
  },
  {
    "path": ".github/workflows/build-docker.yaml",
    "chars": 1376,
    "preview": "name: Build Docker\n\non: workflow_dispatch\n\npermissions:\n  contents: write\n  packages: write\n\njobs:\n  docker:\n    name: B"
  },
  {
    "path": ".github/workflows/build-wheels-cuda.yaml",
    "chars": 5054,
    "preview": "name: Build Wheels (CUDA)\n\non: workflow_dispatch\n\npermissions:\n  contents: write\n\njobs:\n  define_matrix:\n    name: Defin"
  },
  {
    "path": ".github/workflows/build-wheels-metal.yaml",
    "chars": 1755,
    "preview": "name: Build Wheels (Metal)\n\non: workflow_dispatch\n\npermissions:\n  contents: write\n\njobs:\n  build_wheels:\n    name: Build"
  },
  {
    "path": ".github/workflows/generate-index-from-release.yaml",
    "chars": 2140,
    "preview": "name: Wheels Index\n\non:\n  # Trigger on new release\n  workflow_run:\n    workflows: [\"Release\", \"Build Wheels (CUDA)\", \"Bu"
  },
  {
    "path": ".github/workflows/publish-to-test.yaml",
    "chars": 1819,
    "preview": "# Based on: https://packaging.python.org/en/latest/guides/publishing-package-distribution-releases-using-github-actions-"
  },
  {
    "path": ".github/workflows/publish.yaml",
    "chars": 1390,
    "preview": "name: Publish to PyPI\n\n# Based on: https://packaging.python.org/en/latest/guides/publishing-package-distribution-release"
  },
  {
    "path": ".github/workflows/test-pypi.yaml",
    "chars": 3126,
    "preview": "name: Tests for PyPI package\n\non: workflow_dispatch\n\njobs:\n  build-linux:\n\n    runs-on: ubuntu-latest\n    strategy:\n    "
  },
  {
    "path": ".github/workflows/test.yaml",
    "chars": 4810,
    "preview": "name: Tests\non:\n  pull_request:\n    branches:\n      - main\n  push:\n    branches:\n      - main\n\nenv:\n  REPO_ID: Qwen/Qwen"
  },
  {
    "path": ".gitignore",
    "chars": 3282,
    "preview": "*.local\n\n.python-version\n\n.vscode/\n\n_skbuild/\n\n.envrc\n.direnv\n\nmodels/\n\n# Byte-compiled / optimized / DLL files\n__pycach"
  },
  {
    "path": ".gitmodules",
    "chars": 106,
    "preview": "[submodule \"vendor/llama.cpp\"]\n\tpath = vendor/llama.cpp\n\turl = https://github.com/ggerganov/llama.cpp.git\n"
  },
  {
    "path": ".readthedocs.yaml",
    "chars": 444,
    "preview": "# Read the Docs configuration file for MkDocs projects\n# See https://docs.readthedocs.io/en/stable/config-file/v2.html f"
  },
  {
    "path": "CHANGELOG.md",
    "chars": 41655,
    "preview": "# Changelog\n\nAll notable changes to this project will be documented in this file.\n\nThe format is based on [Keep a Change"
  },
  {
    "path": "CMakeLists.txt",
    "chars": 6869,
    "preview": "cmake_minimum_required(VERSION 3.21)\n\nproject(llama_cpp)\n\noption(LLAMA_BUILD \"Build llama.cpp shared library and install"
  },
  {
    "path": "LICENSE.md",
    "chars": 1069,
    "preview": "MIT License\n\nCopyright (c) 2023 Andrei Betlen\n\nPermission is hereby granted, free of charge, to any person obtaining a c"
  },
  {
    "path": "Makefile",
    "chars": 2355,
    "preview": "update:\n\tpoetry install\n\tgit submodule update --init --recursive\n\nupdate.vendor:\n\tcd vendor/llama.cpp && git pull origin"
  },
  {
    "path": "README.md",
    "chars": 31359,
    "preview": "<p align=\"center\">\n  <img src=\"https://raw.githubusercontent.com/abetlen/llama-cpp-python/main/docs/icon.svg\" style=\"hei"
  },
  {
    "path": "docker/README.md",
    "chars": 2734,
    "preview": "### Install Docker Server\n> [!IMPORTANT]  \n> This was tested with Docker running on Linux. <br>If you can get it working"
  },
  {
    "path": "docker/cuda_simple/Dockerfile",
    "chars": 874,
    "preview": "ARG CUDA_IMAGE=\"12.5.0-devel-ubuntu22.04\"\nFROM nvidia/cuda:${CUDA_IMAGE}\n\n# We need to set the host to 0.0.0.0 to allow "
  },
  {
    "path": "docker/open_llama/Dockerfile",
    "chars": 1586,
    "preview": "# Define the image argument and provide a default value\nARG IMAGE=python:3-slim-bookworm\n\n# Use the image as specified\nF"
  },
  {
    "path": "docker/open_llama/build.sh",
    "chars": 341,
    "preview": "#!/bin/sh\n\nMODEL=\"open_llama_3b\"\n# Get  open_llama_3b_ggml q5_1 quantization\npython3 ./hug_model.py -a SlyEcho -s ${MODE"
  },
  {
    "path": "docker/open_llama/hug_model.py",
    "chars": 4944,
    "preview": "import requests\nimport json\nimport os\nimport struct\nimport argparse\n\ndef make_request(url, params=None):\n    print(f\"Mak"
  },
  {
    "path": "docker/open_llama/start.sh",
    "chars": 603,
    "preview": "#!/bin/sh\n\nMODEL=\"open_llama_3b\"\n\n# Start Docker container\ndocker run --cap-add SYS_RESOURCE -p 8000:8000 -t $MODEL &\nsl"
  },
  {
    "path": "docker/open_llama/start_server.sh",
    "chars": 338,
    "preview": "#!/bin/sh\n\n# For mlock support\nulimit -l unlimited\n\nif [ \"$IMAGE\" = \"python:3-slim-bullseye\" ]; then\n    python3 -B -m l"
  },
  {
    "path": "docker/openblas_simple/Dockerfile",
    "chars": 593,
    "preview": "FROM python:3-slim-bookworm\n\n# We need to set the host to 0.0.0.0 to allow outside access\nENV HOST 0.0.0.0\n\nCOPY . .\n\n# "
  },
  {
    "path": "docker/simple/Dockerfile",
    "chars": 944,
    "preview": "# Define the image argument and provide a default value\nARG IMAGE=python:3-slim-bookworm\n\n# Use the image as specified\nF"
  },
  {
    "path": "docker/simple/run.sh",
    "chars": 100,
    "preview": "#!/bin/bash\n\nmake build\nuvicorn --factory llama_cpp.server.app:create_app --host $HOST --port $PORT\n"
  },
  {
    "path": "docs/api-reference.md",
    "chars": 1841,
    "preview": "---\ntitle: API Reference\n---\n\n## High Level API\n\nHigh-level Python bindings for llama.cpp.\n\n::: llama_cpp.Llama\n    opti"
  },
  {
    "path": "docs/changelog.md",
    "chars": 19,
    "preview": "-8<- \"CHANGELOG.md\""
  },
  {
    "path": "docs/index.md",
    "chars": 48,
    "preview": "---\ntitle: Getting Started\n---\n\n-8<- \"README.md\""
  },
  {
    "path": "docs/install/macos.md",
    "chars": 1670,
    "preview": "---\ntitle: MacOS Install with Metal GPU\n---\n\n**(1) Make sure you have xcode installed... at least the command line parts"
  },
  {
    "path": "docs/requirements.txt",
    "chars": 43,
    "preview": "mkdocs\nmkdocs-material\nmkdocstrings[python]"
  },
  {
    "path": "docs/server.md",
    "chars": 7763,
    "preview": "# OpenAI Compatible Server\n\n`llama-cpp-python` offers an OpenAI API compatible web server.\n\nThis web server can be used "
  },
  {
    "path": "examples/batch-processing/server.py",
    "chars": 755,
    "preview": "\"\"\"llama-cpp-python server from scratch in a single file.\n\"\"\"\n\n# import llama_cpp\n\n# path = b\"../../models/Qwen1.5-0.5B-"
  },
  {
    "path": "examples/gradio_chat/local.py",
    "chars": 1507,
    "preview": "import llama_cpp\nimport llama_cpp.llama_tokenizer\n\nimport gradio as gr\n\nllama = llama_cpp.Llama.from_pretrained(\n    rep"
  },
  {
    "path": "examples/gradio_chat/server.py",
    "chars": 1308,
    "preview": "import gradio as gr\n\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"llama.cpp\""
  },
  {
    "path": "examples/hf_pull/main.py",
    "chars": 908,
    "preview": "import llama_cpp\nimport llama_cpp.llama_tokenizer\n\n\nllama = llama_cpp.Llama.from_pretrained(\n    repo_id=\"Qwen/Qwen1.5-0"
  },
  {
    "path": "examples/high_level_api/fastapi_server.py",
    "chars": 654,
    "preview": "\"\"\"Example FastAPI server for llama.cpp.\n\nTo run this example:\n\n```bash\npip install fastapi uvicorn sse-starlette\nexport"
  },
  {
    "path": "examples/high_level_api/high_level_api_embedding.py",
    "chars": 291,
    "preview": "import argparse\n\nfrom llama_cpp import Llama\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\"-m\", \"--model\", ty"
  },
  {
    "path": "examples/high_level_api/high_level_api_inference.py",
    "chars": 435,
    "preview": "import json\nimport argparse\n\nfrom llama_cpp import Llama\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\"-m\", \""
  },
  {
    "path": "examples/high_level_api/high_level_api_infill.py",
    "chars": 1253,
    "preview": "import argparse\n\nfrom llama_cpp import Llama\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\"-m\", \"--model\", ty"
  },
  {
    "path": "examples/high_level_api/high_level_api_streaming.py",
    "chars": 463,
    "preview": "import json\nimport argparse\n\nfrom llama_cpp import Llama\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\"-m\", \""
  },
  {
    "path": "examples/high_level_api/langchain_custom_llm.py",
    "chars": 1517,
    "preview": "import argparse\n\nfrom llama_cpp import Llama\n\nfrom langchain.llms.base import LLM\nfrom typing import Optional, List, Map"
  },
  {
    "path": "examples/low_level_api/Chat.py",
    "chars": 2853,
    "preview": "#!/bin/python\nimport sys, os, datetime\nfrom common import GptParams\nfrom low_level_api_chat_cpp import LLaMAInteract\n\n\nd"
  },
  {
    "path": "examples/low_level_api/Miku.py",
    "chars": 2806,
    "preview": "#!/bin/python\nimport sys, os\nfrom common import GptParams\nfrom low_level_api_chat_cpp import LLaMAInteract\n\n\ndef env_or_"
  },
  {
    "path": "examples/low_level_api/ReasonAct.py",
    "chars": 1457,
    "preview": "#!/bin/python\nimport sys, os, datetime\nfrom common import GptParams\nfrom low_level_api_chat_cpp import LLaMAInteract\n\n\nd"
  },
  {
    "path": "examples/low_level_api/common.py",
    "chars": 10579,
    "preview": "import os\nimport argparse\nimport re\n\nfrom dataclasses import dataclass, field\nfrom typing import List\n\n# Based on https:"
  },
  {
    "path": "examples/low_level_api/low_level_api_chat_cpp.py",
    "chars": 30244,
    "preview": "\"\"\"\nThis is an example implementation of main.cpp from llama.cpp\nQuirks:\n * Its not exactly alike since this port is des"
  },
  {
    "path": "examples/low_level_api/low_level_api_llama_cpp.py",
    "chars": 3825,
    "preview": "import ctypes\nimport os\nimport multiprocessing\n\nimport llama_cpp\n\nllama_cpp.llama_backend_init(numa=False)\n\nN_THREADS = "
  },
  {
    "path": "examples/low_level_api/quantize.py",
    "chars": 1044,
    "preview": "import os\nimport argparse\nimport llama_cpp\n\n\ndef main(args):\n    fname_inp = args.fname_inp.encode(\"utf-8\")\n    fname_ou"
  },
  {
    "path": "examples/low_level_api/readme/low_level_api_llama_cpp.md",
    "chars": 2213,
    "preview": "# Low-Level API for Llama_cpp\n\n## Overview\nThis Python script, low_level_api_llama_cpp.py, demonstrates the implementati"
  },
  {
    "path": "examples/low_level_api/util.py",
    "chars": 2702,
    "preview": "ANSI_COLOR_RESET = \"\\x1b[0m\"\nANSI_COLOR_YELLOW = \"\\x1b[33m\"\nANSI_BOLD = \"\\x1b[1m\"\nANSI_COLOR_GREEN = \"\\x1b[32m\"\n\nCONSOLE"
  },
  {
    "path": "examples/notebooks/Batching.ipynb",
    "chars": 18737,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "examples/notebooks/Clients.ipynb",
    "chars": 2600,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\":"
  },
  {
    "path": "examples/notebooks/Functions.ipynb",
    "chars": 18660,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Functions\\n\",\n    \"\\n\",\n    \"The "
  },
  {
    "path": "examples/notebooks/Guidance.ipynb",
    "chars": 10288,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\":"
  },
  {
    "path": "examples/notebooks/Multimodal.ipynb",
    "chars": 2341,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<div>\\n\",\n    \"    <img src=\\\"https"
  },
  {
    "path": "examples/notebooks/OpenHermesFunctionCalling.ipynb",
    "chars": 60163,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\":"
  },
  {
    "path": "examples/notebooks/PerformanceTuning.ipynb",
    "chars": 361315,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "examples/ray/README.md",
    "chars": 607,
    "preview": "This is an example of doing LLM inference with [Ray](https://docs.ray.io/en/latest/index.html) and [Ray Serve](https://d"
  },
  {
    "path": "examples/ray/llm.py",
    "chars": 645,
    "preview": "from starlette.requests import Request\nfrom typing import Dict\nfrom ray import serve\nfrom ray.serve import Application\nf"
  },
  {
    "path": "examples/ray/requirements.txt",
    "chars": 97,
    "preview": "ray[serve]\n--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu\nllama-cpp-python\n"
  },
  {
    "path": "llama_cpp/__init__.py",
    "chars": 70,
    "preview": "from .llama_cpp import *\nfrom .llama import *\n\n__version__ = \"0.3.16\"\n"
  },
  {
    "path": "llama_cpp/_ctypes_extensions.py",
    "chars": 4085,
    "preview": "from __future__ import annotations\n\nimport sys\nimport os\nimport ctypes\nimport functools\nimport pathlib\n\nfrom typing impo"
  },
  {
    "path": "llama_cpp/_ggml.py",
    "chars": 369,
    "preview": "\"\"\"Internal module use at your own risk\n\nThis module provides a minimal interface for working with ggml tensors from lla"
  },
  {
    "path": "llama_cpp/_internals.py",
    "chars": 29562,
    "preview": "from __future__ import annotations\n\nimport os\nimport ctypes\n\nfrom typing import (\n    Dict,\n    List,\n    Tuple,\n    Opt"
  },
  {
    "path": "llama_cpp/_logger.py",
    "chars": 1309,
    "preview": "import sys\nimport ctypes\nimport logging\n\nimport llama_cpp\n\n# enum ggml_log_level {\n#     GGML_LOG_LEVEL_NONE  = 0,\n#    "
  },
  {
    "path": "llama_cpp/_utils.py",
    "chars": 2260,
    "preview": "import os\nimport sys\n\nfrom typing import Any, Dict\n\n# Avoid \"LookupError: unknown encoding: ascii\" when open() called in"
  },
  {
    "path": "llama_cpp/llama.py",
    "chars": 96171,
    "preview": "from __future__ import annotations\n\nimport os\nimport sys\nimport uuid\nimport time\nimport json\nimport ctypes\nimport typing"
  },
  {
    "path": "llama_cpp/llama_cache.py",
    "chars": 5010,
    "preview": "import sys\nfrom abc import ABC, abstractmethod\nfrom typing import (\n    Optional,\n    Sequence,\n    Tuple,\n)\nfrom collec"
  },
  {
    "path": "llama_cpp/llama_chat_format.py",
    "chars": 157215,
    "preview": "from __future__ import annotations\n\nimport os\nimport sys\nimport json\nimport ctypes\nimport dataclasses\nimport random\nimpo"
  },
  {
    "path": "llama_cpp/llama_cpp.py",
    "chars": 152716,
    "preview": "from __future__ import annotations\n\nimport os\nimport ctypes\nimport pathlib\n\nfrom typing import (\n    Callable,\n    Union"
  },
  {
    "path": "llama_cpp/llama_grammar.py",
    "chars": 32913,
    "preview": "\"\"\"Python implementation of llama grammar parser directly translated from C++ source file in vendor/llama.cpp/common/gra"
  },
  {
    "path": "llama_cpp/llama_speculative.py",
    "chars": 2088,
    "preview": "import abc\n\nfrom typing import Any\n\nimport numpy as np\nimport numpy.typing as npt\n\n\nclass LlamaDraftModel(abc.ABC):\n    "
  },
  {
    "path": "llama_cpp/llama_tokenizer.py",
    "chars": 3876,
    "preview": "from __future__ import annotations\n\nimport abc\nfrom typing import (\n    List,\n    Optional,\n    Any,\n)\n\nimport llama_cpp"
  },
  {
    "path": "llama_cpp/llama_types.py",
    "chars": 8666,
    "preview": "\"\"\"Types and request signatures for OpenAI compatibility\n\nNOTE: These types may change to match the OpenAI OpenAPI speci"
  },
  {
    "path": "llama_cpp/llava_cpp.py",
    "chars": 4552,
    "preview": "from __future__ import annotations\n\nimport os\nfrom ctypes import (\n    c_bool,\n    c_char_p,\n    c_int,\n    c_uint8,\n   "
  },
  {
    "path": "llama_cpp/mtmd_cpp.py",
    "chars": 8834,
    "preview": "from __future__ import annotations\n\nimport os\nfrom ctypes import (\n    c_bool,\n    c_char_p,\n    c_int,\n    c_uint8,\n   "
  },
  {
    "path": "llama_cpp/py.typed",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "llama_cpp/server/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "llama_cpp/server/__main__.py",
    "chars": 2843,
    "preview": "\"\"\"Example FastAPI server for llama.cpp.\n\nTo run this example:\n\n```bash\npip install fastapi uvicorn sse-starlette pydant"
  },
  {
    "path": "llama_cpp/server/app.py",
    "chars": 19569,
    "preview": "from __future__ import annotations\n\nimport os\nimport json\nimport typing\nimport contextlib\n\nfrom anyio import Lock\nfrom f"
  },
  {
    "path": "llama_cpp/server/cli.py",
    "chars": 3268,
    "preview": "from __future__ import annotations\n\nimport argparse\n\nfrom typing import List, Literal, Union, Any, Type, TypeVar\n\nfrom p"
  },
  {
    "path": "llama_cpp/server/errors.py",
    "chars": 7164,
    "preview": "from __future__ import annotations\n\nimport sys\nimport traceback\nimport time\nfrom re import compile, Match, Pattern\nfrom "
  },
  {
    "path": "llama_cpp/server/model.py",
    "chars": 13556,
    "preview": "from __future__ import annotations\n\nimport json\n\nfrom typing import Dict, Optional, Union, List\n\nimport llama_cpp\nimport"
  },
  {
    "path": "llama_cpp/server/settings.py",
    "chars": 8566,
    "preview": "from __future__ import annotations\n\nimport multiprocessing\n\nfrom typing import Optional, List, Literal, Union, Dict, cas"
  },
  {
    "path": "llama_cpp/server/types.py",
    "chars": 12216,
    "preview": "from __future__ import annotations\n\nfrom typing import List, Optional, Union, Dict\nfrom typing_extensions import TypedDi"
  },
  {
    "path": "mkdocs.yml",
    "chars": 1825,
    "preview": "site_name: llama-cpp-python\nrepo_url: https://github.com/abetlen/llama-cpp-python\n\ntheme:\n  name: material\n  palette: \n\n"
  },
  {
    "path": "pyproject.toml",
    "chars": 2111,
    "preview": "[build-system]\nrequires = [\"scikit-build-core[pyproject]>=0.9.2\"]\nbuild-backend = \"scikit_build_core.build\"\n\n[project]\nn"
  },
  {
    "path": "scripts/get-releases.sh",
    "chars": 1319,
    "preview": "#!/bin/bash\n\n# Function to get all releases\nget_all_releases() {\n    local page=1\n    local per_page=100\n    local relea"
  },
  {
    "path": "scripts/releases-to-pep-503.sh",
    "chars": 3002,
    "preview": "#!/bin/bash\n\n# Enable exit on error\nset -e\n\n# Function for logging\nlog_error() {\n    echo \"ERROR: $1\" >&2\n}\n\nlog_info() "
  },
  {
    "path": "tests/test_llama.py",
    "chars": 6376,
    "preview": "import ctypes\nimport multiprocessing\n\nimport numpy as np\nfrom scipy.special import log_softmax\n\nfrom huggingface_hub imp"
  },
  {
    "path": "tests/test_llama_chat_format.py",
    "chars": 3442,
    "preview": "import json\n\nimport jinja2\n\nfrom llama_cpp import (\n    ChatCompletionRequestUserMessage,\n)\nimport llama_cpp.llama_types"
  },
  {
    "path": "tests/test_llama_grammar.py",
    "chars": 1908,
    "preview": "import llama_cpp\nimport json\n\ntree = \"\"\"\nleaf ::= \".\"\nnode ::= leaf | \"(\" node node \")\"\nroot ::= node\n\"\"\"\n\n\ndef test_gra"
  },
  {
    "path": "tests/test_llama_speculative.py",
    "chars": 696,
    "preview": "import numpy as np\n\nfrom llama_cpp.llama_speculative import LlamaPromptLookupDecoding\n\ndef test_find_candidate_pred_toke"
  }
]

About this extraction

This page contains the full source code of the abetlen/llama-cpp-python GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 99 files (1.2 MB), approximately 382.1k tokens, and a symbol index with 750 extracted functions, classes, methods, constants, and types.

