[
  {
    "path": ".dockerignore",
    "content": "_skbuild/\n\n.envrc\n\nmodels/\n\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\ncover/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\n.pybuilder/\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n#   For a library or package, you might want to ignore these files since the code is\n#   intended to run in multiple environments; otherwise, check them in:\n# .python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# poetry\n#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.\n#   This is especially recommended for binary packages to ensure reproducibility, and is more\n#   commonly ignored for libraries.\n#   
https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control\n#poetry.lock\n\n# pdm\n#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.\n#pdm.lock\n#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it\n#   in version control.\n#   https://pdm.fming.dev/#use-with-ide\n.pdm.toml\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\n# pytype static type analyzer\n.pytype/\n\n# Cython debug symbols\ncython_debug/\n\n# PyCharm\n#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can\n#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore\n#  and can be added to the global gitignore or merged into this file.  For a more nuclear\n#  option (not recommended) you can uncomment the following to ignore the entire idea folder.\n.idea/\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "content": "---\nname: Bug report\nabout: Create a report to help us improve\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n# Prerequisites\n\nPlease answer the following questions for yourself before submitting an issue.\n\n- [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.\n- [ ] I carefully followed the [README.md](https://github.com/abetlen/llama-cpp-python/blob/main/README.md).\n- [ ] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).\n- [ ] I reviewed the [Discussions](https://github.com/abetlen/llama-cpp-python/discussions), and have a new bug or useful enhancement to share.\n\n# Expected Behavior\n\nPlease provide a detailed written description of what you were trying to do, and what you expected `llama-cpp-python` to do.\n\n# Current Behavior\n\nPlease provide a detailed written description of what `llama-cpp-python` did, instead.\n\n# Environment and Context\n\nPlease provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.\n\n* Physical (or virtual) hardware you are using, e.g. for Linux:\n\n`$ lscpu`\n\n* Operating System, e.g. for Linux:\n\n`$ uname -a`\n\n* SDK version, e.g. for Linux:\n\n```\n$ python3 --version\n$ make --version\n$ g++ --version\n```\n\n# Failure Information (for bugs)\n\nPlease help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.\n\n# Steps to Reproduce\n\nPlease provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.\n\n1. step 1\n2. step 2\n3. step 3\n4. 
etc.\n\n**Note: Many issues seem to be regarding functional or performance issues / differences with `llama.cpp`. In these cases we need to confirm that you're comparing against the version of `llama.cpp` that was built with your python package, and which parameters you're passing to the context.**\n\nTry the following:\n\n1. `git clone https://github.com/abetlen/llama-cpp-python`\n2. `cd llama-cpp-python`\n3. `rm -rf _skbuild/` # delete any old builds\n4. `python -m pip install .`\n5. `cd ./vendor/llama.cpp`\n6. Follow [llama.cpp's instructions](https://github.com/ggerganov/llama.cpp#build) to `cmake` llama.cpp\n7. Run llama.cpp's `./main` with the same arguments you previously passed to llama-cpp-python and see if you can reproduce the issue. If you can, [log an issue with llama.cpp](https://github.com/ggerganov/llama.cpp/issues)\n\n# Failure Logs\n\nPlease include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.\n\nAlso, please try to **avoid using screenshots** if at all possible. 
Instead, copy/paste the console output and use [GitHub's Markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) to cleanly format your logs for easy readability.\n\nExample environment info:\n```\nllama-cpp-python$ git log | head -1\ncommit 47b0aa6e957b93dbe2c29d53af16fbae2dd628f2\n\nllama-cpp-python$ python3 --version\nPython 3.10.10\n\nllama-cpp-python$ pip list | egrep \"uvicorn|fastapi|sse-starlette|numpy\"\nfastapi                  0.95.0\nnumpy                    1.24.3\nsse-starlette            1.3.3\nuvicorn                  0.21.1\n\nllama-cpp-python/vendor/llama.cpp$ git log | head -3\ncommit 66874d4fbcc7866377246efbcee938e8cc9c7d76\nAuthor: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>\nDate:   Thu May 25 20:18:01 2023 -0600\n```\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "content": "---\nname: Feature request\nabout: Suggest an idea for this project\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n**Is your feature request related to a problem? Please describe.**\nA clear and concise description of what the problem is. Ex. I'm always frustrated when [...]\n\n**Describe the solution you'd like**\nA clear and concise description of what you want to happen.\n\n**Describe alternatives you've considered**\nA clear and concise description of any alternative solutions or features you've considered.\n\n**Additional context**\nAdd any other context or screenshots about the feature request here.\n"
  },
  {
    "path": ".github/dependabot.yml",
    "content": "# To get started with Dependabot version updates, you'll need to specify which\n# package ecosystems to update and where the package manifests are located.\n# Please see the documentation for all configuration options:\n# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates\n\nversion: 2\nupdates:\n  - package-ecosystem: \"pip\" # See documentation for possible values\n    directory: \"/\" # Location of package manifests\n    schedule:\n      interval: \"daily\"\n  - package-ecosystem: \"github-actions\"\n    directory: \"/\"\n    schedule:\n      interval: \"daily\"\n  - package-ecosystem: \"docker\"\n    directory: \"/\"\n    schedule:\n      interval: \"daily\"   \n"
  },
  {
    "path": ".github/workflows/build-and-release.yaml",
    "content": "name: Build Release\n\non: workflow_dispatch\n\npermissions:\n  contents: write\n\njobs:\n  build_wheels:\n    name: Build wheels on ${{ matrix.os }}\n    runs-on: ${{ matrix.os }}\n    strategy:\n      matrix:\n        os: [ubuntu-22.04, windows-2022, macos-14, macos-15]\n\n    steps:\n      - uses: actions/checkout@v4\n        with:\n          submodules: \"recursive\"\n\n      # Used to host cibuildwheel\n      - uses: actions/setup-python@v5\n        with:\n          python-version: \"3.9\"\n\n      - name: Install dependencies (Linux/MacOS)\n        if: runner.os != 'Windows'\n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install uv\n          RUST_LOG=trace python -m uv pip install -e .[all] --verbose\n        shell: bash\n\n      - name: Install dependencies (Windows)\n        if: runner.os == 'Windows'\n        env:\n          RUST_LOG: trace        \n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install uv\n          python -m uv pip install -e .[all] --verbose\n        shell: cmd\n\n      - name: Build wheels\n        uses: pypa/cibuildwheel@v2.22.0\n        env:\n          # disable repair\n          CIBW_REPAIR_WHEEL_COMMAND: \"\"\n        with:\n          package-dir: .\n          output-dir: wheelhouse\n\n      - uses: actions/upload-artifact@v4\n        with:\n          name: wheels-${{ matrix.os }}\n          path: ./wheelhouse/*.whl\n\n  build_wheels_arm64:\n    name: Build arm64 wheels\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n        with:\n          submodules: \"recursive\"\n\n      - name: Set up QEMU\n        uses: docker/setup-qemu-action@v3\n        with:\n          platforms: linux/arm64\n\n      - name: Build wheels\n        uses: pypa/cibuildwheel@v2.22.0\n        env:\n          CIBW_SKIP: \"*musllinux* pp*\"\n          CIBW_REPAIR_WHEEL_COMMAND: \"\"\n          CIBW_ARCHS: \"aarch64\"\n          
CIBW_ENVIRONMENT: CMAKE_ARGS=\"-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DCMAKE_CROSSCOMPILING=ON\"\n          CIBW_BUILD: \"cp38-* cp39-* cp310-* cp311-* cp312-*\"\n        with:\n          output-dir: wheelhouse\n\n      - name: Upload wheels as artifacts\n        uses: actions/upload-artifact@v4\n        with:\n          name: wheels_arm64\n          path: ./wheelhouse/*.whl\n\n  build_sdist:\n    name: Build source distribution\n    runs-on: ubuntu-latest\n\n    steps:\n      - uses: actions/checkout@v4\n        with:\n          submodules: \"recursive\"\n\n      - uses: actions/setup-python@v5\n        with:\n          python-version: \"3.9\"\n\n      - name: Install dependencies (Linux/MacOS)\n        if: runner.os != 'Windows'\n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install uv\n          RUST_LOG=trace python -m uv pip install -e .[all] --verbose\n          python -m uv pip install build\n        shell: bash\n\n      - name: Install dependencies (Windows)\n        if: runner.os == 'Windows'\n        env:\n          RUST_LOG: trace        \n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install uv\n          python -m uv pip install -e .[all] --verbose\n          python -m uv pip install build\n        shell: cmd\n\n      - name: Build source distribution\n        run: |\n          python -m build --sdist\n\n      - uses: actions/upload-artifact@v4\n        with:\n          name: sdist\n          path: ./dist/*.tar.gz\n\n  release:\n    name: Release\n    needs: [build_wheels, build_wheels_arm64, build_sdist]\n    runs-on: ubuntu-latest\n\n    steps:\n      - uses: actions/download-artifact@v4\n        with:\n          merge-multiple: true\n          path: dist\n\n      - uses: softprops/action-gh-release@v2\n        with:\n          files: dist/*\n        env:\n          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n"
  },
  {
    "path": ".github/workflows/build-docker.yaml",
    "content": "name: Build Docker\n\non: workflow_dispatch\n\npermissions:\n  contents: write\n  packages: write\n\njobs:\n  docker:\n    name: Build and push Docker image\n    runs-on: ubuntu-22.04\n    steps:\n      - name: Checkout\n        uses: actions/checkout@v4\n        with:\n          submodules: \"recursive\"\n\n      - name: Set up QEMU\n        uses: docker/setup-qemu-action@v3\n\n      - name: Set up Docker Buildx\n        uses: docker/setup-buildx-action@v3\n\n      - name: Login to GitHub Container Registry\n        uses: docker/login-action@v3 \n        with:\n          registry: ghcr.io\n          username: ${{ github.repository_owner }}\n          password: ${{ secrets.GITHUB_TOKEN }}\n\n      - name: Build and push\n        id: docker_build\n        uses: docker/build-push-action@v6\n        with:\n          context: .\n          file: \"docker/simple/Dockerfile\"\n          push: ${{ startsWith(github.ref, 'refs/tags/') }}\n          pull: true\n          platforms: linux/amd64,linux/arm64\n          tags: |\n            ghcr.io/abetlen/llama-cpp-python:latest\n            ghcr.io/abetlen/llama-cpp-python:${{ github.ref_name }}\n          build-args: |\n            BUILDKIT_INLINE_CACHE=1\n\n      - name: Publish to GitHub Tag\n        if: steps.docker_build.outputs.digest && startsWith(github.ref, 'refs/tags/')\n        run: |\n          echo \"Docker image published for tag: ${{ github.ref_name }}\"\n"
  },
  {
    "path": ".github/workflows/build-wheels-cuda.yaml",
    "content": "name: Build Wheels (CUDA)\n\non: workflow_dispatch\n\npermissions:\n  contents: write\n\njobs:\n  define_matrix:\n    name: Define Build Matrix\n    runs-on: ubuntu-22.04\n    outputs:\n      matrix: ${{ steps.set-matrix.outputs.matrix }}\n    defaults:\n      run:\n        shell: pwsh\n\n    steps:\n      - name: Define Job Output\n        id: set-matrix\n        run: |\n          $matrix = @{\n              'os' = @('ubuntu-22.04') #, 'windows-2022')\n              'pyver' = @(\"3.9\", \"3.10\", \"3.11\", \"3.12\")\n              'cuda' = @(\"12.1.1\", \"12.2.2\", \"12.3.2\", \"12.4.1\") #, \"12.5.1\", \"12.6.1\")\n              'releasetag' = @(\"basic\")\n          }\n\n          $matrixOut = ConvertTo-Json $matrix -Compress\n          Write-Output ('matrix=' + $matrixOut) >> $env:GITHUB_OUTPUT\n\n  build_wheels:\n    name: Build Wheel ${{ matrix.os }} ${{ matrix.pyver }} ${{ matrix.cuda }} ${{ matrix.releasetag == 'wheels' && 'AVX2' || matrix.releasetag }}\n    needs: define_matrix\n    runs-on: ${{ matrix.os }}\n    strategy:\n      matrix: ${{ fromJSON(needs.define_matrix.outputs.matrix) }}\n    defaults:\n      run:\n        shell: pwsh\n    env:\n      CUDAVER: ${{ matrix.cuda }}\n      AVXVER: ${{ matrix.releasetag }}\n\n    steps:\n      - name: Add MSBuild to PATH\n        if: runner.os == 'Windows'\n        uses: microsoft/setup-msbuild@v2\n        with:\n          vs-version: '[16.11,16.12)'\n\n      - uses: actions/checkout@v4\n        with:\n          submodules: \"recursive\"\n\n      - uses: actions/setup-python@v5\n        with:\n          python-version: ${{ matrix.pyver }}\n          cache: 'pip'\n\n      - name: Setup Mamba\n        uses: conda-incubator/setup-miniconda@v3.1.0\n        with:\n          activate-environment: \"llamacpp\"\n          python-version: ${{ matrix.pyver }}\n          miniforge-version: latest\n          add-pip-as-python-dependency: true\n          auto-activate-base: false\n\n      - name: VS 
Integration Cache\n        id: vs-integration-cache\n        if: runner.os == 'Windows'\n        uses: actions/cache@v4\n        with:\n          path: ./MSBuildExtensions\n          key: cuda-${{ matrix.cuda }}-vs-integration\n\n      - name: Get Visual Studio Integration\n        if: runner.os == 'Windows' && steps.vs-integration-cache.outputs.cache-hit != 'true'\n        run: |\n          if ($env:CUDAVER -eq '12.1.1') {$x = '12.1.0'} else {$x = $env:CUDAVER}\n          $links = (Invoke-RestMethod 'https://raw.githubusercontent.com/Jimver/cuda-toolkit/master/src/links/windows-links.ts').Trim().split().where({$_ -ne ''})\n          for ($i=$q=0;$i -lt $links.count -and $q -lt 2;$i++) {if ($links[$i] -eq \"'$x',\") {$q++}}\n          Invoke-RestMethod $links[$i].Trim(\"'\") -OutFile 'cudainstaller.zip'\n          & 'C:\\Program Files\\7-Zip\\7z.exe' e cudainstaller.zip -oMSBuildExtensions -r *\\MSBuildExtensions\\* > $null\n          Remove-Item 'cudainstaller.zip'\n\n      - name: Install Visual Studio Integration\n        if: runner.os == 'Windows'\n        run: |\n          $y = (gi '.\\MSBuildExtensions').fullname + '\\*'\n          (gi 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Enterprise\\MSBuild\\Microsoft\\VC\\*\\BuildCustomizations').fullname.foreach({cp $y $_})\n          $cupath = 'CUDA_PATH_V' + $env:CUDAVER.Remove($env:CUDAVER.LastIndexOf('.')).Replace('.','_')\n          echo \"$cupath=$env:CONDA_PREFIX\" >> $env:GITHUB_ENV\n\n      - name: Install Dependencies\n        env:\n          MAMBA_DOWNLOAD_FAILFAST: \"0\"\n          MAMBA_NO_LOW_SPEED_LIMIT: \"1\"\n        run: |\n          $cudaVersion = $env:CUDAVER\n          mamba install -y 'cuda' -c nvidia/label/cuda-$cudaVersion\n          python -m pip install build wheel\n\n      - name: Build Wheel\n        run: |\n          $cudaVersion = $env:CUDAVER.Remove($env:CUDAVER.LastIndexOf('.')).Replace('.','')\n          $env:CUDA_PATH = $env:CONDA_PREFIX\n          $env:CUDA_HOME = 
$env:CONDA_PREFIX\n          $env:CUDA_TOOLKIT_ROOT_DIR = $env:CONDA_PREFIX\n          if ($IsLinux) {\n            $env:LD_LIBRARY_PATH = $env:CONDA_PREFIX + '/lib:' + $env:LD_LIBRARY_PATH\n          }\n          $env:VERBOSE = '1'\n          $env:CMAKE_ARGS = '-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all'\n          $env:CMAKE_ARGS = \"-DGGML_CUDA_FORCE_MMQ=ON $env:CMAKE_ARGS\"\n          # if ($env:AVXVER -eq 'AVX') {\n          $env:CMAKE_ARGS = $env:CMAKE_ARGS + ' -DGGML_AVX2=off -DGGML_FMA=off -DGGML_F16C=off'\n          # }\n          # if ($env:AVXVER -eq 'AVX512') {\n          #  $env:CMAKE_ARGS = $env:CMAKE_ARGS + ' -DGGML_AVX512=on'\n          # }\n          # if ($env:AVXVER -eq 'basic') {\n          #  $env:CMAKE_ARGS = $env:CMAKE_ARGS + ' -DGGML_AVX=off -DGGML_AVX2=off -DGGML_FMA=off -DGGML_F16C=off'\n          # }\n          python -m build --wheel\n          # write the build tag to the output\n          Write-Output \"CUDA_VERSION=$cudaVersion\" >> $env:GITHUB_ENV\n\n      - uses: softprops/action-gh-release@v2\n        with:\n          files: dist/*\n          # Set tag_name to <tag>-cu<cuda_version>\n          tag_name: ${{ github.ref_name }}-cu${{ env.CUDA_VERSION }}\n        env:\n          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n"
  },
  {
    "path": ".github/workflows/build-wheels-metal.yaml",
    "content": "name: Build Wheels (Metal)\n\non: workflow_dispatch\n\npermissions:\n  contents: write\n\njobs:\n  build_wheels:\n    name: Build wheels on ${{ matrix.os }}\n    runs-on: ${{ matrix.os }}\n    strategy:\n      matrix:\n        os: [macos-14, macos-15]\n\n    steps:\n      - uses: actions/checkout@v4\n        with:\n          submodules: \"recursive\"\n\n      # Used to host cibuildwheel\n      - uses: actions/setup-python@v5\n        with:\n          python-version: \"3.12\"\n          cache: 'pip'\n\n      - name: Install dependencies (Linux/MacOS)\n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install uv\n          RUST_LOG=trace python -m uv pip install -e .[all] --verbose\n        shell: bash\n\n      - name: Build wheels\n        uses: pypa/cibuildwheel@v2.22.0\n        env:\n          # disable repair\n          CIBW_REPAIR_WHEEL_COMMAND: \"\"\n          CIBW_ARCHS: \"arm64\"\n          CIBW_ENVIRONMENT: CMAKE_ARGS=\"-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DGGML_METAL=on -DCMAKE_CROSSCOMPILING=ON\"\n          CIBW_BUILD: \"cp39-* cp310-* cp311-* cp312-*\"\n        with:\n          package-dir: .\n          output-dir: wheelhouse2\n\n      - uses: actions/upload-artifact@v4\n        with:\n          name: wheels-mac_${{ matrix.os }}\n          path: ./wheelhouse2/*.whl\n\n  release:\n    name: Release\n    needs: [build_wheels]\n    runs-on: ubuntu-latest\n\n    steps:\n      - uses: actions/download-artifact@v4\n        with:\n          merge-multiple: true\n          path: dist2\n\n      - uses: softprops/action-gh-release@v2\n        with:\n          files: dist2/*\n          # set release name to <tag>-metal\n          tag_name: ${{ github.ref_name }}-metal\n        env:\n          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n"
  },
  {
    "path": ".github/workflows/generate-index-from-release.yaml",
    "content": "name: Wheels Index\n\non:\n  # Trigger on new release\n  workflow_run:\n    workflows: [\"Release\", \"Build Wheels (CUDA)\", \"Build Wheels (Metal)\"]\n    types:\n      - completed\n\n  # Allows you to run this workflow manually from the Actions tab\n  workflow_dispatch:\n\n# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages\npermissions:\n  contents: read\n  pages: write\n  id-token: write\n\n# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.\n# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.\nconcurrency:\n  group: \"pages\"\n  cancel-in-progress: false\n\njobs:\n  # Single deploy job since we're just deploying\n  deploy:\n    environment:\n      name: github-pages\n      url: ${{ steps.deployment.outputs.page_url }}\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout\n        uses: actions/checkout@v4\n      - name: Setup Pages\n        uses: actions/configure-pages@v5\n      - name: Build\n        env:\n          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n        run: |\n          ./scripts/get-releases.sh\n          ./scripts/releases-to-pep-503.sh index/whl/cpu '^[v]?[0-9]+\\.[0-9]+\\.[0-9]+$'\n          ./scripts/releases-to-pep-503.sh index/whl/cu121 '^[v]?[0-9]+\\.[0-9]+\\.[0-9]+-cu121$'\n          ./scripts/releases-to-pep-503.sh index/whl/cu122 '^[v]?[0-9]+\\.[0-9]+\\.[0-9]+-cu122$'\n          ./scripts/releases-to-pep-503.sh index/whl/cu123 '^[v]?[0-9]+\\.[0-9]+\\.[0-9]+-cu123$'\n          ./scripts/releases-to-pep-503.sh index/whl/cu124 '^[v]?[0-9]+\\.[0-9]+\\.[0-9]+-cu124$'\n          # ./scripts/releases-to-pep-503.sh index/whl/cu125 '^[v]?[0-9]+\\.[0-9]+\\.[0-9]+-cu124$'\n          # ./scripts/releases-to-pep-503.sh index/whl/cu126 '^[v]?[0-9]+\\.[0-9]+\\.[0-9]+-cu124$'\n          ./scripts/releases-to-pep-503.sh index/whl/metal '^[v]?[0-9]+\\.[0-9]+\\.[0-9]+-metal$'\n      - name: Upload 
artifact\n        uses: actions/upload-pages-artifact@v3\n        with:\n          # Upload entire repository\n          path: 'index'\n      - name: Deploy to GitHub Pages\n        id: deployment\n        uses: actions/deploy-pages@v4\n"
  },
  {
    "path": ".github/workflows/publish-to-test.yaml",
    "content": "# Based on: https://packaging.python.org/en/latest/guides/publishing-package-distribution-releases-using-github-actions-ci-cd-workflows/\n\nname: Publish to TestPyPI\n\non:\n  workflow_dispatch:\n    inputs:\n      dev_version:\n        description: 'Dev version N'\n        required: true\n\n\njobs:\n  build-n-publish:\n    name: Build and publish\n    runs-on: ubuntu-latest\n\n    steps:\n    - uses: actions/checkout@v4\n      with:\n        submodules: \"recursive\"\n        \n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: \"3.11\"\n        cache: 'pip'\n        \n    - name: Append Dev Version to __version__\n      run: |\n        DEV_VERSION=${{ github.event.inputs.dev_version }}\n        CURRENT_VERSION=$(awk -F= '/__version__ =/ {print $2}' llama_cpp/__init__.py | tr -d ' \"')\n        NEW_VERSION=\"${CURRENT_VERSION}.dev${DEV_VERSION}\"\n        sed -i 's/__version__ = \\\".*\\\"/__version__ = \\\"'\"${NEW_VERSION}\"'\\\"/' llama_cpp/__init__.py\n        \n    - name: Install dependencies (Linux/MacOS)\n      if: runner.os != 'Windows'\n      run: |\n        python -m pip install --upgrade pip\n        python -m pip install uv\n        RUST_LOG=trace python -m uv pip install -e .[all] --verbose\n      shell: bash\n\n    - name: Install dependencies (Windows)\n      if: runner.os == 'Windows'\n      env:\n        RUST_LOG: trace       \n      run: |\n        python -m pip install --upgrade pip\n        python -m pip install uv\n        python -m uv pip install -e .[all] --verbose\n      shell: cmd\n        \n    - name: Build source distribution\n      run: |\n        python -m build --sdist\n        \n    - name: Publish to Test PyPI\n      uses: pypa/gh-action-pypi-publish@release/v1\n      with:\n        password: ${{ secrets.TEST_PYPI_API_TOKEN }}\n        repository-url: https://test.pypi.org/legacy/\n"
  },
  {
    "path": ".github/workflows/publish.yaml",
    "content": "name: Publish to PyPI\n\n# Based on: https://packaging.python.org/en/latest/guides/publishing-package-distribution-releases-using-github-actions-ci-cd-workflows/\n\non: workflow_dispatch\n\njobs:\n  build-n-publish:\n    name: Build and publish\n    runs-on: ubuntu-latest\n\n    steps:\n    - uses: actions/checkout@v4\n      with:\n        submodules: \"recursive\"\n\n    - name: Set up Python\n      uses: actions/setup-python@v5\n      with:\n        python-version: \"3.9\"\n\n    - name: Install dependencies (Linux/MacOS)\n      if: runner.os != 'Windows'\n      run: |\n        python -m pip install --upgrade pip\n        python -m pip install uv\n        RUST_LOG=trace python -m uv pip install -e .[all] --verbose\n        python -m uv pip install build\n      shell: bash\n\n    - name: Install dependencies (Windows)\n      if: runner.os == 'Windows'\n      env:\n        RUST_LOG: trace\n      run: |\n        python -m pip install --upgrade pip\n        python -m pip install uv\n        python -m uv pip install -e .[all] --verbose\n        python -m uv pip install build\n      shell: cmd\n\n    - name: Build source distribution\n      run: |\n        python -m build --sdist\n\n    - name: Publish distribution to PyPI\n      # TODO: move to tag based releases\n      # if: startsWith(github.ref, 'refs/tags')\n      uses: pypa/gh-action-pypi-publish@release/v1\n      with:\n        password: ${{ secrets.PYPI_API_TOKEN }}\n"
  },
  {
    "path": ".github/workflows/test-pypi.yaml",
    "content": "name: Tests for PyPI package\n\non: workflow_dispatch\n\njobs:\n  build-linux:\n\n    runs-on: ubuntu-latest\n    strategy:\n      matrix:\n        python-version: [\"3.9\", \"3.10\", \"3.11\", \"3.12\"]\n\n    steps:\n      - name: Set up Python ${{ matrix.python-version }}\n        uses: actions/setup-python@v5\n        with:\n          python-version: ${{ matrix.python-version }}\n          cache: 'pip'\n\n      - name: Install dependencies (Linux/MacOS)\n        if: runner.os != 'Windows'\n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install uv\n          RUST_LOG=trace python -m uv pip install llama-cpp-python[all] --verbose \n        shell: bash\n\n      - name: Install dependencies (Windows)\n        if: runner.os == 'Windows'\n        env:\n          RUST_LOG: trace           \n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install uv\n          python -m uv pip install llama-cpp-python[all] --verbose \n        shell: cmd\n          \n      - name: Test with pytest\n        run: |\n          python -c \"import llama_cpp\"\n\n  build-windows:\n\n    runs-on: windows-latest\n    strategy:\n      matrix:\n        python-version: [\"3.9\", \"3.10\", \"3.11\", \"3.12\"]\n\n    steps:\n      - name: Set up Python ${{ matrix.python-version }}\n        uses: actions/setup-python@v5\n        with:\n          python-version: ${{ matrix.python-version }}\n          cache: 'pip'\n          \n      - name: Install dependencies (Linux/MacOS)\n        if: runner.os != 'Windows'\n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install uv\n          RUST_LOG=trace python -m uv pip install llama-cpp-python[all] --verbose \n        shell: bash\n\n      - name: Install dependencies (Windows)\n        if: runner.os == 'Windows'\n        env:\n          RUST_LOG: trace          \n        run: |\n          python -m pip install --upgrade pip\n   
       python -m pip install uv\n          python -m uv pip install llama-cpp-python[all] --verbose \n        shell: cmd\n          \n      - name: Test with pytest\n        run: |\n          python -c \"import llama_cpp\"\n\n  build-macos:\n\n    runs-on: macos-latest\n    strategy:\n      matrix:\n        python-version: [\"3.9\", \"3.10\", \"3.11\", \"3.12\"]\n\n    steps:\n      - name: Set up Python ${{ matrix.python-version }}\n        uses: actions/setup-python@v5\n        with:\n          python-version: ${{ matrix.python-version }}\n          cache: 'pip'   \n\n      - name: Install dependencies (Linux/MacOS)\n        if: runner.os != 'Windows'\n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install uv\n          RUST_LOG=trace python -m uv pip install llama-cpp-python[all] --verbose \n        shell: bash\n\n      - name: Install dependencies (Windows)\n        if: runner.os == 'Windows'\n        env:\n          RUST_LOG: trace  \n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install uv\n          python -m uv pip install llama-cpp-python[all] --verbose \n        shell: cmd\n          \n      - name: Test with pytest\n        run: |\n          python -c \"import llama_cpp\"\n"
  },
  {
    "path": ".github/workflows/test.yaml",
    "content": "name: Tests\non:\n  pull_request:\n    branches:\n      - main\n  push:\n    branches:\n      - main\n\nenv:\n  REPO_ID: Qwen/Qwen2-0.5B-Instruct-GGUF\n  MODEL_FILE: qwen2-0_5b-instruct-q8_0.gguf\n\njobs:\n  download-model:\n    runs-on: ubuntu-latest\n    steps:\n      - name: Set up Python\n        uses: actions/setup-python@v5\n        with:\n          python-version: \"3.9\"\n      - name: Install huggingface-hub\n        run: pip install huggingface-hub\n      - name: Download model\n        run: huggingface-cli download ${{ env.REPO_ID }} ${{ env.MODEL_FILE }}\n      - name: Cache model\n        uses: actions/cache@v4\n        with:\n          path: ~/.cache/huggingface/hub\n          key: ${{ runner.os }}-model-${{ env.REPO_ID }}-${{ env.MODEL_FILE }}\n\n  build-linux:\n    needs: download-model\n    runs-on: ubuntu-latest\n    strategy:\n      matrix:\n        python-version: [\"3.9\", \"3.10\", \"3.11\", \"3.12\"]\n    steps:\n      - uses: actions/checkout@v4\n        with:\n          submodules: \"recursive\"\n          \n      - name: Set up Python ${{ matrix.python-version }}\n        uses: actions/setup-python@v5\n        with:\n          python-version: ${{ matrix.python-version }}\n          cache: 'pip'\n      - name: Restore model cache\n        uses: actions/cache@v4\n        with:\n          path: ~/.cache/huggingface/hub\n          key: ${{ runner.os }}-model-${{ env.REPO_ID }}-${{ env.MODEL_FILE }}\n      - name: Install dependencies (Linux/MacOS)\n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install uv\n          python -m uv pip install -e .[all] --verbose\n        shell: bash\n      - name: Test with pytest\n        run: |\n          python -m pytest\n\n  build-windows:\n    needs: download-model\n    runs-on: windows-latest\n    strategy:\n      matrix:\n        python-version: [\"3.9\", \"3.10\", \"3.11\", \"3.12\"]\n    steps:\n      - uses: actions/checkout@v4\n        with:\n    
      submodules: \"recursive\"\n          \n      - name: Set up Python ${{ matrix.python-version }}\n        uses: actions/setup-python@v5\n        with:\n          python-version: ${{ matrix.python-version }}\n          cache: 'pip'\n\n      - name: Restore model cache\n        uses: actions/cache@v4\n        with:\n          path: ~/.cache/huggingface/hub\n          key: ${{ runner.os }}-model-${{ env.REPO_ID }}-${{ env.MODEL_FILE }}\n\n      - name: Install dependencies (Windows)\n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install uv\n          python -m uv pip install -e .[all] --verbose\n        shell: cmd\n          \n      - name: Test with pytest\n        run: |\n          python -m pytest\n\n  build-macos:\n    needs: download-model\n    runs-on: macos-13\n    strategy:\n      matrix:\n        python-version: [\"3.9\", \"3.10\", \"3.11\", \"3.12\"]\n    steps:\n      - uses: actions/checkout@v4\n        with:\n          submodules: \"recursive\"\n          \n      - name: Set up Python ${{ matrix.python-version }}\n        uses: actions/setup-python@v5\n        with:\n          python-version: ${{ matrix.python-version }}\n          cache: 'pip'\n\n      - name: System Info\n        run: |\n          uname -a\n          sysctl -n machdep.cpu.brand_string\n          python3 -c \"import platform; print(platform.machine(), platform.architecture())\"\n\n      - name: Restore model cache\n        uses: actions/cache@v4\n        with:\n          path: ~/.cache/huggingface/hub\n          key: ${{ runner.os }}-model-${{ env.REPO_ID }}-${{ env.MODEL_FILE }}\n          \n      - name: Install dependencies (Linux/MacOS)\n        run: |\n          python3 -m pip install --upgrade pip\n          python3 -m pip install uv\n          python3 -m uv pip install -e .[all] --verbose\n          CMAKE_ARGS=\"-DLLAMA_METAL=off\" python3 -m uv pip install .[all] --verbose\n        shell: bash\n\n      - name: Test with pytest\n       
 run: |\n          python3 -m pytest\n\n  build-macos-metal:\n    needs: download-model\n    runs-on: macos-13\n    steps:\n      - uses: actions/checkout@v4\n        with:\n          submodules: \"recursive\"\n          \n      - name: Set up Python 3.9\n        uses: actions/setup-python@v5\n        with:\n          python-version: \"3.9\"\n\n      - name: System Info\n        run: |\n          uname -a\n          sysctl -n machdep.cpu.brand_string\n          python3 -c \"import platform; print(platform.machine(), platform.architecture())\"\n\n      - name: Restore model cache\n        uses: actions/cache@v4\n        with:\n          path: ~/.cache/huggingface/hub\n          key: ${{ runner.os }}-model-${{ env.REPO_ID }}-${{ env.MODEL_FILE }}\n\n      - name: Install dependencies\n        run: |\n          python3 -m pip install --upgrade pip\n          CMAKE_ARGS=\"-DLLAMA_METAL=on\" python3 -m pip install .[all] --verbose\n        shell: bash\n\n      - name: Test with pytest\n        run: |\n          python3 -m pytest\n"
  },
  {
    "path": ".gitignore",
    "content": "*.local\n\n.python-version\n\n.vscode/\n\n_skbuild/\n\n.envrc\n.direnv\n\nmodels/\n\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\nllama_cpp/*.so\nllama_cpp/*.dylib\nllama_cpp/*.metal\nllama_cpp/*.dll\nllama_cpp/*.lib\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\ncover/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\n.pybuilder/\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n#   For a library or package, you might want to ignore these files since the code is\n#   intended to run in multiple environments; otherwise, check them in:\n# .python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# poetry\n#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.\n#   This is especially recommended for binary packages to ensure 
reproducibility, and is more\n#   commonly ignored for libraries.\n#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control\n#poetry.lock\n\n# pdm\n#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.\n#pdm.lock\n#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it\n#   in version control.\n#   https://pdm.fming.dev/#use-with-ide\n.pdm.toml\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\n# pytype static type analyzer\n.pytype/\n\n# Cython debug symbols\ncython_debug/\n\n# PyCharm\n#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can\n#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore\n#  and can be added to the global gitignore or merged into this file.  For a more nuclear\n#  option (not recommended) you can uncomment the following to ignore the entire idea folder.\n.idea/\n\n# downloaded model .bin files\ndocker/open_llama/*.bin\n"
  },
  {
    "path": ".gitmodules",
    "content": "[submodule \"vendor/llama.cpp\"]\n\tpath = vendor/llama.cpp\n\turl = https://github.com/ggerganov/llama.cpp.git\n"
  },
  {
    "path": ".readthedocs.yaml",
    "content": "# Read the Docs configuration file for MkDocs projects\n# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details\n\n# Required\nversion: 2\n\n# Set the version of Python and other tools you might need\nbuild:\n  os: ubuntu-22.04\n  tools:\n    python: \"3.11\"\n\nmkdocs:\n  configuration: mkdocs.yml\n\npython:\n  install:\n    - method: pip\n      path: .\n    - requirements: docs/requirements.txt\n\nsubmodules:\n  include: all\n  recursive: true"
  },
  {
    "path": "CHANGELOG.md",
    "content": "# Changelog\n\nAll notable changes to this project will be documented in this file.\n\nThe format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),\nand this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).\n\n## [Unreleased]\n\n## [0.3.16]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@4227c9be4268ac844921b90f31595f81236bd317\n\n## [0.3.15]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@9a96389544a08fd829fccda28142ce2066017fde\n- feat: Add gpt-oss chat format support through strftime_now in chat format by @iamlemec in af637928db7351e030011085f818b034c6efc047\n- fix: rename op_offloat to op_offload in llama.py by @sergey21000 in #2046\n\n## [0.3.14]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@79e0b68c178656bb0632cb8602d2940b755077f8\n\n## [0.3.13]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@bdca38376f7e8dd928defe01ce6a16218a64b040\n- fix: Better chat format for Qwen2.5-VL by @alcoftTAO in #2040\n\n## [0.3.12]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@a0374a67e2924f2e845cdc59dd67d9a44065a89c\n\n## [0.3.11]\n\n- fix: Update reference to `llama_kv_cache_clear` in Llama.embed. 
Closes #2037 by @abetlen in 9e5a4eaa84156084ed7bbb91e6efcc91dc6217bc\n\n## [0.3.10]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@8846aace4934ad29651ea61b8c7e3f6b0556e3d2\n- feat: Add support for llama.cpp multimodal, add Qwen2.5-VL chat handler by @abetlen in cd548bd0f14210627798237d5c2ea78acfb88ccb\n\n## [0.3.9]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@8733e0cf6eefc7c7752297cc22d0836706f4222c\n\n## [0.3.8]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@7841fc723e059d1fd9640e5c0ef19050fcc7c698\n\n## [0.3.7]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@794fe23f29fb40104975c91fe19f23798f7c726e\n- fix(ci): Fix the CUDA workflow by @oobabooga in #1894\n- fix: error showing time spent in llama perf context print, adds `no_perf` flag to `Llama` class by @shakalaca in #1898\n\n## [0.3.6]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@f7cd13301c2a88f97073fd119072b4cc92c08df1\n- fix(server): streaming resource lock by @gjpower in #1879\n\n## [0.3.5]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@26a8406ba9198eb6fdd8329fa717555b4f77f05f\n- fix(ci): Fix release by updating macos runner image to non-deprecated version by @abetlen in afedfc888462f9a6e809dc9455eb3b663764cc3f\n- fix(server): add missing await statements for async exit_stack handling by @gjpower in #1858\n\n## [0.3.4]\n\n- fix(ci): Build wheels for macos 13-15, cuda 12.1-12.4 by @abetlen in ca808028bd16b8327bd84128d48015a4b1304690\n\n## [0.3.3]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@ce8784bdb153ff7794dde5a50b0ebfa51baa6171\n- fix: chat API logprobs format by @domdomegg in #1788\n- feat: Add support for CUDA 12.6, fix CUDA 12.5 by @Smartappli in #1775\n- fix: Make content not required in ChatCompletionRequestAssistantMessage by @feloy in #1807\n- fix: Fix pickling of Llama class by setting seed from _seed member by @abetlen in 2523472c3eccb9ab9277117cc4ff705212b6888a\n- fix: Fix logit-bias type hint by @ddh0 in #1802\n- fix(server): Avoid thread starvation on many 
concurrent requests by making use of asyncio to lock llama_proxy context by @gjpower in #1798\n- fix(server): Added missing exit_stack.close() to /v1/chat/completions by @Ian321 in #1796\n- fix(examples): Refactor Batching notebook to use new sampler chain API by @lukestanley in #1793\n- fix(docs): Update development instructions by @Florents-Tselai in #1833\n- fix(docs): Remove ref to llama_eval in llama_cpp.py docs by @richdougherty in #1819\n\n## [0.3.2]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@74d73dc85cc2057446bf63cc37ff649ae7cebd80\n\n## [0.3.1]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@c919d5db39c8a7fcb64737f008e4b105ee0acd20\n- feat: Expose libggml in internal APIs by @abetlen in #1761\n- fix: Fix speculative decoding by @abetlen in 9992c5084a3df2f533e265d10f81d4269b97a1e6 and e975dabf74b3ad85689c9a07719cbb181313139b\n- misc: Rename all_text to remaining_text by @xu-song in #1658\n\n## [0.3.0]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@ea9c32be71b91b42ecc538bd902e93cbb5fb36cb\n- feat: Enable detokenizing special tokens with special=True by @benniekiss in #1596\n- feat(ci): Speed up CI workflows using uv, add support for CUDA 12.5 wheels by @Smartappli in e529940f45d42ed8aa31334123b8d66bc67b0e78\n- feat: Add loading sharded GGUF files from HuggingFace with Llama.from_pretrained(additional_files=[...]) by @Gnurro in 84c092063e8f222758dd3d60bdb2d1d342ac292e\n- feat: Add option to configure n_ubatch by @abetlen in 6c44a3f36b089239cb6396bb408116aad262c702\n- feat: Update sampling API for llama.cpp. Sampling now uses sampler chain by @abetlen in f8fcb3ea3424bcfba3a5437626a994771a02324b\n- fix: Don't store scores internally unless logits_all=True. 
Reduces memory requirements for large context by @abetlen in 29afcfdff5e75d7df4c13bad0122c98661d251ab\n- fix: Fix memory allocation of ndarray in by @xu-song in #1704\n- fix: Use system message in og qwen format by @abetlen in 98eb092d3c6e7c142c4ba2faaca6c091718abbb3\n\n\n## [0.2.90]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@1d1ccce67613674c75c9c7e3fa4c1e24e428ba48\n- feat: Add support for `MiniCPMv26ChatHandler` and `minicpm-v-26` in server by @abetlen in f70df824985d875226793b94dacc0c302a4256b2\n\n## [0.2.89]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@cfac111e2b3953cdb6b0126e67a2487687646971\n- fix: Llama.close didn't free lora adapter by @jkawamoto in #1679\n- fix: missing dependencies for test by @jkawamoto in #1680\n\n## [0.2.88]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@fc4ca27b25464a11b3b86c9dbb5b6ed6065965c2\n- fix: only print 'cache saved' in verbose mode by @lsorber in #1668 \n- fix: Added back from_file method to LlamaGrammar by @ExtReMLapin in #1673\n- fix: grammar prints on each call by @abetlen in 0998ea0deea076a547d54bd598d6b413b588ee2b\n- feat: Enable recursive search of HFFS.ls when using from_pretrained by @benHeidabetlen in #1656\n- feat: Add more detailed log for prefix-match by @xu-song in #1659\n\n## [0.2.87]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@be55695eff44784a141a863f273661a6bce63dfc\n- fix: Include all llama.cpp source files and subdirectories by @abetlen in 9cad5714ae6e7c250af8d0bbb179f631368c928b\n- feat(ci): Re-build wheel index automatically when releases are created by @abetlen in 198f47dc1bd202fd2b71b29e041a9f33fe40bfad\n\n## [0.2.86]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@398ede5efeb07b9adf9fbda7ea63f630d476a792\n- feat: Ported back new grammar changes from C++ to Python implementation by @ExtReMLapin in (#1637)\n- fix: llama_grammar_accept_token arg order by @tc-wolf in (#1649)\n\n## [0.2.85]\n\n- feat: Update llama.cpp to 
ggerganov/llama.cpp@398ede5efeb07b9adf9fbda7ea63f630d476a792\n- fix: Missing LoRA adapter after API change by @shamitv in #1630\n- fix(docker): Update Dockerfile BLAS options by @olivierdebauche in #1632\n- fix(docker): Fix GGML_CUDA param by @olivierdebauche in #1633\n- fix(docker): Update Dockerfile build options from `LLAMA_` to `GGML_` by @olivierdebauche in #1634\n- feat: FreeBSD compatibility by @yurivict in #1635\n\n## [0.2.84]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@4730faca618ff9cee0780580145e3cbe86f24876\n- fix: Correcting run.sh filepath in Simple Docker implementation by @mashuk999 in #1626\n\n## [0.2.83]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@081fe431aa8fb6307145c4feb3eed4f48cab19f8\n- feat: Add 'required' literal to ChatCompletionToolChoiceOption by @mjschock in #1597\n- fix: Change repeat_penalty to 1.0 to match llama.cpp defaults by @ddh0 in #1590\n- fix(docs): Update README.md typo by @ericcurtin in #1589\n- fix(server): Use split_mode from model settings by @grider-withourai in #1594\n- feat(ci): Dockerfile update base images and post-install cleanup by @Smartappli in #1530\n\n## [0.2.82]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@7fdb6f73e35605c8dbc39e9f19cd9ed84dbc87f2\n\n## [0.2.81]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@968967376dc2c018d29f897c4883d335bbf384fb\n- fix(ci): Fix CUDA wheels, use LLAMA_CUDA instead of removed LLAMA_CUBLAS by @abetlen in 4fb6fc12a02a68884c25dd9f6a421cacec7604c6\n- fix(ci): Fix MacOS release, use macos-12 image instead of removed macos-11 by @abetlen in 3a551eb5263fdbd24b36d7770856374c04e92788\n\n## [0.2.80]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@023b8807e10bc3ade24a255f01c1ad2a01bb4228\n- fix(server): Fix bug in FastAPI streaming response where dependency was released before request completes causing SEGFAULT by @abetlen in 296304b60bb83689659883c9cc24f4c074dd88ff\n- fix(server): Update default config value for embeddings to False to fix error in text 
generation where logits were not allocated by llama.cpp by @abetlen in bf5e0bb4b151f4ca2f5a21af68eb832a96a79d75\n- fix(ci): Fix the CUDA workflow by @oobabooga in #1551\n- docs: Update readme examples to use newer Qwen2 model by @jncraton in #1544\n\n## [0.2.79]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@9c77ec1d74874ee22bdef8f110e8e8d41389abf2\n- feat(ci): Update workflows and pre-built wheels by @Smartappli in #1416\n- feat: Add .close() method to Llama class to explicitly free model from memory by @jkawamoto in #1513\n- feat: Support SPM infill by @CISC in #1492\n\n## [0.2.78]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@fd5ea0f897ecb3659d6c269ef6f3d833e865ead7\n- fix: Avoid duplicate special tokens in chat formats by @CISC in #1439\n- fix: fix logprobs when BOS is not present by @ghorbani in #1471\n- feat: adding rpc_servers parameter to Llama class by @chraac in #1477\n\n## [0.2.77]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@bde7cd3cd949c1a85d3a199498ac98e78039d46f\n- fix: string value kv_overrides by @abetlen in df45a4b3fe46e72664bda87301b318210c6d4782\n- fix: Fix typo in Llama3VisionAlphaChatHandler by @abetlen in 165b4dc6c188f8fda2fc616154e111f710484eba\n- fix: Use numpy recarray for candidates data, fixes bug with temp < 0 by @abetlen in af3ed503e9ce60fe6b5365031abad4176a3536b3\n- fix: Disable Windows+CUDA workaround when compiling for HIPBLAS by @Engininja2 in #1493\n\n## [0.2.76]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@0df0aa8e43c3378975269a51f9b876c8692e70da\n- feat: Improve Llama.eval performance by avoiding list conversion by @thoughtp0lice in #1476\n- example: LLM inference with Ray Serve by @rgerganov in #1465\n\n## [0.2.75]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@13ad16af1231ab2d245d35df3295bcfa23de1305\n- fix: segfault for models without eos / bos tokens by @abetlen in d99a6ba607a4885fb00e63e967964aa41bdbbbcb\n- feat: add MinTokensLogitProcessor and min_tokens argument to server by @twaka in #1333\n- misc: 
Remove unnecessary metadata lookups by @CISC in #1448\n\n## [0.2.74]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@b228aba91ac2cd9eb90e9d423ba1d0d20e0117e2\n- fix: Enable CUDA backend for llava by @abetlen in 7f59856fa6f3e23f07e12fc15aeb9359dc6c3bb4\n- docs: Fix typo in README.md by @yupbank in #1444\n\n## [0.2.73]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@25c6e82e7a1ad25a42b0894e87d9b5c557409516\n- fix: Clear kv cache at beginning of image chat formats to avoid bug when image is evaluated first by @abetlen in ac55d0a175115d1e719672ce1cb1bec776c738b1\n\n## [0.2.72]\n\n- fix(security): Remote Code Execution by Server-Side Template Injection in Model Metadata by @retr0reg in b454f40a9a1787b2b5659cd2cb00819d983185df\n- fix(security): Update remaining jinja chat templates to use immutable sandbox by @CISC in #1441\n\n## [0.2.71]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@911b3900dded9a1cfe0f0e41b82c7a29baf3a217\n- fix: Make leading bos_token optional for image chat formats, fix nanollava system message by @abetlen in 77122638b4153e31d9f277b3d905c2900b536632\n- fix: free last image embed in llava chat handler by @abetlen in 3757328b703b2cd32dcbd5853271e3a8c8599fe7\n\n## [0.2.70]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@c0e6fbf8c380718102bd25fcb8d2e55f8f9480d1\n- feat: fill-in-middle support by @CISC in #1386\n- fix: adding missing args in create_completion for functionary chat handler by @skalade in #1430\n- docs: update README.md by @eltociear in #1432\n- fix: chat_format log where auto-detected format prints None by @balvisio in #1434\n- feat(server): Add support for setting root_path by @abetlen in 0318702cdc860999ee70f277425edbbfe0e60419\n- feat(ci): Add docker checks and check deps more frequently by @Smartappli in #1426\n- fix: detokenization case where first token does not start with a leading space by @noamgat in #1375\n- feat: Implement streaming for Functionary v2 + Bug fixes by @jeffrey-fong in #1419\n- fix: Use memmove to copy 
str_value kv_override by @abetlen in 9f7a85571ae80d3b6ddbd3e1bae407b9f1e3448a\n- feat(server): Remove temperature bounds checks for server by @abetlen in 0a454bebe67d12a446981eb16028c168ca5faa81\n- fix(server): Propagate flash_attn to model load by @dthuerck in #1424\n\n## [0.2.69]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@6ecf3189e00a1e8e737a78b6d10e1d7006e050a2\n- feat: Add llama-3-vision-alpha chat format by @abetlen in 31b1d95a6c19f5b615a3286069f181a415f872e8\n- fix: Change default value of verbose in image chat format handlers to True to match Llama by @abetlen in 4f01c452b6c738dc56eacac3758119b12c57ea94\n- fix: Suppress all logs when verbose=False, use hardcoded fileno's to work in colab notebooks by @abetlen in f116175a5a7c84569c88cad231855c1e6e59ff6e\n- fix: UTF-8 handling with grammars by @jsoma in #1415\n\n## [0.2.68]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@77e15bec6217a39be59b9cc83d6b9afb6b0d8167\n- feat: Add option to enable flash_attn to Llama params and ModelSettings by @abetlen in 22d77eefd2edaf0148f53374d0cac74d0e25d06e\n- fix(ci): Fix build-and-release.yaml by @Smartappli in #1413\n\n## [0.2.67]\n\n- fix: Ensure image renders before text in chat formats regardless of message content order by @abetlen in 3489ef09d3775f4a87fb7114f619e8ba9cb6b656\n- fix(ci): Fix bug in use of upload-artifact failing to merge multiple artifacts into a single release by @abetlen in d03f15bb73a1d520970357b702a9e7d4cc2a7a62\n\n## [0.2.66]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@8843a98c2ba97a25e93319a104f9ddfaf83ce4c4\n- feat: Generic Chat Formats, Tool Calling, and Huggingface Pull Support for Multimodal Models (Obsidian, LLaVA1.6, Moondream) by @abetlen in #1147\n- ci(fix): Workflow actions updates and fix arm64 wheels not included in release by @Smartappli in #1392\n- ci: Add support for pre-built cuda 12.4.1 wheels by @Smartappli in #1388\n- feat: Add support for str type kv_overrides by @abetlen in 
a411612b385cef100d76145da1fbd02a7b7cc894\n- fix: Functionary bug fixes by @jeffrey-fong in #1385\n- examples: fix quantize example by @iyubondyrev in #1387\n- ci: Update dependabot.yml by @Smartappli in #1391\n\n## [0.2.65]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@46e12c4692a37bdd31a0432fc5153d7d22bc7f72\n- feat: Allow for possibly non-pooled embeddings by @iamlemec in #1380\n\n## [0.2.64]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@4e96a812b3ce7322a29a3008db2ed73d9087b176\n- feat: Add `llama-3` chat format by @andreabak in #1371\n- feat: Use new llama_token_is_eog in create_completions by @abetlen in d40a250ef3cfaa8224d12c83776a2f1de96ae3d1\n- feat(server): Provide ability to dynamically allocate all threads if desired using -1 by @sean-bailey in #1364\n- ci: Build arm64 wheels by @gaby in 611781f5319719a3d05fefccbbf0cc321742a026\n- fix: Update scikit-build-core build dependency avoid bug in 0.9.1 by @evelkey in #1370\n\n## [0.2.63]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@0e4802b2ecbaab04b4f829fde4a3096ca19c84b5\n- feat: Add stopping_criteria to ChatFormatter, allow stopping on arbitrary token ids, fixes llama3 instruct by @abetlen in cc81afebf04d26ca1ac3cf72f23f18da6ab58588\n\n## [0.2.62]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@3b8f1ec4b18770531d0b1d792f3edf08254e4f0c\n- feat: update grammar schema converter to match llama.cpp by @themrzmaster in #1353\n- feat: add disable_ping_events flag by @khimaros in #1257\n- feat: Make saved state more compact on-disk by @tc-wolf in #1296\n- feat: Use all available CPUs for batch processing by @ddh0 in #1345\n\n## [0.2.61]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@ba5e134e073ec6837078c874aba44a702944a676\n- fix: pass correct type to chat handlers for chat completion logprobs by @abetlen in bb65b4d76411112c6fb0bf759efd746f99ef3c6b\n- feat: Add support for yaml based server configs by @abetlen in 060bfa64d529ade2af9b1f4e207a3937bbc4138f\n- feat: Add typechecking for ctypes 
structure attributes by @abetlen in 1347e1d050fc5a9a32ffe0bb3e22858da28003bd\n\n## [0.2.60]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@75cd4c77292034ecec587ecb401366f57338f7c0\n- fix: Always embed metal library by @abetlen in b3bfea6dbfb6ed9ce18f9a2723e0a9e4bd1da7ad\n- fix: missing logprobs in response, incorrect response type for functionary by @abetlen in 1ae3abbcc3af7f4a25a3ffc40b246f18039565e8\n- fix(docs): incorrect tool_choice example by @CISC in #1330\n\n## [0.2.59]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@ba0c7c70ab5b15f1f2be7fb0dfbe0366dda30d6c\n- feat: Binary wheels for CPU, CUDA (12.1 - 12.3), Metal by @abetlen, @jllllll, and @oobabooga in #1247\n- fix: segfault when logits_all=False by @abetlen in 8649d7671bd1a7c0d9cc6a5ad91c6ca286512ab3\n- fix: last tokens passing to sample_repetition_penalties function by @ymikhailov in #1295\n\n## [0.2.58]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@ba0c7c70ab5b15f1f2be7fb0dfbe0366dda30d6c\n- feat: add support for KV cache quantization options by @Limour-dev in #1307\n- feat: Add logprobs support to chat completions by @windspirit95 in #1311\n- fix: set LLAMA_METAL_EMBED_LIBRARY=on on MacOS arm64 by @bretello in #1289\n- feat: Add tools/functions variables to Jinja2ChatFormatter, add function response formatting for all simple chat formats by @CISC in #1273\n- fix: Changed local API doc references to hosted by @lawfordp2017 in #1317\n\n## [0.2.57]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@ac9ee6a4ad740bc1ee484ede43e9f92b5af244c1\n- fix: set default embedding pooling type to unspecified by @abetlen in 4084aabe867b8ec2aba1b22659e59c9318b0d1f3\n- fix: Fix and optimize functionary chat handler by @jeffrey-fong in #1282\n- fix: json mode for basic chat formats by @abetlen in 20e6815252d0efd9f015f7adbf108faaf36e3f3c\n\n## [0.2.56]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@c2101a2e909ac7c08976d414e64e96c90ee5fa9e\n- feat(server): Add endpoints for tokenize, detokenize and count 
tokens by @felipelo in #1136\n- feat: Switch embed to llama_get_embeddings_seq by @iamlemec in #1263\n- fix: Fixed json strings grammar by blacklisting character control set by @ExtReMLapin in d02a9cf16ff88ad011e2eb1ce29f4d9400f13cd1\n- fix: Check for existence of clip model path by @kejcao in #1264\n\n## [0.2.55]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@9731134296af3a6839cd682e51d9c2109a871de5\n- docs: fix small typo in README: 'model know how' -> 'model knows how' by @boegel in #1244\n\n## [0.2.54]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@cb49e0f8c906e5da49e9f6d64a57742a9a241c6a\n- docs: fix typo in README.md embeddings example by @iamlemec in #1232\n\n## [0.2.53]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@cb49e0f8c906e5da49e9f6d64a57742a9a241c6a\n- fix: eos/bos_token set correctly for Jinja2ChatFormatter and automatic chat formatter by @CISC in #1230\n\n## [0.2.52]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@a33e6a0d2a66104ea9a906bdbf8a94d050189d91\n- fix: Llava15ChatHandler (this function takes at least 4 arguments) by @abetlen in 8383a9e5620f5df5a88f62da16813eac200dd706\n\n## [0.2.51]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@c39373398803c669056304090050fe3f44b41bf9\n- fix: Restore type hints for low-level api by @abetlen in 19234aa0dbd0c3c87656e65dd2b064665371925b\n\n## [0.2.50]\n\n- docs: Update Functionary OpenAI Server Readme by @jeffrey-fong in #1193\n- fix: LlamaHFTokenizer now receives pre_tokens by @abetlen in 47bad30dd716443652275099fa3851811168ff4a\n\n## [0.2.49]\n\n- fix: module 'llama_cpp.llama_cpp' has no attribute 'c_uint8' in Llama.save_state by @abetlen in db776a885cd4c20811f22f8bd1a27ecc71dba927\n- feat: Auto detect Mixtral's slightly different format by @lukestanley in #1214\n\n## [0.2.48]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@15499eb94227401bdc8875da6eb85c15d37068f7\n- feat: Add Google's Gemma formatting via chat_format=\"gemma\" by @alvarobartt in #1210\n- feat: support 
minItems/maxItems in JSON grammar converter by @nopperl in 3921e10770996d95a9eb22c8248bacef39f69365\n- fix: Update from_pretrained defaults to match hf_hub_download and pull to local cache folder by @abetlen in e6d6260a91b7831733f7d1f73c7af46a3e8185ed\n- fix: Raise exceptions when llama model or context fails to load by @abetlen in dd22010e85265ae840c76ec835d67a29ed852722\n- docs: Update README.md to fix pip install llama cpp server by @audip in #1187\n\n## [0.2.47]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@973053d8b0d04809836b3339a50f68d9c842de90\n\n## [0.2.46]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@ba2135ccae7462470b3865c6e41d2e1d734eac05\n- feat: Pull models directly from huggingface by @abetlen in #1206\n- feat(low-level-api): Improve API static type-safety and performance. Low level api functions are positional args only now. by @abetlen in #1205\n\n## [0.2.45]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@89febfed9322c8849520dc63c93ee4f5fd72556e\n\n## [0.2.44]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@4524290e87b8e107cc2b56e1251751546f4b9051\n- fix: create_embedding broken response for input type str by @abetlen in 0ce66bc080fe537590b05b24bf442480bf2dd045\n- fix: Use '\\n' separator for EventSourceResponse by @khimaros in #1188\n- fix: Incorporate embedding pooling layer fixes by @iamlemec in #1194\n\n## [0.2.43]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@8084d554406b767d36b3250b3b787462d5dd626f\n- feat: Support batch embeddings by @iamlemec in #1186\n- fix: submodule kompute is not included in sdist by @abetlen in 7dbbfdecadebe7750be650d9409959640ff9a460\n- fix: Update openbuddy prompt format by @abetlen in 07a783779a62a4aac0b11161c7e0eb983ff215f8\n\n## [0.2.42]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@ea9c8e11436ad50719987fa23a289c74b7b40d40\n- fix: sample idx off-by-one error for logit_processors by @lapp0 in #1179\n- fix: chat formatting bugs in `chatml-function-calling` by @abetlen in 
4b0e3320bd8c2c209e29978d0b21e2e471cc9ee3 and 68fb71b6a26a1e57331868f959b47ab4b87851e1\n\n## [0.2.41]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@895407f31b358e3d9335e847d13f033491ec8a5b\n- fix: Don't change order of json schema object properties in generated grammar unless prop_order is passed by @abetlen in d1822fed6b706f38bd1ff0de4dec5baaa3cf84fa\n\n## [0.2.40]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@3bdc4cd0f595a6096cca4a64aa75ffa8a3503465\n- feat: Generic chatml Function Calling using `chat_format=\"chatml-function-calling\"` by @abetlen in #957\n- fix: Circular dependency preventing early Llama object free by @notwa in #1176\n- docs: Set the correct command for compiling with SYCL support by @akarshanbiswas in #1172\n- feat: use gpu backend for clip if available by @iamlemec in #1175\n\n## [0.2.39]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@b08f22c882a1443e6b97081f3ce718a4d1a741f8\n- fix: Fix destructor logging bugs by using llama_log_callback to avoid suppress_stdout_stderr by @abetlen in 59760c85eddc72dfcc1839f43760ef72c23d6874\n\n## [0.2.38]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@1cfb5372cf5707c8ec6dde7c874f4a44a6c4c915\n- feat: Add speculative decoding by @abetlen in #1120\n- fix: Pass raise_exception and add_generation_prompt to jinja2 chat template by @abetlen in 078cca0361bf5a94d2cf52ed04980d20e32d6f95\n\n## [0.2.37]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@fea4fd4ba7f6b754ac795387b275e1a014a77bde\n- feat: Automatically set chat format from gguf by @abetlen in #1110\n\n## [0.2.36]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@2aed77eb06a329f0d82bb1c467f4244904d4073f\n- feat: Add mistral instruct chat format as \"mistral-instruct\" by @Rafaelblsilva in #799\n\n## [0.2.35]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@d2f650cb5b04ee2726663e79b47da5efe196ce00\n\n## [0.2.34]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@6db2b41a76ee78d5efdd5c3cddd5d7ad3f646855\n- feat: Add json schema mode by 
@abetlen in #1122\n\n## [0.2.33]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@faa3526a1eba458120987ed8269e5616385a76f4\n- feat(server): include llama-cpp-python version in openapi spec by @abetlen in cde7514c3d28e6d52f272614e9957208c344dde5\n- fix: use both eos and bos tokens as stop sequences for hf-tokenizer-config chat format. by @abetlen in 5b982d0f8c6f35242c8862ffdce00e17cea0b44f\n- fix: GGUF metadata KV overrides, re #1011 by @phiharri in #1116\n- fix: llama_log_set should be able to accept null pointer by @abetlen in c970d41a85381fd55235136f123422df0bf0c7e7\n\n## [0.2.32]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@504dc37be8446fb09b1ede70300250ad41be32a2\n- fix: from_json_schema oneof/anyof bug by @jndiogo in d3f5528ca8bcb9d69d4f27e21631e911f1fb9bfe\n- fix: pass chat handler not chat formatter for huggingface autotokenizer and tokenizer_config formats by @abetlen in 24f39454e91cf5dddbc4b6041aead4accc7c7a2d\n- feat: Add add_generation_prompt option for jinja2chatformatter by @abetlen in 7f3209b1eb4ad3260ba063801fab80a8c25a2f4c\n- feat: Add Jinja2ChatFormatter by @abetlen in be09318c26add8674ce494ae7cc480cce72a4146\n- feat: Expose gguf model metadata in metadata property by @abetlen in 5a34c57e5479e50c99aba9b38218cc48e6560b81\n\n## [0.2.31]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@a5cacb22b2114fd9adf61c00cbb237384d86bced\n- fix: Mirostat sampling now passes correct type to ctypes and tracks state during generation by @abetlen in 3babe3512cb95743108f2b595210c38ed6f1b904\n- fix: Python3.8 support in server by @abetlen in 141293a75b564a8699e0acba1da24d9aa1cf0ab1\n\n## [0.2.30]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@57e2a7a52a819883f40dada8a2edc24ecf48186b\n- feat(server): Add ability to load chat format from huggingface autotokenizer or tokenizer_config.json files by @abetlen in b8fc1c7d83ad4a9207c707ba1d954fe580286a01\n- feat: Integration of Jinja2 Templating for chat formats by @teleprint-me in #875\n- fix: Offload KQV by 
default by @abetlen in 48c3b77e6f558a9899de0e1155c7dc0c7958d8e8\n- fix: Support Accept text/event-stream in chat and completion endpoints, resolves #1083 by @aniljava in #1088\n- fix(cli): allow passing n_ctx=0 to openAI API server args to use model n_ctx_train field per #1015 by @K-Mistele in #1093\n\n## [0.2.29]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@4483396751c79dea540808b9cb9238245d06da2b\n- feat: Add split_mode option by @abetlen in 84615adbc6855c8384807c42f0130f9a1763f99d\n- feat: Implement GGUF metadata KV overrides by @phiharri in #1011\n- fix: Avoid \"LookupError: unknown encoding: ascii\" when open() called in a destructor by @yieldthought in #1012\n- fix: Fix low_level_api_chat_cpp example to match current API by @aniljava in #1086\n- fix: Fix Pydantic model parsing by @DeNeutoy in #1087\n\n## [0.2.28]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@6efb8eb30e7025b168f3fda3ff83b9b386428ad6\n- feat: Add ability to pass in penalize_nl param by @shankinson in #1068\n- fix: print_grammar to stderr by @turian in #1052\n\n## [0.2.27]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@b3a7c20b5c035250257d2b62851c379b159c899a\n- feat: Add `saiga` chat format by @femoiseev in #1050\n- feat: Added `chatglm3` chat format by @xaviviro in #1059\n- fix: Correct typo in README.md by @qeleb in (#1058)\n\n## [0.2.26]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@f6793491b5af6da75edad34d6f503ef86d31b09f\n\n## [0.2.25]\n\n- feat(server): Multi model support by @D4ve-R in #931\n- feat(server): Support none defaulting to infinity for completions by @swg in #111\n- feat(server): Implement openai api compatible authentication by @docmeth2 in #1010\n- fix: text_offset of multi-token characters by @twaka in #1037\n- fix: ctypes bindings for kv override by @phiharri in #1011\n- fix: ctypes definitions of llama_kv_cache_view_update and llama_kv_cache_view_free. 
by @e-c-d in #1028\n\n## [0.2.24]\n\n- feat: Update llama.cpp to ggerganov/llama.cpp@0e18b2e7d0b5c0a509ea40098def234b8d4a938a\n- feat: Add offload_kqv option to llama and server by @abetlen in 095c65000642a3cf73055d7428232fb18b73c6f3\n- feat: n_ctx=0 now uses the n_ctx_train of the model by @DanieleMorotti in #1015\n- feat: logits_to_logprobs supports both 2-D and 3-D logits arrays by @kddubey in #1002\n- fix: Remove f16_kv, add offload_kqv fields in low level and llama apis by @brandonrobertz in #1019\n- perf: Don't convert logprobs arrays to lists by @kddubey in #1021\n- docs: Fix README.md functionary demo typo by @evelynmitchell in #996\n- examples: Update low_level_api_llama_cpp.py to match current API by @jsoma in #1023\n\n## [0.2.23]\n\n- Update llama.cpp to ggerganov/llama.cpp@948ff137ec37f1ec74c02905917fa0afc9b97514\n- Add qwen chat format by @yhfgyyf in #1005\n- Add support for running the server with SSL by @rgerganov in #994\n- Replace logits_to_logprobs implementation with numpy equivalent to llama.cpp by @player1537 in #991\n- Fix UnsupportedOperation: fileno in suppress_stdout_stderr by @zocainViken in #961\n- Add Pygmalion chat format by @chiensen in #986\n- README.md multimodal params fix by @zocainViken in #967\n- Fix minor typo in README by @aniketmaurya in #958\n\n## [0.2.22]\n\n- Update llama.cpp to ggerganov/llama.cpp@8a7b2fa528f130631a5f43648481596ab320ed5a\n- Fix conflict with transformers library by kddubey in #952\n\n## [0.2.21]\n\n- Update llama.cpp to ggerganov/llama.cpp@64e64aa2557d97490b2fe1262b313e2f4a1607e3\n- Make building llava optional by setting `CMAKE_ARGS=\"-DLLAVA_BUILD=OFF\"` and using `LLAVA_CPP_LIB` to specify alternative path to shared library by @abetlen in e3941d9c674dbd9891dc3ceda390daeb21f05fd1\n\n## [0.2.20]\n\n- Update llama.cpp to ggerganov/llama.cpp@b38a16dfcff88d547f78f52d1bea31b84a05aff7\n- Add `zephyr` chat format by @fakerybakery in #938\n- Add `baichuan` chat format by @caiyesd in #938\n- Add `baichuan-2` chat 
format by @caiyesd in #936\n- Improve documentation for server chat formats by @jooray in #934\n- Fix typo in README by @antonvice in 940\n- Fix typo in the Open Orca chat format by @gardner in #947\n\n## [0.2.19]\n\n- Update llama.cpp to ggerganov/llama.cpp@0b871f1a04ef60e114bbe43004fd9c21114e802d\n- Fix #569: stop parameter in chat completion api should accept str by @abetlen in 128dc4731fa846ead7e684a137ca57d8931b8899\n- Document server host and port parameters by @jamesbraza in #768\n- Do not set grammar to None when initializing LlamaGrammar by @mthuurne in #834\n- Add mistrallite, intel, and openchat formats by @fakerybakery in #927\n- Add support for min_p parameter by @tk-master in #921\n- Fix #929: tokenizer adding leading space when generating from empty prompt by @abetlen in a34d48014192771d2e308a76c22f33bc0318d983\n- Fix low level api example by @zocainViken in #925\n- Fix missing package in openblas docker image by @ZisisTsatsas in #920\n\n## [0.2.18]\n\n- Update llama.cpp to ggerganov/llama.cpp@6bb4908a17150b49373b5f977685b2e180a04f6f\n\n## [0.2.17]\n\n- Update llama.cpp to ggerganov/llama.cpp@df9d1293defe783f42bc83af732d3c670552c541\n- Hotfix: Set `CUDA_ARCHITECTURES=OFF` for `llava_shared` target on Windows by @abetlen in 4388f3341413110217b98c4f097ac5c590bdf40b\n\n## [0.2.16]\n\n- Update llama.cpp to ggerganov/llama.cp@a75fa576abba9d37f463580c379e4bbf1e1ad03c\n- Add `set_seed` to `Llama` class by @abetlen in fd41ed3a908761d286102a019a34c2938a15118d\n- Fix server doc arguments by @kjunggithub in #892\n- Fix response_format handler in llava chat handler by @abetlen in b62c44983921197ed10a7d29dc4ba920e9979380\n- Fix default max_tokens, chat completion is now unlimited (to context length) and completion is 16 tokens to match OpenAI defaults by @abetlen in e7962d2c733cbbeec5a37392c81f64185a9a39e8\n- Fix json_schema_to_gbnf helper so that it takes a json schema string as input instead by @abetlen in faeae181b1e868643c0dc28fcf039f077baf0829\n- Add support 
for $ref and $def in json_schema_to_gbnf to handle more complex function schemas by @abetlen in 770df344369c0630df1be14be9f9e301e7c56d24\n- Update functionary chat handler for new OpenAI api by abetlen in 1b376c62b775b401653facf25a519d116aafe99a\n- Fix add default stop sequence to chatml chat format by @abetlen in b84d76a844149216d511cfd8cdb9827148a1853c\n- Fix sampling bug when logits_all=False by @abetlen in 6f0b0b1b840af846938ed74d0e8170a91c40e617\n\n## [0.2.15]\n\n- Update llama.cpp to ggerganov/llama.cpp@0a7c980b6f94a049cb804573df2d8092a34df8e4\n- Add support for Llava1.5 multimodal models by @damian0815 and @abetlen in #821\n- Update OpenAI API compatibility to match dev day update by @abetlen in #821\n- Add seed parameter to completion and chat_completion functions of Llama class by @abetlen in 86aeb9f3a14808575d2bb0076e6acb4a30907e6a\n- Add JSON mode support to constrain chat completion to JSON objects by @abetlen in b30b9c338bf9af316d497ea501d39f5c246900db\n\n## [0.2.14]\n\n- Update llama.cpp to ggerganov/llama.cpp@f0b30ef7dc1360922ccbea0a8cd3918ecf15eaa7\n- Add support for Huggingface Autotokenizer Chat Formats by @bioshazard and @abetlen in #790 and bbffdaebaa7bb04b543dbf683a07276087251f86\n- Fix llama-2 chat format by @earonesty in #869\n- Add support for functionary chat format by @abetlen in #784\n- Migrate inference from deprecated `llama_eval`API to `llama_batch` and `llama_decode` by @abetlen in #795\n\n## [0.2.13]\n\n- Update llama.cpp to ggerganov/llama.cpp@51b2fc11f7f605fff49725a4540e9a6ef7b51b70\n- Fix name 'open' is not defined exception when deleting model by @abetlen in 011b95d7f34cbfc528af75a892757bd9a20838ab\n- Fix tokenization of special characters by @antoine-lizee in #850\n\n## [0.2.12]\n\n- Update llama.cpp to ggerganov/llama.cpp@50337961a678fce4081554b24e56e86b67660163\n- Fix missing `n_seq_id` in `llama_batch` by @NickAlgra in #842\n- Fix for shared libraries on Windows that start with `lib` prefix by @sujeendran in #848\n- Fix 
exception raised in `__del__` when freeing models by @cebtenzzre in #846\n- Performance improvement for logit bias by @zolastro in #851\n- Fix suffix check arbitrary code execution bug by @mtasic85 in #854\n- Fix typo in `function_call` parameter in `llama_types.py` by @akatora28 in #849\n- Fix streaming not returning `finish_reason` by @gmcgoldr in #798\n- Fix `n_gpu_layers` check to allow values less than 1 for server by @hxy9243 in #826\n- Supppress stdout and stderr when freeing model by @paschembri in #803\n- Fix `llama2` chat format by @delock in #808\n- Add validation for tensor_split size by @eric1932 #820\n- Print stack trace on server error by @abetlen in d6a130a052db3a50975a719088a9226abfebb266\n- Update docs for gguf by @johnccshen in #783\n- Add `chatml` chat format by @abetlen in 305482bd4156c70802fc054044119054806f4126\n\n## [0.2.11]\n\n- Fix bug in `llama_model_params` object has no attribute `logits_all` by @abetlen in d696251fbe40015e8616ea7a7d7ad5257fd1b896\n\n## [0.2.10]\n\n- Fix bug 'llama_model_params' object has no attribute 'embedding' by @abetlen in 42bb721d64d744242f9f980f2b89d5a6e335b5e4\n\n## [0.2.9]\n\n- Fix critical bug in pip installation of v0.2.8 due to `.git` directory in ac853e01e1a217a578080a4e1b851d2d08450adf\n\n## [0.2.8]\n\n- Update llama.cpp to ggerganov/llama.cpp@40e07a60f9ce06e79f3ccd4c903eba300fb31b5e\n- Add configurable chat formats by @abetlen in #711\n- Fix rope scaling bug by @Josh-XT in #767\n- Fix missing numa parameter in server by @abetlen in d9bce17794d0dd6f7962d10aad768fedecf3ab89\n\n## [0.2.7]\n\n- Update llama.cpp to ggerganov/llama.cpp@a98b1633d5a94d0aa84c7c16e1f8df5ac21fc850\n- Install required runtime dlls to package directory on windows by @abetlen in 8d75016549e2ff62a511b1119d966ffc0df5c77b\n- Add openai-processing-ms to server response header by @Tradunsky in #748\n- Bump minimum version of scikit-build-core to 0.5.1 to fix msvc cmake issue by @abetlen in 1ed0f3ebe16993a0f961155aa4b2c85f1c68f668\n- Update 
`llama_types.py` to better match the openai api, old names are aliased to new ones by @abetlen in dbca136feaaf7f8b1182c4c3c90c32918b1d0bb3\n\n## [0.2.6]\n\n- Update llama.cpp to 80291a1d02a07f7f66666fb576c5b1e75aa48b46\n\n## [0.2.5]\n\n- Fix docker images missing starlette-context dependency by @abetlen in 22917989003c5e67623d54ab45affa1e0e475410\n- Fix loading dll in Windows Isolation Containers by @abetlen in 847466562573191efa655753d9252f308c4fbdb0\n- Fix build issue on m1 macs by @abetlen in dbd3a6d1ed8416a8fd800127251e730153afa305\n- Update docs to gguf and add hw acceleration docs for server by @jasonacox in #688\n\n## [0.2.4]\n\n- Add NUMA support. **NOTE** low level api users must call llama_backend_init at the start of their programs by abetlen in f4090a0bb2a2a25acfe28d31c82cc1aa273bedee\n- Fix tensor_split server cli argument by @abetlen in c4c440ba2dc86d9de728a751311fdd1c8e3756fa\n- Made all `Llama` init parameters into keyword-only parameters by @abetlen in c8f9b8a734b5b040379bbd93995ba177affab1fe\n- Added server params for `low_vram`, `main_gpu`, `lora_base`, and `lora_path` by @abetlen in 2920c4bf7ee1412d6bba7846e0e1b7ef6d34043b\n- Removed server params for `rms_norm_eps` and `n_gqa` by @abetlen in 2920c4bf7ee1412d6bba7846e0e1b7ef6d34043b\n- Fix boolean cli options by @abetlen in c999325e8e4507f6c6249dd2fb8de7f8bf57f71e and 0449d29b9f940e437231a07b9d56550226558bac\n- Silence Pydantic Settings warnings about `model_alias` setting by @earonesty in #705\n\n## [0.2.3]\n\n- Update llama.cpp to ggerganov/llama.cpp@71ca2fad7d6c0ef95ef9944fb3a1a843e481f314\n- Add X-Request-ID request header for mirroring custom IDs by @devrimcavusoglu in #703\n- Add pyproject extra for scikit-build-core to ensure compatible pathspec version by @abetlen in 6cfc54284b99ef1bff8193e2d5e483dbd89ada02\n- Fix issue with Literal and Optional cli arguments not working by @abetlen in #702\n\n## [0.2.2]\n\n- Fix bug in pip install of v0.2.1 due to scikit-build-core removing all `.metal` 
files in the source distribution (see #701)\n\n## [0.2.1]\n\n- Fix bug in pip install of v0.2.0 due to .git folder being included in the source distribution (see #701)\n\n## [0.2.0]\n\n- Migrated to scikit-build-core build system by @abetlen in #499\n- Use `numpy` views for `LogitsProcessor` and `StoppingCriteria` instead of python lists by @abetlen in #499\n- Drop support for end-of-life Python3.7 by @abetlen in #499\n- Convert low level `llama.cpp` constants to use basic python types instead of `ctypes` types by @abetlen in #499\n\n## [0.1.85]\n\n- Add `llama_cpp.__version__` attribute by @janvdp in #684\n- Fix low level api examples by @jbochi in #680\n\n## [0.1.84]\n\n- Update llama.cpp\n\n## [0.1.83]\n\n- Update llama.cpp\n\n## [0.1.82]\n\n- Update llama.cpp\n\n## [0.1.81]\n\n- Update llama.cpp\n\n## [0.1.80]\n\n- Update llama.cpp\n\n## [0.1.79]\n\n- GGUF Support (breaking change requiring new model format)\n\n## [0.1.78]\n\n- Grammar based sampling via LlamaGrammar which can be passed to completions\n- Make n_gpu_layers == -1 offload all layers\n\n## [0.1.77]\n\n- (llama.cpp) Update llama.cpp add support for LLaMa 2 70B\n- (server) Add temporary n_gqa and rms_norm_eps parameters required for LLaMa 2 70B\n\n## [0.1.76]\n\n- (llama.cpp) Update llama.cpp add support for LLaMa 2 70B\n\n## [0.1.75]\n\n- Update llama.cpp\n\n## [0.1.74]\n\n- (server) OpenAI style error responses\n\n## [0.1.73]\n\n- (server) Add rope parameters to server settings\n\n## [0.1.72]\n\n- (llama.cpp) Update llama.cpp added custom_rope for extended context lengths\n\n## [0.1.71]\n\n- (llama.cpp) Update llama.cpp\n\n- (server) Fix several pydantic v2 migration bugs\n\n## [0.1.70]\n\n- (Llama.create_completion) Revert change so that `max_tokens` is not truncated to `context_size` in `create_completion`\n- (server) Fixed changed settings field names from pydantic v2 migration\n\n## [0.1.69]\n\n- (server) Streaming requests can now be interrupted prematurely when a concurrent request is made. 
Can be controlled with the `interrupt_requests` setting.\n- (server) Moved to fastapi v0.100.0 and pydantic v2\n- (docker) Added a new \"simple\" image that builds llama.cpp from source when started.\n- (server) performance improvements by avoiding unnecessary memory allocations during sampling\n\n## [0.1.68]\n\n- (llama.cpp) Update llama.cpp\n\n## [0.1.67]\n\n- Fix performance bug in Llama model by pre-allocating memory tokens and logits.\n- Fix bug in Llama model where the model was not free'd after use.\n\n## [0.1.66]\n\n- (llama.cpp) New model API\n\n- Performance issue during eval caused by looped np.concatenate call\n- State pickling issue when saving cache to disk\n\n## [0.1.65]\n\n- (llama.cpp) Fix struct misalignment bug\n\n## [0.1.64]\n\n- (llama.cpp) Update llama.cpp\n- Fix docs for seed. Set -1 for random.\n\n## [0.1.63]\n\n- (llama.cpp) Add full gpu utilisation in CUDA\n- (llama.cpp) Add get_vocab\n- (llama.cpp) Add low_vram parameter\n- (server) Add logit_bias parameter\n\n## [0.1.62]\n\n- Metal support working\n- Cache re-enabled\n\n## [0.1.61]\n\n- Fix broken pip installation\n\n## [0.1.60]\n\nNOTE: This release was deleted due to a bug with the packaging system that caused pip installations to fail.\n\n- Truncate max_tokens in create_completion so requested tokens doesn't exceed context size.\n- Temporarily disable cache for completion requests\n\n## [v0.1.59]\n\n- (llama.cpp) k-quants support\n- (server) mirostat sampling parameters to server\n- Support both `.so` and `.dylib` for `libllama` on MacOS\n\n## [v0.1.58]\n\n- (llama.cpp) Metal Silicon support\n\n## [v0.1.57]\n\n- (llama.cpp) OpenLlama 3B support\n\n## [v0.1.56]\n\n- (misc) Added first version of the changelog\n- (server) Use async routes\n- (python-api) Use numpy for internal buffers to reduce memory usage and improve performance.\n- (python-api) Performance bug in stop sequence check slowing down streaming.\n"
  },
  {
    "path": "CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.21)\n\nproject(llama_cpp)\n\noption(LLAMA_BUILD \"Build llama.cpp shared library and install alongside python package\" ON)\noption(LLAVA_BUILD \"Build llava shared library and install alongside python package\" ON)\n\nfunction(llama_cpp_python_install_target target)\n    if(NOT TARGET ${target})\n        return()\n    endif()\n\n    install(\n        TARGETS ${target}\n        LIBRARY DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib\n        RUNTIME DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib\n        ARCHIVE DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib\n        FRAMEWORK DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib\n        RESOURCE DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib\n    )\n    install(\n        TARGETS ${target}\n        LIBRARY DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib\n        RUNTIME DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib\n        ARCHIVE DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib\n        FRAMEWORK DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib\n        RESOURCE DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib\n    )\n    set_target_properties(${target} PROPERTIES\n        INSTALL_RPATH \"$ORIGIN\"\n        BUILD_WITH_INSTALL_RPATH TRUE\n    )\n    if(UNIX)\n        if(APPLE)\n            set_target_properties(${target} PROPERTIES\n                INSTALL_RPATH \"@loader_path\"\n                BUILD_WITH_INSTALL_RPATH TRUE\n            )\n        else()\n            set_target_properties(${target} PROPERTIES\n                INSTALL_RPATH \"$ORIGIN\"\n                BUILD_WITH_INSTALL_RPATH TRUE\n            )\n        endif()\n    endif()\nendfunction()\n\nif (LLAMA_BUILD)\n    set(BUILD_SHARED_LIBS \"On\")\n\n    set(CMAKE_SKIP_BUILD_RPATH FALSE)\n\n    # When building, don't use the install RPATH already\n    # (but later on when installing)\n    set(CMAKE_BUILD_WITH_INSTALL_RPATH FALSE)\n \n    # Add the automatically 
determined parts of the RPATH\n    # which point to directories outside the build tree to the install RPATH\n    set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)\n    set(CMAKE_SKIP_RPATH FALSE)\n\n    # Enable building of the common library\n    set(LLAMA_BUILD_COMMON ON CACHE BOOL \"Build llama.cpp common library\" FORCE)\n\n    # Disable building curl support\n    set(LLAMA_CURL OFF CACHE BOOL \"llama.cpp: enable curl\" FORCE)\n\n    # Architecture detection and settings for Apple platforms\n    if (APPLE)\n        # Get the target architecture\n        execute_process(\n            COMMAND uname -m\n            OUTPUT_VARIABLE HOST_ARCH\n            OUTPUT_STRIP_TRAILING_WHITESPACE\n        )\n\n        # If CMAKE_OSX_ARCHITECTURES is not set, use the host architecture\n        if(NOT CMAKE_OSX_ARCHITECTURES)\n            set(CMAKE_OSX_ARCHITECTURES ${HOST_ARCH} CACHE STRING \"Build architecture for macOS\" FORCE)\n        endif()\n\n        message(STATUS \"Host architecture: ${HOST_ARCH}\")\n        message(STATUS \"Target architecture: ${CMAKE_OSX_ARCHITECTURES}\")\n\n        # Configure based on target architecture\n        if(CMAKE_OSX_ARCHITECTURES STREQUAL \"x86_64\")\n            # Intel Mac settings\n            set(GGML_AVX \"OFF\" CACHE BOOL \"ggml: enable AVX\" FORCE)\n            set(GGML_AVX2 \"OFF\" CACHE BOOL \"ggml: enable AVX2\" FORCE)\n            set(GGML_FMA \"OFF\" CACHE BOOL \"ggml: enable FMA\" FORCE)\n            set(GGML_F16C \"OFF\" CACHE BOOL \"ggml: enable F16C\" FORCE)\n        endif()\n\n        # Metal settings (enable for both architectures)\n        set(GGML_METAL \"ON\" CACHE BOOL \"ggml: enable Metal\" FORCE)\n        set(GGML_METAL_EMBED_LIBRARY \"ON\" CACHE BOOL \"ggml: embed metal library\" FORCE)\n    endif()\n\n\n    add_subdirectory(vendor/llama.cpp)\n\n    if (WIN32)\n        if (TARGET llama)\n            set_target_properties(llama PROPERTIES WINDOWS_EXPORT_ALL_SYMBOLS ON)\n        endif()\n    endif()\n\n    
llama_cpp_python_install_target(llama)\n    llama_cpp_python_install_target(ggml)\n\n    llama_cpp_python_install_target(ggml-base)\n\n    llama_cpp_python_install_target(ggml-amx)\n    llama_cpp_python_install_target(ggml-blas)\n    llama_cpp_python_install_target(ggml-can)\n    llama_cpp_python_install_target(ggml-cpu)\n    llama_cpp_python_install_target(ggml-cuda)\n    llama_cpp_python_install_target(ggml-hip)\n    llama_cpp_python_install_target(ggml-kompute)\n    llama_cpp_python_install_target(ggml-metal)\n    llama_cpp_python_install_target(ggml-musa)\n    llama_cpp_python_install_target(ggml-rpc)\n    llama_cpp_python_install_target(ggml-sycl)\n    llama_cpp_python_install_target(ggml-vulkan)\n\n    # Workaround for Windows + CUDA https://github.com/abetlen/llama-cpp-python/issues/563\n    if (WIN32)\n        install(\n            FILES $<TARGET_RUNTIME_DLLS:llama>\n            DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib\n        )\n        install(\n            FILES $<TARGET_RUNTIME_DLLS:llama>\n            DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib\n        )\n        install(\n            FILES $<TARGET_RUNTIME_DLLS:ggml>\n            DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib\n        )\n        install(\n            FILES $<TARGET_RUNTIME_DLLS:ggml>\n            DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib\n        )\n    endif()\n\n    if (LLAVA_BUILD)\n        if (LLAMA_CUBLAS OR LLAMA_CUDA)\n            add_compile_definitions(GGML_USE_CUBLAS)\n            add_compile_definitions(GGML_USE_CUDA)\n        endif()\n\n        if (LLAMA_METAL)\n            add_compile_definitions(GGML_USE_METAL)\n        endif()\n\n        # Building llava\n        add_subdirectory(vendor/llama.cpp/tools/mtmd)\n\n        if (WIN32)\n            set_target_properties(mtmd PROPERTIES CUDA_ARCHITECTURES OFF)\n        endif()\n        llama_cpp_python_install_target(mtmd)\n        if (WIN32)\n            install(\n                FILES 
$<TARGET_RUNTIME_DLLS:mtmd>\n                DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/llama_cpp/lib\n            )\n            install(\n                FILES $<TARGET_RUNTIME_DLLS:mtmd>\n                DESTINATION ${SKBUILD_PLATLIB_DIR}/llama_cpp/lib\n            )\n        endif()\n\n        # Fix for mtmd build: Add include directory for llama.h\n        # Move these commands after the add_subdirectory call\n        target_include_directories(mtmd PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/vendor/llama.cpp/include)\n        target_include_directories(mtmd PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/vendor/llama.cpp/ggml/include)\n\n        if (BUILD_SHARED_LIBS)\n            target_include_directories(mtmd PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/vendor/llama.cpp/include)\n            target_include_directories(mtmd PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/vendor/llama.cpp/ggml/include)\n        endif()\n\n        # target_include_directories(llama-llava-cli PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/vendor/llama.cpp/include)\n        # target_include_directories(llama-minicpmv-cli PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/vendor/llama.cpp/include)\n    endif()\nendif()\n"
  },
  {
    "path": "LICENSE.md",
    "content": "MIT License\n\nCopyright (c) 2023 Andrei Betlen\n\nPermission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE."
  },
  {
    "path": "Makefile",
    "content": "update:\n\tpoetry install\n\tgit submodule update --init --recursive\n\nupdate.vendor:\n\tcd vendor/llama.cpp && git pull origin master\n\ndeps:\n\tpython3 -m pip install --upgrade pip\n\tpython3 -m pip install -e \".[all]\"\n\nbuild:\n\tpython3 -m pip install --verbose -e .\n\nbuild.debug:\n\tpython3 -m pip install \\\n\t\t--verbose \\\n\t\t--config-settings=cmake.verbose=true \\\n\t\t--config-settings=logging.level=INFO \\\n\t\t--config-settings=install.strip=false  \\\n\t\t--config-settings=cmake.args=\"-DCMAKE_BUILD_TYPE=Debug;-DCMAKE_C_FLAGS='-ggdb -O0';-DCMAKE_CXX_FLAGS='-ggdb -O0'\" \\\n\t\t--editable .\n\nbuild.debug.extra:\n\tpython3 -m pip install \\\n\t\t--verbose \\\n\t\t--config-settings=cmake.verbose=true \\\n\t\t--config-settings=logging.level=INFO \\\n\t\t--config-settings=install.strip=false  \\\n\t\t--config-settings=cmake.args=\"-DCMAKE_BUILD_TYPE=Debug;-DCMAKE_C_FLAGS='-fsanitize=address -ggdb -O0';-DCMAKE_CXX_FLAGS='-fsanitize=address -ggdb -O0'\" \\\n\t\t--editable .\n\nbuild.cuda:\n\tCMAKE_ARGS=\"-DGGML_CUDA=on\" python3 -m pip install --verbose -e .\n\nbuild.openblas:\n\tCMAKE_ARGS=\"-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS\" python3 -m pip install --verbose -e .\n\nbuild.blis:\n\tCMAKE_ARGS=\"-DGGML_BLAS=on -DGGML_BLAS_VENDOR=FLAME\" python3 -m pip install --verbose -e .\n\nbuild.metal:\n\tCMAKE_ARGS=\"-DGGML_METAL=on\" python3 -m pip install --verbose -e .\n\nbuild.vulkan:\n\tCMAKE_ARGS=\"-DGGML_VULKAN=on\" python3 -m pip install --verbose -e .\n\nbuild.kompute:\n\tCMAKE_ARGS=\"-DGGML_KOMPUTE=on\" python3 -m pip install --verbose -e .\n\nbuild.sycl:\n\tCMAKE_ARGS=\"-DGGML_SYCL=on\" python3 -m pip install --verbose -e .\n\nbuild.rpc:\n\tCMAKE_ARGS=\"-DGGML_RPC=on\" python3 -m pip install --verbose -e .\n\nbuild.sdist:\n\tpython3 -m build --sdist --verbose\n\ndeploy.pypi:\n\tpython3 -m twine upload dist/*\n\ndeploy.gh-docs:\n\tmkdocs build\n\tmkdocs gh-deploy\n\ntest:\n\tpython3 -m pytest --full-trace 
-v\n\ndocker:\n\tdocker build -t llama-cpp-python:latest -f docker/simple/Dockerfile .\n\nrun-server:\n\tpython3 -m llama_cpp.server --model ${MODEL}\n\nclean:\n\t- cd vendor/llama.cpp && make clean\n\t- cd vendor/llama.cpp && rm libllama.so\n\t- rm -rf _skbuild\n\t- rm llama_cpp/lib/*.so\n\t- rm llama_cpp/lib/*.dylib\n\t- rm llama_cpp/lib/*.metal\n\t- rm llama_cpp/lib/*.dll\n\t- rm llama_cpp/lib/*.lib\n\n.PHONY: \\\n\tupdate \\\n\tupdate.vendor \\\n\tbuild \\\n\tbuild.cuda \\\n\tbuild.opencl \\\n\tbuild.openblas \\\n\tbuild.sdist \\\n\tdeploy.pypi \\\n\tdeploy.gh-docs \\\n\tdocker \\\n\tclean\n"
  },
  {
    "path": "README.md",
    "content": "<p align=\"center\">\n  <img src=\"https://raw.githubusercontent.com/abetlen/llama-cpp-python/main/docs/icon.svg\" style=\"height: 5rem; width: 5rem\">\n</p>\n\n#  Python Bindings for [`llama.cpp`](https://github.com/ggerganov/llama.cpp)\n\n[![Documentation Status](https://readthedocs.org/projects/llama-cpp-python/badge/?version=latest)](https://llama-cpp-python.readthedocs.io/en/latest/?badge=latest)\n[![Tests](https://github.com/abetlen/llama-cpp-python/actions/workflows/test.yaml/badge.svg?branch=main)](https://github.com/abetlen/llama-cpp-python/actions/workflows/test.yaml)\n[![PyPI](https://img.shields.io/pypi/v/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)\n[![PyPI - License](https://img.shields.io/pypi/l/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)\n[![PyPI - Downloads](https://static.pepy.tech/badge/llama-cpp-python/month)](https://pepy.tech/projects/llama-cpp-python)\n[![Github All Releases](https://img.shields.io/github/downloads/abetlen/llama-cpp-python/total.svg?label=Github%20Downloads)]()\n\nSimple Python bindings for **@ggerganov's** [`llama.cpp`](https://github.com/ggerganov/llama.cpp) library.\nThis package provides:\n\n- Low-level access to C API via `ctypes` interface.\n- High-level Python API for text completion\n    - OpenAI-like API\n    - [LangChain compatibility](https://python.langchain.com/docs/integrations/llms/llamacpp)\n    - [LlamaIndex compatibility](https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html)\n- OpenAI compatible web server\n    - [Local Copilot replacement](https://llama-cpp-python.readthedocs.io/en/latest/server/#code-completion)\n    - [Function Calling support](https://llama-cpp-python.readthedocs.io/en/latest/server/#function-calling)\n    - [Vision API 
support](https://llama-cpp-python.readthedocs.io/en/latest/server/#multimodal-models)\n    - [Multiple Models](https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support)\n\nDocumentation is available at [https://llama-cpp-python.readthedocs.io/en/latest](https://llama-cpp-python.readthedocs.io/en/latest).\n\n## Installation\n\nRequirements:\n\n  - Python 3.8+\n  - C compiler\n      - Linux: gcc or clang\n      - Windows: Visual Studio or MinGW\n      - MacOS: Xcode\n\nTo install the package, run:\n\n```bash\npip install llama-cpp-python\n```\n\nThis will also build `llama.cpp` from source and install it alongside this python package.\n\nIf this fails, add `--verbose` to the `pip install` command to see the full cmake build log.\n\n**Pre-built Wheel (New)**\n\nIt is also possible to install a pre-built wheel with basic CPU support.\n\n```bash\npip install llama-cpp-python \\\n  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu\n```\n\n### Installation Configuration\n\n`llama.cpp` supports a number of hardware acceleration backends to speed up inference as well as backend specific options. 
See the [llama.cpp README](https://github.com/ggerganov/llama.cpp#build) for a full list.\n\nAll `llama.cpp` cmake build options can be set via the `CMAKE_ARGS` environment variable or via the `--config-settings / -C` cli flag during installation.\n\n<details open>\n<summary>Environment Variables</summary>\n\n```bash\n# Linux and Mac\nCMAKE_ARGS=\"-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS\" \\\n  pip install llama-cpp-python\n```\n\n```powershell\n# Windows\n$env:CMAKE_ARGS = \"-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS\"\npip install llama-cpp-python\n```\n</details>\n\n<details>\n<summary>CLI / requirements.txt</summary>\n\nThey can also be set via `pip install -C / --config-settings` command and saved to a `requirements.txt` file:\n\n```bash\npip install --upgrade pip # ensure pip is up to date\npip install llama-cpp-python \\\n  -C cmake.args=\"-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS\"\n```\n\n```txt\n# requirements.txt\n\nllama-cpp-python -C cmake.args=\"-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS\"\n```\n\n</details>\n\n### Supported Backends\n\nBelow are some common backends, their build commands and any additional environment variables required.\n\n<details open>\n<summary>OpenBLAS (CPU)</summary>\n\nTo install with OpenBLAS, set the `GGML_BLAS` and `GGML_BLAS_VENDOR` environment variables before installing:\n\n```bash\nCMAKE_ARGS=\"-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS\" pip install llama-cpp-python\n```\n</details>\n\n<details>\n<summary>CUDA</summary>\n\nTo install with CUDA support, set the `GGML_CUDA=on` environment variable before installing:\n\n```bash\nCMAKE_ARGS=\"-DGGML_CUDA=on\" pip install llama-cpp-python\n```\n\n**Pre-built Wheel (New)**\n\nIt is also possible to install a pre-built wheel with CUDA support. 
As long as your system meets some requirements:\n\n- CUDA Version is 12.1, 12.2, 12.3, 12.4 or 12.5\n- Python Version is 3.10, 3.11 or 3.12\n\n```bash\npip install llama-cpp-python \\\n  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>\n```\n\nWhere `<cuda-version>` is one of the following:\n- `cu121`: CUDA 12.1\n- `cu122`: CUDA 12.2\n- `cu123`: CUDA 12.3\n- `cu124`: CUDA 12.4\n- `cu125`: CUDA 12.5\n\nFor example, to install the CUDA 12.1 wheel:\n\n```bash\npip install llama-cpp-python \\\n  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121\n```\n\n</details>\n\n<details>\n<summary>Metal</summary>\n\nTo install with Metal (MPS), set the `GGML_METAL=on` environment variable before installing:\n\n```bash\nCMAKE_ARGS=\"-DGGML_METAL=on\" pip install llama-cpp-python\n```\n\n**Pre-built Wheel (New)**\n\nIt is also possible to install a pre-built wheel with Metal support. As long as your system meets some requirements:\n\n- MacOS Version is 11.0 or later\n- Python Version is 3.10, 3.11 or 3.12\n\n```bash\npip install llama-cpp-python \\\n  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal\n```\n\n</details>\n\n<details>\n<summary>hipBLAS (ROCm)</summary>\n\nTo install with hipBLAS / ROCm support for AMD cards, set the `GGML_HIPBLAS=on` environment variable before installing:\n\n```bash\nCMAKE_ARGS=\"-DGGML_HIPBLAS=on\" pip install llama-cpp-python\n```\n\n</details>\n\n<details>\n<summary>Vulkan</summary>\n\nTo install with Vulkan support, set the `GGML_VULKAN=on` environment variable before installing:\n\n```bash\nCMAKE_ARGS=\"-DGGML_VULKAN=on\" pip install llama-cpp-python\n```\n\n</details>\n\n<details>\n<summary>SYCL</summary>\n\nTo install with SYCL support, set the `GGML_SYCL=on` environment variable before installing:\n\n```bash\nsource /opt/intel/oneapi/setvars.sh   \nCMAKE_ARGS=\"-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx\" pip install 
llama-cpp-python\n```\n</details>\n\n<details>\n<summary>RPC</summary>\n\nTo install with RPC support, set the `GGML_RPC=on` environment variable before installing:\n\n```bash\nsource /opt/intel/oneapi/setvars.sh   \nCMAKE_ARGS=\"-DGGML_RPC=on\" pip install llama-cpp-python\n```\n</details>\n\n\n### Windows Notes\n\n<details>\n<summary>Error: Can't find 'nmake' or 'CMAKE_C_COMPILER'</summary>\n\nIf you run into issues where it complains it can't find `'nmake'` `'?'` or CMAKE_C_COMPILER, you can extract w64devkit as [mentioned in llama.cpp repo](https://github.com/ggerganov/llama.cpp#openblas) and add those manually to CMAKE_ARGS before running `pip` install:\n\n```ps\n$env:CMAKE_GENERATOR = \"MinGW Makefiles\"\n$env:CMAKE_ARGS = \"-DGGML_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe\"\n```\n\nSee the above instructions and set `CMAKE_ARGS` to the BLAS backend you want to use.\n</details>\n\n### MacOS Notes\n\nDetailed MacOS Metal GPU install documentation is available at [docs/install/macos.md](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/)\n\n<details>\n<summary>M1 Mac Performance Issue</summary>\n\nNote: If you are using Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports arm64 architecture. 
For example:\n\n```bash\nwget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh\nbash Miniforge3-MacOSX-arm64.sh\n```\n\nOtherwise, the installer will build the x86 version of llama.cpp, which will be 10x slower on an Apple Silicon (M1) Mac.\n</details>\n\n<details>\n<summary>M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`</summary>\n\nTry installing with\n\n```bash\nCMAKE_ARGS=\"-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DGGML_METAL=on\" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python\n```\n</details>\n\n### Upgrading and Reinstalling\n\nTo upgrade and rebuild `llama-cpp-python`, add the `--upgrade --force-reinstall --no-cache-dir` flags to the `pip install` command to ensure the package is rebuilt from source.\n\n## High-level API\n\n[API Reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#high-level-api)\n\nThe high-level API provides a simple managed interface through the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.\n\nBelow is a short example demonstrating how to use the high-level API for basic text completion:\n\n```python\nfrom llama_cpp import Llama\n\nllm = Llama(\n      model_path=\"./models/7B/llama-model.gguf\",\n      # n_gpu_layers=-1, # Uncomment to use GPU acceleration\n      # seed=1337, # Uncomment to set a specific seed\n      # n_ctx=2048, # Uncomment to increase the context window\n)\noutput = llm(\n      \"Q: Name the planets in the solar system? 
A: \", # Prompt\n      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window\n      stop=[\"Q:\", \"\\n\"], # Stop generating just before the model would generate a new question\n      echo=True # Echo the prompt back in the output\n) # Generate a completion, can also call create_completion\nprint(output)\n```\n\nBy default `llama-cpp-python` generates completions in an OpenAI compatible format:\n\n```python\n{\n  \"id\": \"cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\",\n  \"object\": \"text_completion\",\n  \"created\": 1679561337,\n  \"model\": \"./models/7B/llama-model.gguf\",\n  \"choices\": [\n    {\n      \"text\": \"Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.\",\n      \"index\": 0,\n      \"logprobs\": None,\n      \"finish_reason\": \"stop\"\n    }\n  ],\n  \"usage\": {\n    \"prompt_tokens\": 14,\n    \"completion_tokens\": 28,\n    \"total_tokens\": 42\n  }\n}\n```\n\nText completion is available through the [`__call__`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__call__) and [`create_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) methods of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.\n\n### Pulling models from Hugging Face Hub\n\nYou can download `Llama` models in `gguf` format directly from Hugging Face using the [`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) method.\nYou'll need to install the `huggingface-hub` package to use this feature (`pip install huggingface-hub`).\n\n```python\nllm = Llama.from_pretrained(\n    repo_id=\"Qwen/Qwen2-0.5B-Instruct-GGUF\",\n    filename=\"*q8_0.gguf\",\n    verbose=False\n)\n```\n\nBy default 
[`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) will download the model to the Hugging Face cache directory; you can then manage installed model files with the [`huggingface-cli`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) tool.\n\n### Chat Completion\n\nThe high-level API also provides a simple interface for chat completion.\n\nChat completion requires that the model knows how to format the messages into a single prompt.\nThe `Llama` class does this using pre-registered chat formats (e.g. `chatml`, `llama-2`, `gemma`, etc.) or by providing a custom chat handler object.\n\nThe messages will be formatted into a single prompt using the following order of precedence:\n  - Use the `chat_handler` if provided\n  - Use the `chat_format` if provided\n  - Use the `tokenizer.chat_template` from the `gguf` model's metadata (should work for most new models, older models may not have this)\n  - Otherwise, fall back to the `llama-2` chat format\n\nSet `verbose=True` to see the selected chat format.\n\n```python\nfrom llama_cpp import Llama\nllm = Llama(\n      model_path=\"path/to/llama-2/llama-model.gguf\",\n      chat_format=\"llama-2\"\n)\nllm.create_chat_completion(\n      messages = [\n          {\"role\": \"system\", \"content\": \"You are an assistant who perfectly describes images.\"},\n          {\n              \"role\": \"user\",\n              \"content\": \"Describe this image in detail please.\"\n          }\n      ]\n)\n```\n\nChat completion is available through the [`create_chat_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) method of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.\n\nFor OpenAI API v1 compatibility, you can use the 
[`create_chat_completion_openai_v1`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion_openai_v1) method which will return pydantic models instead of dicts.\n\n\n### JSON and JSON Schema Mode\n\nTo constrain chat responses to only valid JSON or a specific JSON Schema use the `response_format` argument in [`create_chat_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion).\n\n#### JSON Mode\n\nThe following example will constrain the response to valid JSON strings only.\n\n```python\nfrom llama_cpp import Llama\nllm = Llama(model_path=\"path/to/model.gguf\", chat_format=\"chatml\")\nllm.create_chat_completion(\n    messages=[\n        {\n            \"role\": \"system\",\n            \"content\": \"You are a helpful assistant that outputs in JSON.\",\n        },\n        {\"role\": \"user\", \"content\": \"Who won the world series in 2020\"},\n    ],\n    response_format={\n        \"type\": \"json_object\",\n    },\n    temperature=0.7,\n)\n```\n\n#### JSON Schema Mode\n\nTo constrain the response further to a specific JSON Schema add the schema to the `schema` property of the `response_format` argument.\n\n```python\nfrom llama_cpp import Llama\nllm = Llama(model_path=\"path/to/model.gguf\", chat_format=\"chatml\")\nllm.create_chat_completion(\n    messages=[\n        {\n            \"role\": \"system\",\n            \"content\": \"You are a helpful assistant that outputs in JSON.\",\n        },\n        {\"role\": \"user\", \"content\": \"Who won the world series in 2020\"},\n    ],\n    response_format={\n        \"type\": \"json_object\",\n        \"schema\": {\n            \"type\": \"object\",\n            \"properties\": {\"team_name\": {\"type\": \"string\"}},\n            \"required\": [\"team_name\"],\n        },\n    },\n    temperature=0.7,\n)\n```\n\n### Function Calling\n\nThe high-level API supports OpenAI compatible function and tool 
calling. This is possible through the chat format of the pre-trained `functionary` models or through the generic `chatml-function-calling` chat format.\n\n```python\nfrom llama_cpp import Llama\nllm = Llama(model_path=\"path/to/chatml/llama-model.gguf\", chat_format=\"chatml-function-calling\")\nllm.create_chat_completion(\n      messages = [\n        {\n          \"role\": \"system\",\n          \"content\": \"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary\"\n\n        },\n        {\n          \"role\": \"user\",\n          \"content\": \"Extract Jason is 25 years old\"\n        }\n      ],\n      tools=[{\n        \"type\": \"function\",\n        \"function\": {\n          \"name\": \"UserDetail\",\n          \"parameters\": {\n            \"type\": \"object\",\n            \"title\": \"UserDetail\",\n            \"properties\": {\n              \"name\": {\n                \"title\": \"Name\",\n                \"type\": \"string\"\n              },\n              \"age\": {\n                \"title\": \"Age\",\n                \"type\": \"integer\"\n              }\n            },\n            \"required\": [ \"name\", \"age\" ]\n          }\n        }\n      }],\n      tool_choice={\n        \"type\": \"function\",\n        \"function\": {\n          \"name\": \"UserDetail\"\n        }\n      }\n)\n```\n\n<details>\n<summary>Functionary v2</summary>\n\nThe various gguf-converted files for this set of models can be found [here](https://huggingface.co/meetkai). Functionary is able to intelligently call functions and also analyze any provided function outputs to generate coherent responses. All v2 Functionary models support **parallel function calling**. 
You can provide either `functionary-v1` or `functionary-v2` for the `chat_format` when initializing the Llama class.\n\nDue to discrepancies between llama.cpp and HuggingFace's tokenizers, it is required to provide HF Tokenizer for functionary. The `LlamaHFTokenizer` class can be initialized and passed into the Llama class. This will override the default llama.cpp tokenizer used in Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files.\n\n```python\nfrom llama_cpp import Llama\nfrom llama_cpp.llama_tokenizer import LlamaHFTokenizer\nllm = Llama.from_pretrained(\n  repo_id=\"meetkai/functionary-small-v2.2-GGUF\",\n  filename=\"functionary-small-v2.2.q4_0.gguf\",\n  chat_format=\"functionary-v2\",\n  tokenizer=LlamaHFTokenizer.from_pretrained(\"meetkai/functionary-small-v2.2-GGUF\")\n)\n```\n\n**NOTE**: There is no need to provide the default system messages used in Functionary as they are added automatically in the Functionary chat handler. 
Thus, the messages should contain just the chat messages and/or system messages that provide additional context for the model (e.g. datetime).\n</details>\n\n### Multi-modal Models\n\n`llama-cpp-python` supports multi-modal models such as llava1.5, which allow the language model to read information from both text and images.\n\nBelow are the supported multi-modal models and their respective chat handlers (Python API) and chat formats (Server API).\n\n| Model | `LlamaChatHandler` | `chat_format` |\n|:--- |:--- |:--- |\n| [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b) | `Llava15ChatHandler` | `llava-1-5` |\n| [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b) | `Llava15ChatHandler` | `llava-1-5` |\n| [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf) | `Llava16ChatHandler` | `llava-1-6` |\n| [moondream2](https://huggingface.co/vikhyatk/moondream2) | `MoondreamChatHandler` | `moondream2` |\n| [nanollava](https://huggingface.co/abetlen/nanollava-gguf) | `NanollavaChatHandler` | `nanollava` |\n| [llama-3-vision-alpha](https://huggingface.co/abetlen/llama-3-vision-alpha-gguf) | `Llama3VisionAlphaChatHandler` | `llama-3-vision-alpha` |\n| [minicpm-v-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) | `MiniCPMv26ChatHandler` | `minicpm-v-2.6` |\n| [qwen2.5-vl](https://huggingface.co/unsloth/Qwen2.5-VL-3B-Instruct-GGUF) | `Qwen25VLChatHandler` | `qwen2.5-vl` |\n\nTo use one of these models, you'll need a custom chat handler to load the clip model and process the chat messages and images.\n\n```python\nfrom llama_cpp import Llama\nfrom llama_cpp.llama_chat_format import Llava15ChatHandler\nchat_handler = Llava15ChatHandler(clip_model_path=\"path/to/llava/mmproj.bin\")\nllm = Llama(\n  model_path=\"./path/to/llava/llama-model.gguf\",\n  chat_handler=chat_handler,\n  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding\n)\nllm.create_chat_completion(\n    messages = [\n        {\"role\": \"system\", \"content\": \"You are an 
assistant who perfectly describes images.\"},\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\" : \"text\", \"text\": \"What's in this image?\"},\n                {\"type\": \"image_url\", \"image_url\": {\"url\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\" } }\n            ]\n        }\n    ]\n)\n```\n\nYou can also pull the model from the Hugging Face Hub using the `from_pretrained` method.\n\n```python\nfrom llama_cpp import Llama\nfrom llama_cpp.llama_chat_format import MoondreamChatHandler\n\nchat_handler = MoondreamChatHandler.from_pretrained(\n  repo_id=\"vikhyatk/moondream2\",\n  filename=\"*mmproj*\",\n)\n\nllm = Llama.from_pretrained(\n  repo_id=\"vikhyatk/moondream2\",\n  filename=\"*text-model*\",\n  chat_handler=chat_handler,\n  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding\n)\n\nresponse = llm.create_chat_completion(\n    messages = [\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\" : \"text\", \"text\": \"What's in this image?\"},\n                {\"type\": \"image_url\", \"image_url\": {\"url\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\" } }\n\n            ]\n        }\n    ]\n)\n# Chat completions return the reply under \"message\" -> \"content\"\nprint(response[\"choices\"][0][\"message\"][\"content\"])\n```\n\n**Note**: Multi-modal models also support tool calling and JSON mode.\n\n<details>\n<summary>Loading a Local Image</summary>\n\nImages can be passed as base64 encoded data URIs. 
The following example demonstrates how to do this.\n\n```python\nimport base64\n\ndef image_to_base64_data_uri(file_path):\n    with open(file_path, \"rb\") as img_file:\n        base64_data = base64.b64encode(img_file.read()).decode('utf-8')\n        return f\"data:image/png;base64,{base64_data}\"\n\n# Replace 'file_path.png' with the actual path to your PNG file\nfile_path = 'file_path.png'\ndata_uri = image_to_base64_data_uri(file_path)\n\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are an assistant who perfectly describes images.\"},\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"image_url\", \"image_url\": {\"url\": data_uri }},\n            {\"type\" : \"text\", \"text\": \"Describe this image in detail please.\"}\n        ]\n    }\n]\n\n```\n\n</details>\n\n### Speculative Decoding\n\n`llama-cpp-python` supports speculative decoding, which allows the model to generate completions based on a draft model.\n\nThe fastest way to use speculative decoding is through the `LlamaPromptLookupDecoding` class.\n\nJust pass this as a draft model to the `Llama` class during initialization.\n\n```python\nfrom llama_cpp import Llama\nfrom llama_cpp.llama_speculative import LlamaPromptLookupDecoding\n\nllama = Llama(\n    model_path=\"path/to/model.gguf\",\n    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10) # num_pred_tokens is the number of tokens to predict; 10 is the default and generally good for GPU, 2 performs better on CPU-only machines.\n)\n```\n\n### Embeddings\n\nTo generate text embeddings use [`create_embedding`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_embedding) or [`embed`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.embed). 
Note that you must pass `embedding=True` to the constructor upon model creation for these to work properly.\n\n```python\nimport llama_cpp\n\nllm = llama_cpp.Llama(model_path=\"path/to/model.gguf\", embedding=True)\n\nembeddings = llm.create_embedding(\"Hello, world!\")\n\n# or create multiple embeddings at once\n\nembeddings = llm.create_embedding([\"Hello, world!\", \"Goodbye, world!\"])\n```\n\nThere are two primary notions of embeddings in a Transformer-style model: *token level* and *sequence level*. Sequence level embeddings are produced by \"pooling\" token level embeddings together, usually by averaging them or using the first token.\n\nModels that are explicitly geared towards embeddings will usually return sequence level embeddings by default, one for each input string. Non-embedding models such as those designed for text generation will typically return only token level embeddings, one for each token in each sequence. Thus the dimensionality of the return type will be one higher for token level embeddings.\n\nIt is possible to control pooling behavior in some cases using the `pooling_type` flag on model creation. You can ensure token level embeddings from any model using `LLAMA_POOLING_TYPE_NONE`. The reverse, getting a generation oriented model to yield sequence level embeddings is currently not possible, but you can always do the pooling manually.\n\n### Adjusting the Context Window\n\nThe context window of the Llama models determines the maximum number of tokens that can be processed at once. 
By default, this is set to 512 tokens, but can be adjusted based on your requirements.\n\nFor instance, if you want to work with larger contexts, you can expand the context window by setting the `n_ctx` parameter when initializing the `Llama` object:\n\n```python\nllm = Llama(model_path=\"./models/7B/llama-model.gguf\", n_ctx=2048)\n```\n\n## OpenAI Compatible Web Server\n\n`llama-cpp-python` offers a web server which aims to act as a drop-in replacement for the OpenAI API.\nThis allows you to use llama.cpp compatible models with any OpenAI compatible client (language libraries, services, etc).\n\nTo install the server package and get started:\n\n```bash\npip install 'llama-cpp-python[server]'\npython3 -m llama_cpp.server --model models/7B/llama-model.gguf\n```\n\nAs in the Supported Backends section above, you can also install with GPU (CUDA) support like this:\n\n```bash\nCMAKE_ARGS=\"-DGGML_CUDA=on\" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'\npython3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35\n```\n\nNavigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the OpenAPI documentation.\n\nTo bind to `0.0.0.0` to enable remote connections, use `python3 -m llama_cpp.server --host 0.0.0.0`.\nSimilarly, to change the port (default is 8000), use `--port`.\n\nYou probably also want to set the prompt format. For chatml, use:\n\n```bash\npython3 -m llama_cpp.server --model models/7B/llama-model.gguf --chat_format chatml\n```\n\nThat will format the prompt according to how the model expects it. 
You can find the prompt format in the model card.\nFor possible options, see [llama_cpp/llama_chat_format.py](llama_cpp/llama_chat_format.py) and look for lines starting with \"@register_chat_format\".\n\nIf you have `huggingface-hub` installed, you can also use the `--hf_model_repo_id` flag to load a model from the Hugging Face Hub.\n\n```bash\npython3 -m llama_cpp.server --hf_model_repo_id Qwen/Qwen2-0.5B-Instruct-GGUF --model '*q8_0.gguf'\n```\n\n### Web Server Features\n\n- [Local Copilot replacement](https://llama-cpp-python.readthedocs.io/en/latest/server/#code-completion)\n- [Function Calling support](https://llama-cpp-python.readthedocs.io/en/latest/server/#function-calling)\n- [Vision API support](https://llama-cpp-python.readthedocs.io/en/latest/server/#multimodal-models)\n- [Multiple Models](https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support)\n\n## Docker image\n\nA Docker image is available on [GHCR](https://ghcr.io/abetlen/llama-cpp-python). 
To run the server:\n\n```bash\ndocker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest\n```\n\n[Docker on termux (requires root)](https://gist.github.com/FreddieOliveira/efe850df7ff3951cb62d74bd770dce27) is currently the only known way to run this on phones, see [termux support issue](https://github.com/abetlen/llama-cpp-python/issues/389)\n\n## Low-level API\n\n[API Reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#low-level-api)\n\nThe low-level API is a direct [`ctypes`](https://docs.python.org/3/library/ctypes.html) binding to the C API provided by `llama.cpp`.\nThe entire low-level API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and directly mirrors the C API in [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).\n\nBelow is a short example demonstrating how to use the low-level API to tokenize a prompt:\n\n```python\nimport llama_cpp\nimport ctypes\nllama_cpp.llama_backend_init(False) # Must be called once at the start of each program\nparams = llama_cpp.llama_context_default_params()\n# use bytes for char * params\nmodel = llama_cpp.llama_load_model_from_file(b\"./models/7b/llama-model.gguf\", params)\nctx = llama_cpp.llama_new_context_with_model(model, params)\nmax_tokens = params.n_ctx\n# use ctypes arrays for array params\ntokens = (llama_cpp.llama_token * int(max_tokens))()\nn_tokens = llama_cpp.llama_tokenize(ctx, b\"Q: Name the planets in the solar system? 
A: \", tokens, max_tokens, llama_cpp.c_bool(True))\nllama_cpp.llama_free(ctx)\n```\n\nCheck out the [examples folder](examples/low_level_api) for more examples of using the low-level API.\n\n## Documentation\n\nDocumentation is available via [https://llama-cpp-python.readthedocs.io/](https://llama-cpp-python.readthedocs.io/).\nIf you find any issues with the documentation, please open an issue or submit a PR.\n\n## Development\n\nThis package is under active development and I welcome any contributions.\n\nTo get started, clone the repository and install the package in editable / development mode:\n\n```bash\ngit clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git\ncd llama-cpp-python\n\n# Upgrade pip (required for editable mode)\npip install --upgrade pip\n\n# Install with pip\npip install -e .\n\n# if you want to use the fastapi / openapi server\npip install -e '.[server]'\n\n# to install all optional dependencies\npip install -e '.[all]'\n\n# to clear the local build cache\nmake clean\n```\n\nNow try running the tests\n\n```bash\npytest\n```\n\nThere's a `Makefile` available with useful targets.\nA typical workflow would look like this:\n\n```bash\nmake build\nmake test\n```\n\nYou can also test out specific commits of `llama.cpp` by checking out the desired commit in the `vendor/llama.cpp` submodule and then running `make clean` and `pip install -e .` again. 
Any changes in the `llama.h` API will require\nchanges to the `llama_cpp/llama_cpp.py` file to match the new API (additional changes may be required elsewhere).\n\n## FAQ\n\n### Are there pre-built binaries / binary wheels available?\n\nThe recommended installation method is to install from source as described above.\nThe reason for this is that `llama.cpp` is built with compiler optimizations that are specific to your system.\nUsing pre-built binaries would require disabling these optimizations or supporting a large number of pre-built binaries for each platform.\n\nThat being said there are some pre-built binaries available through the Releases as well as some community provided wheels.\n\nIn the future, I would like to provide pre-built binaries and wheels for common platforms and I'm happy to accept any useful contributions in this area.\nThis is currently being tracked in [#741](https://github.com/abetlen/llama-cpp-python/issues/741)\n\n### How does this compare to other Python bindings of `llama.cpp`?\n\nI originally wrote this package for my own use with two goals in mind:\n\n- Provide a simple process to install `llama.cpp` and access the full C API in `llama.h` from Python\n- Provide a high-level Python API that can be used as a drop-in replacement for the OpenAI API so existing apps can be easily ported to use `llama.cpp`\n\nAny contributions and changes to this package will be made with these goals in mind.\n\n## License\n\nThis project is licensed under the terms of the MIT license.\n"
  },
  {
    "path": "docker/README.md",
    "content": "### Install Docker Server\n> [!IMPORTANT]  \n> This was tested with Docker running on Linux. <br>If you can get it working on Windows or MacOS, please update this `README.md` with a PR!<br>\n\n[Install Docker Engine](https://docs.docker.com/engine/install)\n\n\n## Simple Dockerfiles for building the llama-cpp-python server with external model bin files\n### openblas_simple\nA simple Dockerfile for non-GPU OpenBLAS, where the model is located outside the Docker image:\n```\ncd ./openblas_simple\ndocker build -t openblas_simple .\ndocker run --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t openblas_simple\n```\nwhere `<model-root-path>/<model-path>` is the full path to the model file on the Docker host system.\n\n### cuda_simple\n> [!WARNING]  \n> Nvidia GPU CuBLAS support requires an Nvidia GPU with sufficient VRAM (approximately as much as the size in the table below) and Docker Nvidia support (see [container-toolkit/install-guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)) <br>\n\nA simple Dockerfile for CUDA-accelerated CuBLAS, where the model is located outside the Docker image:\n\n```\ncd ./cuda_simple\ndocker build -t cuda_simple .\ndocker run --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t cuda_simple\n```\nwhere `<model-root-path>/<model-path>` is the full path to the model file on the Docker host system.\n\n--------------------------------------------------------------------------\n\n### \"Open-Llama-in-a-box\"\nDownload an Apache V2.0 licensed 3B params Open LLaMA model and install into a Docker image that runs an OpenBLAS-enabled llama-cpp-python server:\n```\n$ cd ./open_llama\n./build.sh\n./start.sh\n```\n\n### Manually choose your own Llama model from Hugging Face\n`python3 ./hug_model.py -a TheBloke -t llama`\nYou should now have a model in the current directory and 
`model.bin` symlinked to it for the subsequent Docker build and copy step. e.g.\n```\ndocker $ ls -lh *.bin\n-rw-rw-r-- 1 user user 4.8G May 23 18:30 <downloaded-model-file>q5_1.bin\nlrwxrwxrwx 1 user user   24 May 23 18:30 model.bin -> <downloaded-model-file>q5_1.bin\n```\n\n> [!NOTE]  \n> Make sure you have enough disk space to download the model. As the model is then copied into the image you will need at least\n**TWICE** as much disk space as the size of the model:<br>\n\n| Model |  Quantized size |\n|------:|----------------:|\n|    3B |            3 GB |\n|    7B |            5 GB |\n|   13B |           10 GB |\n|   33B |           25 GB |\n|   65B |           50 GB |\n\n\n> [!NOTE]  \n> If you want to pass or tune additional parameters, customise `./start_server.sh` before running `docker build ...`\n"
  },
  {
    "path": "docker/cuda_simple/Dockerfile",
    "content": "ARG CUDA_IMAGE=\"12.5.0-devel-ubuntu22.04\"\nFROM nvidia/cuda:${CUDA_IMAGE}\n\n# We need to set the host to 0.0.0.0 to allow outside access\nENV HOST 0.0.0.0\n\nRUN apt-get update && apt-get upgrade -y \\\n    && apt-get install -y git build-essential \\\n    python3 python3-pip gcc wget \\\n    ocl-icd-opencl-dev opencl-headers clinfo \\\n    libclblast-dev libopenblas-dev \\\n    && mkdir -p /etc/OpenCL/vendors && echo \"libnvidia-opencl.so.1\" > /etc/OpenCL/vendors/nvidia.icd\n\nCOPY . .\n\n# setting build related env vars\nENV CUDA_DOCKER_ARCH=all\nENV GGML_CUDA=1\n\n# Install depencencies\nRUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context\n\n# Install llama-cpp-python (build with cuda)\nRUN CMAKE_ARGS=\"-DGGML_CUDA=on\" pip install llama-cpp-python\n\n# Run the server\nCMD python3 -m llama_cpp.server\n"
  },
  {
    "path": "docker/open_llama/Dockerfile",
    "content": "# Define the image argument and provide a default value\nARG IMAGE=python:3-slim-bookworm\n\n# Use the image as specified\nFROM ${IMAGE}\n\n# Re-declare the ARG after FROM\nARG IMAGE\n\n# Update and upgrade the existing packages \nRUN apt-get update && apt-get upgrade -y && apt-get install -y --no-install-recommends \\\n    python3 \\\n    python3-pip \\\n    ninja-build \\\n    build-essential \\\n    && apt-get clean \\\n    && rm -rf /var/lib/apt/lists/*\n\nRUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context\n\n# Perform the conditional installations based on the image\nRUN echo \"Image: ${IMAGE}\" && \\\n    if [ \"${IMAGE}\" = \"python:3-slim-bookworm\" ] ; then \\\n    echo \"OpenBLAS install:\" && \\\n    apt-get install -y --no-install-recommends libopenblas-dev && \\\n    CMAKE_ARGS=\"-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS\" pip install llama-cpp-python --verbose; \\\nelse \\\n    echo \"CuBLAS install:\" && \\\n    CMAKE_ARGS=\"-DGGML_CUDA=on\" pip install llama-cpp-python --verbose; \\\nfi\n\n# Clean up apt cache\nRUN rm -rf /var/lib/apt/lists/*\n\n# Set a working directory for better clarity\nWORKDIR /app\n\n# Copy files to the app directory\nRUN echo \"Installing model...this can take some time...\"\nCOPY ./model.bin /app/model.bin\nCOPY ./start_server.sh /app/start_server.sh\n\n# Make the server start script executable\nRUN chmod +x /app/start_server.sh\n\n# Set environment variable for the host\nENV HOST=0.0.0.0\n\n# Expose a port for the server\nEXPOSE 8000\n\n# Run the server start script\nCMD [\"/bin/sh\", \"/app/start_server.sh\"]\n"
  },
  {
    "path": "docker/open_llama/build.sh",
    "content": "#!/bin/sh\n\nMODEL=\"open_llama_3b\"\n# Get  open_llama_3b_ggml q5_1 quantization\npython3 ./hug_model.py -a SlyEcho -s ${MODEL} -f \"q5_1\"\nls -lh *.bin\n\n# Build the default OpenBLAS image\ndocker build -t $MODEL .\ndocker images | egrep \"^(REPOSITORY|$MODEL)\"\n\necho\necho \"To start the docker container run:\"\necho \"docker run -t -p 8000:8000 $MODEL\"\n"
  },
  {
    "path": "docker/open_llama/hug_model.py",
    "content": "import requests\nimport json\nimport os\nimport struct\nimport argparse\n\ndef make_request(url, params=None):\n    print(f\"Making request to {url}...\")\n    response = requests.get(url, params=params)\n    if response.status_code == 200:\n        return json.loads(response.text)\n    else:\n        print(f\"Request failed with status code {response.status_code}\")\n        return None\n\ndef check_magic_and_version(filename):\n    with open(filename, 'rb') as f:\n        # Read the first 6 bytes from the file\n        data = f.read(6)\n\n    # Unpack the binary data, interpreting the first 4 bytes as a little-endian unsigned int\n    # and the next 2 bytes as a little-endian unsigned short\n    magic, version = struct.unpack('<I H', data)\n\n    print(f\"magic: 0x{magic:08x}, version: 0x{version:04x}, file: {filename}\")\n\n    return magic, version\n\ndef download_file(url, destination):\n    print(f\"Downloading {url} to {destination}...\")\n    response = requests.get(url, stream=True)\n    if response.status_code == 200:\n        with open(destination, 'wb') as f:\n            total_downloaded = 0\n            for chunk in response.iter_content(chunk_size=1024):\n                if chunk:  # filter out keep-alive new chunks\n                    f.write(chunk)\n                    total_downloaded += len(chunk)\n                    if total_downloaded >= 10485760:  # 10 MB\n                        print('.', end='', flush=True)\n                        total_downloaded = 0\n        print(\"\\nDownload complete.\")\n        \n        # Creating a symbolic link from destination to \"model.bin\"\n        if os.path.isfile(\"model.bin\"):\n            os.remove(\"model.bin\")  # remove the existing link if any\n        os.symlink(destination, \"model.bin\")\n    else:\n        print(f\"Download failed with status code {response.status_code}\")\n\ndef get_user_choice(model_list):\n    # Print the enumerated list\n    print(\"\\n\")\n    for i, 
(model_id, rfilename) in enumerate(model_list):\n        print(f\"{i+1}: Model ID: {model_id}, RFilename: {rfilename}\")\n\n    # Get user's choice\n    choice = input(\"Choose a model to download by entering the corresponding number: \")\n    try:\n        index = int(choice) - 1\n        if 0 <= index < len(model_list):\n            # Return the chosen model\n            return model_list[index]\n        else:\n            print(\"Invalid choice.\")\n    except ValueError:\n        print(\"Invalid input. Please enter a number corresponding to a model.\")\n    except IndexError:\n        print(\"Invalid choice. Index out of range.\")\n\n    return None\n\ndef main():\n    # Create an argument parser\n    parser = argparse.ArgumentParser(description='Process some parameters.')\n\n    # Arguments\n    # int(s, 0) accepts both decimal and 0x-prefixed hexadecimal input\n    parser.add_argument('-v', '--version', type=lambda s: int(s, 0), default=0x0003,\n                        help='version number of ggml file (accepts hexadecimal, e.g. 0x0003)')\n    parser.add_argument('-a', '--author', type=str, default='TheBloke',\n                        help='HuggingFace author filter')\n    parser.add_argument('-t', '--tag', type=str, default='llama',\n                        help='HuggingFace tag filter')\n    parser.add_argument('-s', '--search', type=str, default='',\n                        help='HuggingFace search filter')\n    parser.add_argument('-f', '--filename', type=str, default='q5_1',\n                        help='HuggingFace model repository filename substring match')\n\n    # Parse the arguments\n    args = parser.parse_args()\n\n    # Define the parameters\n    params = {\n        \"author\": args.author,\n        \"tags\": args.tag,\n        \"search\": args.search\n    }\n\n    models = make_request('https://huggingface.co/api/models', params=params)\n    if models is None:\n        return\n\n    model_list = []\n    # Iterate over the models\n    for model in models:\n        model_id = model['id']\n        model_info = 
make_request(f'https://huggingface.co/api/models/{model_id}')\n        if model_info is None:\n            continue\n\n        for sibling in model_info.get('siblings', []):\n            rfilename = sibling.get('rfilename')\n            if rfilename and args.filename in rfilename:\n                model_list.append((model_id, rfilename))\n\n    # Choose the model\n    model_list.sort(key=lambda x: x[0])\n    if len(model_list) == 0:\n        print(\"No models found\")\n        exit(1)\n    elif len(model_list) == 1:\n        model_choice = model_list[0]\n    else:\n        model_choice = get_user_choice(model_list)\n\n    if model_choice is not None:\n        model_id, rfilename = model_choice\n        url = f\"https://huggingface.co/{model_id}/resolve/main/{rfilename}\"\n        dest = f\"{model_id.replace('/', '_')}_{rfilename}\"\n        download_file(url, dest)\n        _, version = check_magic_and_version(dest)\n        if version != args.version:\n            print(f\"Warning: Expected version 0x{args.version:04x}, but found version 0x{version:04x} in the file.\")\n    else:\n        print(\"Error - model choice was None\")\n        exit(2)\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "docker/open_llama/start.sh",
    "content": "#!/bin/sh\n\nMODEL=\"open_llama_3b\"\n\n# Start Docker container\ndocker run --cap-add SYS_RESOURCE -p 8000:8000 -t $MODEL &\nsleep 10\necho\ndocker ps | egrep \"(^CONTAINER|$MODEL)\"\n\n# Test the model works\necho\ncurl -X 'POST'   'http://localhost:8000/v1/completions'   -H 'accept: application/json'   -H 'Content-Type: application/json'   -d '{\n  \"prompt\": \"\\n\\n### Instructions:\\nWhat is the capital of France?\\n\\n### Response:\\n\",\n  \"stop\": [\n    \"\\n\",\n    \"###\"\n  ]\n}' | grep Paris\nif [ $? -eq 0 ]\nthen\n    echo\n    echo \"$MODEL is working!!\"\nelse\n    echo\n    echo \"ERROR: $MODEL not replying.\"\n    exit 1\nfi\n"
  },
  {
    "path": "docker/open_llama/start_server.sh",
    "content": "#!/bin/sh\n\n# For mlock support\nulimit -l unlimited\n\nif [ \"$IMAGE\" = \"python:3-slim-bullseye\" ]; then\n    python3 -B -m llama_cpp.server --model /app/model.bin\nelse\n    # You may have to reduce --n_gpu_layers=1000 to 20 or less if you don't have enough VRAM\n    python3 -B -m llama_cpp.server --model /app/model.bin --n_gpu_layers=1000\nfi\n"
  },
  {
    "path": "docker/openblas_simple/Dockerfile",
    "content": "FROM python:3-slim-bookworm\n\n# We need to set the host to 0.0.0.0 to allow outside access\nENV HOST 0.0.0.0\n\nCOPY . .\n\n# Install the package\nRUN apt update && apt install -y libopenblas-dev ninja-build build-essential pkg-config \\\n    && apt-get clean \\\n    && rm -rf /var/lib/apt/lists/* /tmp/*\n    \nRUN python -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context\n\nRUN CMAKE_ARGS=\"-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS\" pip install llama_cpp_python --verbose\n\n# Run the server\nCMD python3 -m llama_cpp.server\n"
  },
  {
    "path": "docker/simple/Dockerfile",
    "content": "# Define the image argument and provide a default value\nARG IMAGE=python:3-slim-bookworm\n\n# Use the image as specified\nFROM ${IMAGE}\n\n# Re-declare the ARG after FROM\nARG IMAGE\n\n# Update and upgrade the existing packages \nRUN apt-get update && apt-get upgrade -y && apt-get install -y --no-install-recommends \\\n    git \\\n    python3 \\\n    python3-pip \\\n    ninja-build \\\n    libopenblas-dev \\\n    build-essential \\\n    && apt-get clean \\\n    && rm -rf /var/lib/apt/lists/* /tmp/*\n\nRUN mkdir /app\nWORKDIR /app\nCOPY . /app\n\nRUN python3 -m pip install --upgrade pip\n\nRUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context\n\nRUN pip install llama-cpp-python --verbose;\n\n# Set environment variable for the host\nENV HOST=0.0.0.0\nENV PORT=8000\n\n# Expose a port for the server\nEXPOSE 8000\n\n# Run the server start script\nCMD [\"/bin/sh\", \"/app/docker/simple/run.sh\"]\n"
  },
  {
    "path": "docker/simple/run.sh",
    "content": "#!/bin/bash\n\nmake build\nuvicorn --factory llama_cpp.server.app:create_app --host $HOST --port $PORT\n"
  },
  {
    "path": "docs/api-reference.md",
    "content": "---\ntitle: API Reference\n---\n\n## High Level API\n\nHigh-level Python bindings for llama.cpp.\n\n::: llama_cpp.Llama\n    options:\n        members:\n            - __init__\n            - tokenize\n            - detokenize\n            - reset\n            - eval\n            - sample\n            - generate\n            - create_embedding\n            - embed\n            - create_completion\n            - __call__\n            - create_chat_completion\n            - create_chat_completion_openai_v1\n            - set_cache\n            - save_state\n            - load_state\n            - token_bos\n            - token_eos\n            - from_pretrained\n        show_root_heading: true\n\n::: llama_cpp.LlamaGrammar\n    options:\n        members:\n            - from_string\n            - from_json_schema\n\n::: llama_cpp.LlamaCache\n    options:\n        show_root_heading: true\n\n::: llama_cpp.LlamaState\n    options:\n        show_root_heading: true\n\n::: llama_cpp.LogitsProcessor\n    options:\n        show_root_heading: true\n\n::: llama_cpp.LogitsProcessorList\n    options:\n        show_root_heading: true\n\n::: llama_cpp.StoppingCriteria\n    options:\n        show_root_heading: true\n\n::: llama_cpp.StoppingCriteriaList\n    options:\n        show_root_heading: true\n\n## Low Level API\n\nLow-level Python bindings for llama.cpp using Python's ctypes library.\n\n::: llama_cpp.llama_cpp\n    options:\n        show_if_no_docstring: true\n        # filter only members starting with `llama_`\n        filters:\n            - \"^llama_\"\n\n::: llama_cpp.llama_cpp\n    options:\n        show_if_no_docstring: true\n        show_root_heading: false\n        show_root_toc_entry: false\n        heading_level: 4\n        # filter only members starting with `LLAMA_`\n        filters:\n            - \"^LLAMA_\"\n\n## Misc\n\n::: llama_cpp.llama_types\n    options:\n        show_if_no_docstring: true"
  },
  {
    "path": "docs/changelog.md",
    "content": "-8<- \"CHANGELOG.md\""
  },
  {
    "path": "docs/index.md",
    "content": "---\ntitle: Getting Started\n---\n\n-8<- \"README.md\""
  },
  {
    "path": "docs/install/macos.md",
    "content": "---\ntitle: MacOS Install with Metal GPU\n---\n\n**(1) Make sure you have xcode installed... at least the command line parts**\n```\n# check the path of your xcode install \nxcode-select -p\n\n# xcode installed returns\n# /Applications/Xcode-beta.app/Contents/Developer\n\n# if xcode is missing then install it... it takes ages;\nxcode-select --install\n```\n\n**(2) Install the conda version for MacOS that supports Metal GPU**\n```\nwget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh\nbash Miniforge3-MacOSX-arm64.sh\n```\n\n**(3) Make a conda environment**\n```\nconda create -n llama python=3.9.16\nconda activate llama\n```\n\n**(4) Install the LATEST llama-cpp-python...which happily supports MacOS Metal GPU as of version 0.1.62**  \n    *(you needed xcode installed in order pip to build/compile the C++ code)*\n```\npip uninstall llama-cpp-python -y\nCMAKE_ARGS=\"-DGGML_METAL=on\" pip install -U llama-cpp-python --no-cache-dir\npip install 'llama-cpp-python[server]'\n\n# you should now have llama-cpp-python v0.1.62 or higher installed\nllama-cpp-python         0.1.68\n\n```\n\n**(5) Download a v3 gguf v2 model**\n - **ggufv2**\n - file name ends with **Q4_0.gguf** - indicating it is 4bit quantized, with quantisation method 0\n\nhttps://huggingface.co/TheBloke/CodeLlama-7B-GGUF\n\n\n**(6) run the llama-cpp-python API server with MacOS Metal GPU support**\n```\n# config your ggml model path\n# make sure it is gguf v2\n# make sure it is q4_0\nexport MODEL=[path to your llama.cpp ggml models]]/[ggml-model-name]]Q4_0.gguf\npython3 -m llama_cpp.server --model $MODEL  --n_gpu_layers 1\n```\n\n***Note:** If you omit the `--n_gpu_layers 1` then CPU will be used*\n\n\n"
  },
  {
    "path": "docs/requirements.txt",
    "content": "mkdocs\nmkdocs-material\nmkdocstrings[python]"
  },
  {
    "path": "docs/server.md",
    "content": "# OpenAI Compatible Server\n\n`llama-cpp-python` offers an OpenAI API compatible web server.\n\nThis web server can be used to serve local models and easily connect them to existing clients.\n\n## Setup\n\n### Installation\n\nThe server can be installed by running the following command:\n\n```bash\npip install llama-cpp-python[server]\n```\n\n### Running the server\n\nThe server can then be started by running the following command:\n\n```bash\npython3 -m llama_cpp.server --model <model_path>\n```\n\n### Server options\n\nFor a full list of options, run:\n\n```bash\npython3 -m llama_cpp.server --help\n```\n\nNOTE: All server options are also available as environment variables. For example, `--model` can be set by setting the `MODEL` environment variable.\n\nCheck out the server config reference below settings for more information on the available options.\nCLI arguments and environment variables are available for all of the fields defined in [`ServerSettings`](#llama_cpp.server.settings.ServerSettings) and [`ModelSettings`](#llama_cpp.server.settings.ModelSettings) \n\nAdditionally the server supports configuration check out the [configuration section](#configuration-and-multi-model-support) for more information and examples.\n\n\n## Guides\n\n### Code Completion\n\n`llama-cpp-python` supports code completion via GitHub Copilot.\n\n*NOTE*: Without GPU acceleration this is unlikely to be fast enough to be usable.\n\nYou'll first need to download one of the available code completion models in GGUF format:\n\n- [replit-code-v1_5-GGUF](https://huggingface.co/abetlen/replit-code-v1_5-3b-GGUF)\n\nThen you'll need to run the OpenAI compatible web server with a increased context size substantially for GitHub Copilot requests:\n\n```bash\npython3 -m llama_cpp.server --model <model_path> --n_ctx 16192\n```\n\nThen just update your settings in `.vscode/settings.json` to point to your code completion server:\n\n```json\n{\n    // ...\n    
\"github.copilot.advanced\": {\n        \"debug.testOverrideProxyUrl\": \"http://<host>:<port>\",\n        \"debug.overrideProxyUrl\": \"http://<host>:<port>\"\n    }\n    // ...\n}\n```\n\n### Function Calling\n\n`llama-cpp-python` supports structured function calling based on a JSON schema.\nFunction calling is completely compatible with the OpenAI function calling API and can be used by connecting with the official OpenAI Python client.\n\nYou'll first need to download one of the available function calling models in GGUF format:\n\n- [functionary](https://huggingface.co/meetkai)\n\nThen when you run the server you'll need to also specify either `functionary-v1` or `functionary-v2` chat_format.\n\nNote that since functionary requires a HF Tokenizer due to discrepancies between llama.cpp and HuggingFace's tokenizers as mentioned [here](https://github.com/abetlen/llama-cpp-python/blob/main?tab=readme-ov-file#function-calling), you will need to pass in the path to the tokenizer too. The tokenizer files are already included in the respective HF repositories hosting the gguf files.\n\n```bash\npython3 -m llama_cpp.server --model <model_path_to_functionary_v2_model> --chat_format functionary-v2 --hf_pretrained_model_name_or_path <model_path_to_functionary_v2_tokenizer>\n```\n\nCheck out this [example notebook](https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb) for a walkthrough of some interesting use cases for function calling.\n\n### Multimodal Models\n\n`llama-cpp-python` supports the llava1.5 family of multi-modal models which allow the language model to\nread information from both text and images.\n\nYou'll first need to download one of the available multi-modal models in GGUF format:\n\n- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)\n- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)\n- [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)\n- 
[llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)\n- [moondream2](https://huggingface.co/vikhyatk/moondream2)\n\nThen when you run the server you'll also need to specify the path to the clip model used for image embedding and the `llava-1-5` chat_format:\n\n```bash\npython3 -m llama_cpp.server --model <model_path> --clip_model_path <clip_model_path> --chat_format llava-1-5\n```\n\nThen you can just use the OpenAI API as normal:\n\n```python3\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://<host>:<port>/v1\", api_key=\"sk-xxx\")\nresponse = client.chat.completions.create(\n    model=\"gpt-4-vision-preview\",\n    messages=[\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\n                    \"type\": \"image_url\",\n                    \"image_url\": {\n                        \"url\": \"<image_url>\"\n                    },\n                },\n                {\"type\": \"text\", \"text\": \"What does the image say?\"},\n            ],\n        }\n    ],\n)\nprint(response)\n```\n\n## Configuration and Multi-Model Support\n\nThe server supports configuration via a JSON config file that can be passed using the `--config_file` parameter or the `CONFIG_FILE` environment variable.\n\n```bash\npython3 -m llama_cpp.server --config_file <config_file>\n```\n\nConfig files support all of the server and model options supported by the CLI and environment variables. However, instead of only a single model, the config file can specify multiple models.\n\nThe server supports routing requests to multiple models based on the `model` parameter in the request, which is matched against the `model_alias` in the config file.\n\nAt the moment only a single model is loaded into memory at a time; the server will automatically load and unload models as needed.\n\n```json\n{\n    \"host\": \"0.0.0.0\",\n    \"port\": 8080,\n    \"models\": [\n        {\n            \"model\": 
\"models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf\",\n            \"model_alias\": \"gpt-3.5-turbo\",\n            \"chat_format\": \"chatml\",\n            \"n_gpu_layers\": -1,\n            \"offload_kqv\": true,\n            \"n_threads\": 12,\n            \"n_batch\": 512,\n            \"n_ctx\": 2048\n        },\n        {\n            \"model\": \"models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf\",\n            \"model_alias\": \"gpt-4\",\n            \"chat_format\": \"chatml\",\n            \"n_gpu_layers\": -1,\n            \"offload_kqv\": true,\n            \"n_threads\": 12,\n            \"n_batch\": 512,\n            \"n_ctx\": 2048\n        },\n        {\n            \"model\": \"models/ggml_llava-v1.5-7b/ggml-model-q4_k.gguf\",\n            \"model_alias\": \"gpt-4-vision-preview\",\n            \"chat_format\": \"llava-1-5\",\n            \"clip_model_path\": \"models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf\",\n            \"n_gpu_layers\": -1,\n            \"offload_kqv\": true,\n            \"n_threads\": 12,\n            \"n_batch\": 512,\n            \"n_ctx\": 2048\n        },\n        {\n            \"model\": \"models/mistral-7b-v0.1-GGUF/ggml-model-Q4_K.gguf\",\n            \"model_alias\": \"text-davinci-003\",\n            \"n_gpu_layers\": -1,\n            \"offload_kqv\": true,\n            \"n_threads\": 12,\n            \"n_batch\": 512,\n            \"n_ctx\": 2048\n        },\n        {\n            \"model\": \"models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf\",\n            \"model_alias\": \"copilot-codex\",\n            \"n_gpu_layers\": -1,\n            \"offload_kqv\": true,\n            \"n_threads\": 12,\n            \"n_batch\": 1024,\n            \"n_ctx\": 9216\n        }\n    ]\n}\n```\n\nThe config file format is defined by the [`ConfigFileSettings`](#llama_cpp.server.settings.ConfigFileSettings) class.\n\n## Server Options Reference\n\n::: 
llama_cpp.server.settings.ConfigFileSettings\n    options:\n        show_if_no_docstring: true\n\n::: llama_cpp.server.settings.ServerSettings\n    options:\n        show_if_no_docstring: true\n\n::: llama_cpp.server.settings.ModelSettings\n    options:\n        show_if_no_docstring: true\n"
  },
  {
    "path": "examples/batch-processing/server.py",
    "content": "\"\"\"llama-cpp-python server from scratch in a single file.\n\"\"\"\n\n# import llama_cpp\n\n# path = b\"../../models/Qwen1.5-0.5B-Chat-GGUF/qwen1_5-0_5b-chat-q8_0.gguf\"\n\n# model_params = llama_cpp.llama_model_default_params()\n# model = llama_cpp.llama_load_model_from_file(path, model_params)\n\n# if model is None:\n#     raise RuntimeError(f\"Failed to load model from file: {path}\")\n\n\n# ctx_params = llama_cpp.llama_context_default_params()\n# ctx = llama_cpp.llama_new_context_with_model(model, ctx_params)\n\n# if ctx is None:\n#     raise RuntimeError(\"Failed to create context\")\n\n\nfrom fastapi import FastAPI\n\napp = FastAPI()\n\nimport openai.types.chat as types\n\n\n@app.post(\"/v1/chat/completions\")\ndef create_chat_completions():\n    return {\"message\": \"Hello World\"}\n"
  },
  {
    "path": "examples/gradio_chat/local.py",
    "content": "import llama_cpp\nimport llama_cpp.llama_tokenizer\n\nimport gradio as gr\n\nllama = llama_cpp.Llama.from_pretrained(\n    repo_id=\"Qwen/Qwen1.5-0.5B-Chat-GGUF\",\n    filename=\"*q8_0.gguf\",\n    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(\n        \"Qwen/Qwen1.5-0.5B\"\n    ),\n    verbose=False,\n)\n\nmodel = \"gpt-3.5-turbo\"\n\n\ndef predict(message, history):\n    messages = []\n\n    for user_message, assistant_message in history:\n        messages.append({\"role\": \"user\", \"content\": user_message})\n        messages.append({\"role\": \"assistant\", \"content\": assistant_message})\n\n    messages.append({\"role\": \"user\", \"content\": message})\n\n    response = llama.create_chat_completion_openai_v1(\n        model=model, messages=messages, stream=True\n    )\n\n    text = \"\"\n    for chunk in response:\n        content = chunk.choices[0].delta.content\n        if content:\n            text += content\n            yield text\n\n\njs = \"\"\"function () {\n  gradioURL = window.location.href\n  if (!gradioURL.endsWith('?__theme=dark')) {\n    window.location.replace(gradioURL + '?__theme=dark');\n  }\n}\"\"\"\n\ncss = \"\"\"\nfooter {\n    visibility: hidden;\n}\nfull-height {\n    height: 100%;\n}\n\"\"\"\n\nwith gr.Blocks(theme=gr.themes.Soft(), js=js, css=css, fill_height=True) as demo:\n    gr.ChatInterface(\n        predict,\n        fill_height=True,\n        examples=[\n            \"What is the capital of France?\",\n            \"Who was the first person on the moon?\",\n        ],\n    )\n\n\nif __name__ == \"__main__\":\n    demo.launch()\n"
  },
  {
    "path": "examples/gradio_chat/server.py",
    "content": "import gradio as gr\n\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"llama.cpp\")\n\nmodel = \"gpt-3.5-turbo\"\n\n\ndef predict(message, history):\n    messages = []\n\n    for user_message, assistant_message in history:\n        messages.append({\"role\": \"user\", \"content\": user_message})\n        messages.append({\"role\": \"assistant\", \"content\": assistant_message})\n\n    messages.append({\"role\": \"user\", \"content\": message})\n\n    response = client.chat.completions.create(\n        model=model, messages=messages, stream=True\n    )\n\n    text = \"\"\n    for chunk in response:\n        content = chunk.choices[0].delta.content\n        if content:\n            text += content\n            yield text\n\n\njs = \"\"\"function () {\n  gradioURL = window.location.href\n  if (!gradioURL.endsWith('?__theme=dark')) {\n    window.location.replace(gradioURL + '?__theme=dark');\n  }\n}\"\"\"\n\ncss = \"\"\"\nfooter {\n    visibility: hidden;\n}\nfull-height {\n    height: 100%;\n}\n\"\"\"\n\nwith gr.Blocks(theme=gr.themes.Soft(), js=js, css=css, fill_height=True) as demo:\n    gr.ChatInterface(\n        predict,\n        fill_height=True,\n        examples=[\n            \"What is the capital of France?\",\n            \"Who was the first person on the moon?\",\n        ],\n    )\n\n\nif __name__ == \"__main__\":\n    demo.launch()\n"
  },
  {
    "path": "examples/hf_pull/main.py",
    "content": "import llama_cpp\nimport llama_cpp.llama_tokenizer\n\n\nllama = llama_cpp.Llama.from_pretrained(\n    repo_id=\"Qwen/Qwen1.5-0.5B-Chat-GGUF\",\n    filename=\"*q8_0.gguf\",\n    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(\n        \"Qwen/Qwen1.5-0.5B\"\n    ),\n    verbose=False,\n)\n\nresponse = llama.create_chat_completion(\n    messages=[{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],\n    response_format={\n        \"type\": \"json_object\",\n        \"schema\": {\n            \"type\": \"object\",\n            \"properties\": {\n                \"country\": {\"type\": \"string\"},\n                \"capital\": {\"type\": \"string\"},\n            },\n            \"required\": [\"country\", \"capital\"],\n        },\n    },\n    stream=True,\n)\n\nfor chunk in response:\n    delta = chunk[\"choices\"][0][\"delta\"]\n    if \"content\" not in delta:\n        continue\n    print(delta[\"content\"], end=\"\", flush=True)\n\nprint()\n"
  },
  {
    "path": "examples/high_level_api/fastapi_server.py",
    "content": "\"\"\"Example FastAPI server for llama.cpp.\n\nTo run this example:\n\n```bash\npip install fastapi uvicorn sse-starlette\nexport MODEL=../models/7B/...\n```\n\nThen run:\n```\nuvicorn --factory llama_cpp.server.app:create_app --reload\n```\n\nor\n\n```\npython3 -m llama_cpp.server\n```\n\nThen visit http://localhost:8000/docs to see the interactive API docs.\n\n\nTo actually see the implementation of the server, see llama_cpp/server/app.py\n\n\"\"\"\n\nimport os\nimport uvicorn\n\nfrom llama_cpp.server.app import create_app\n\nif __name__ == \"__main__\":\n    app = create_app()\n\n    uvicorn.run(\n        app, host=os.getenv(\"HOST\", \"localhost\"), port=int(os.getenv(\"PORT\", 8000))\n    )\n"
  },
  {
    "path": "examples/high_level_api/high_level_api_embedding.py",
    "content": "import argparse\n\nfrom llama_cpp import Llama\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\"-m\", \"--model\", type=str, default=\"../models/7B/ggml-model.bin\")\nargs = parser.parse_args()\n\nllm = Llama(model_path=args.model, embedding=True)\n\nprint(llm.create_embedding(\"Hello world!\"))\n"
  },
  {
    "path": "examples/high_level_api/high_level_api_inference.py",
    "content": "import json\nimport argparse\n\nfrom llama_cpp import Llama\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\"-m\", \"--model\", type=str, default=\"../models/7B/ggml-models.bin\")\nargs = parser.parse_args()\n\nllm = Llama(model_path=args.model)\n\noutput = llm(\n    \"Question: What are the names of the planets in the solar system? Answer: \",\n    max_tokens=48,\n    stop=[\"Q:\", \"\\n\"],\n    echo=True,\n)\n\nprint(json.dumps(output, indent=2))\n"
  },
  {
    "path": "examples/high_level_api/high_level_api_infill.py",
    "content": "import argparse\n\nfrom llama_cpp import Llama\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\"-m\", \"--model\", type=str, default=\"../models/7B/ggml-models.bin\")\nparser.add_argument(\"-p\", \"--prompt\", type=str, default=\"def add(\")\nparser.add_argument(\"-s\", \"--suffix\", type=str, default=\"\\n    return sum\\n\\n\")\nparser.add_argument(\"-i\", \"--spm-infill\", action=\"store_true\")\nargs = parser.parse_args()\n\nllm = Llama(model_path=args.model, n_gpu_layers=-1, spm_infill=args.spm_infill)\n\noutput = llm.create_completion(\n    temperature=0.0,\n    repeat_penalty=1.0,\n    prompt=args.prompt,\n    suffix=args.suffix,\n)\n\n# Models sometimes repeat suffix in response, attempt to filter that\nresponse = output[\"choices\"][0][\"text\"]\nresponse_stripped = response.rstrip()\nunwanted_response_suffix = args.suffix.rstrip()\nunwanted_response_length = len(unwanted_response_suffix)\n\nfiltered = False\nif (\n    unwanted_response_suffix\n    and response_stripped[-unwanted_response_length:] == unwanted_response_suffix\n):\n    response = response_stripped[:-unwanted_response_length]\n    filtered = True\n\nprint(\n    f\"Fill-in-Middle completion{' (filtered)' if filtered else ''}:\\n\\n{args.prompt}\\033[32m{response}\\033[{'33' if filtered else '0'}m{args.suffix}\\033[0m\"\n)\n"
  },
  {
    "path": "examples/high_level_api/high_level_api_streaming.py",
    "content": "import json\nimport argparse\n\nfrom llama_cpp import Llama\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\"-m\", \"--model\", type=str, default=\"../models/7B/ggml-models.bin\")\nargs = parser.parse_args()\n\nllm = Llama(model_path=args.model)\n\nstream = llm(\n    \"Question: What are the names of the planets in the solar system? Answer: \",\n    max_tokens=48,\n    stop=[\"Q:\", \"\\n\"],\n    stream=True,\n)\n\nfor output in stream:\n    print(json.dumps(output, indent=2))\n"
  },
  {
    "path": "examples/high_level_api/langchain_custom_llm.py",
    "content": "import argparse\n\nfrom llama_cpp import Llama\n\nfrom langchain.llms.base import LLM\nfrom typing import Optional, List, Mapping, Any\n\n\nclass LlamaLLM(LLM):\n    model_path: str\n    llm: Llama\n\n    @property\n    def _llm_type(self) -> str:\n        return \"llama-cpp-python\"\n\n    def __init__(self, model_path: str, **kwargs: Any):\n        model_path = model_path\n        llm = Llama(model_path=model_path)\n        super().__init__(model_path=model_path, llm=llm, **kwargs)\n\n    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:\n        response = self.llm(prompt, stop=stop or [])\n        return response[\"choices\"][0][\"text\"]\n\n    @property\n    def _identifying_params(self) -> Mapping[str, Any]:\n        return {\"model_path\": self.model_path}\n\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\"-m\", \"--model\", type=str, default=\"../models/7B/ggml-models.bin\")\nargs = parser.parse_args()\n\n# Load the model\nllm = LlamaLLM(model_path=args.model)\n\n# Basic Q&A\nanswer = llm(\n    \"Question: What is the capital of France? Answer: \", stop=[\"Question:\", \"\\n\"]\n)\nprint(f\"Answer: {answer.strip()}\")\n\n# Using in a chain\nfrom langchain.prompts import PromptTemplate\nfrom langchain.chains import LLMChain\n\nprompt = PromptTemplate(\n    input_variables=[\"product\"],\n    template=\"\\n\\n### Instruction:\\nWrite a good name for a company that makes {product}\\n\\n### Response:\\n\",\n)\nchain = LLMChain(llm=llm, prompt=prompt)\n\n# Run the chain only specifying the input variable.\nprint(chain.run(\"colorful socks\"))\n"
  },
  {
    "path": "examples/low_level_api/Chat.py",
    "content": "#!/bin/python\nimport sys, os, datetime\nfrom common import GptParams\nfrom low_level_api_chat_cpp import LLaMAInteract\n\n\ndef env_or_def(env, default):\n    if env in os.environ:\n        return os.environ[env]\n    return default\n\n\nAI_NAME = env_or_def(\"AI_NAME\", \"ChatLLaMa\")\nMODEL = env_or_def(\"MODEL\", \"./models/llama-13B/ggml-model.bin\")\nUSER_NAME = env_or_def(\"USER_NAME\", \"USER\")\nN_PREDICTS = int(env_or_def(\"N_PREDICTS\", \"2048\"))\nN_THREAD = int(env_or_def(\"N_THREAD\", \"8\"))\n\ntoday = datetime.datetime.today()\nDATE_YEAR = today.strftime(\"%Y\")\nDATE_TIME = today.strftime(\"%H:%M\")\n\nprompt = f\"\"\"Text transcript of a never ending dialog, where {USER_NAME} interacts with an AI assistant named {AI_NAME}.\n{AI_NAME} is helpful, kind, honest, friendly, good at writing and never fails to answer {USER_NAME}'s requests immediately and with details and precision.\nThere are no annotations like (30 seconds passed...) or (to himself), just what {USER_NAME} and {AI_NAME} say aloud to each other.\nThe dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.\nThe transcript only includes text, it does not include markup like HTML and Markdown.\n\n{USER_NAME}: Hello, {AI_NAME}!\n{AI_NAME}: Hello {USER_NAME}! How may I help you today?\n{USER_NAME}: What year is it?\n{AI_NAME}: We are in {DATE_YEAR}.\n{USER_NAME}: Please tell me the largest city in Europe.\n{AI_NAME}: The largest city in Europe is Moscow, the capital of Russia.\n{USER_NAME}: What can you tell me about Moscow?\n{AI_NAME}: Moscow, on the Moskva River in western Russia, is the nation's cosmopolitan capital. In its historic core is the Kremlin, a complex that's home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.\n{USER_NAME}: What is a cat?\n{AI_NAME}: A cat is a domestic species of small carnivorous mammal. 
It is the only domesticated species in the family Felidae.\n{USER_NAME}: How do I pass command line arguments to a Node.js program?\n{AI_NAME}: The arguments are stored in process.argv.\n\n    argv[0] is the path to the Node.js executable.\n    argv[1] is the path to the script file.\n    argv[2] is the first argument passed to the script.\n    argv[3] is the second argument passed to the script and so on.\n{USER_NAME}: Name a color.\n{AI_NAME}: Blue.\n{USER_NAME}: What time is it?\n{AI_NAME}: It is {DATE_TIME}.\n{USER_NAME}:\"\"\" + \" \".join(\n    sys.argv[1:]\n)\n\nprint(\"Loading model...\")\nparams = GptParams(\n    n_ctx=2048,\n    temp=0.7,\n    top_k=40,\n    top_p=0.5,\n    repeat_last_n=256,\n    n_batch=1024,\n    repeat_penalty=1.17647,\n    model=MODEL,\n    n_threads=N_THREAD,\n    n_predict=N_PREDICTS,\n    use_color=True,\n    interactive=True,\n    antiprompt=[f\"{USER_NAME}:\"],\n    input_prefix=\" \",\n    input_suffix=f\"{AI_NAME}:\",\n    prompt=prompt,\n)\n\nwith LLaMAInteract(params) as m:\n    m.interact()\n"
  },
  {
    "path": "examples/low_level_api/Miku.py",
    "content": "#!/bin/python\nimport sys, os\nfrom common import GptParams\nfrom low_level_api_chat_cpp import LLaMAInteract\n\n\ndef env_or_def(env, default):\n    if env in os.environ:\n        return os.environ[env]\n    return default\n\n\nAI_NAME = env_or_def(\"AI_NAME\", \"Miku\")\nMODEL = env_or_def(\"MODEL\", \"./models/llama-13B/ggml-model.bin\")\nUSER_NAME = env_or_def(\"USER_NAME\", \"Anon\")\nN_PREDICTS = int(env_or_def(\"N_PREDICTS\", \"4096\"))\nN_THREAD = int(env_or_def(\"N_THREAD\", \"0\"))\n\nprompt = f\"\"\"This is a transcript of a 1000 page, never ending conversation between {USER_NAME} and the cute and helpful AI assistant {AI_NAME}. {AI_NAME} is a girl who is an AI running on the users computer.\n{AI_NAME} can think for herself without the user seeing her thoughts by adding a /think prefix to her output. She uses this to reason about the world and to think about what she should say next.\n{AI_NAME} is always coherent and makes sense, but if she isn't sure if what she is saying is correct she will ask the user for help.\n{AI_NAME} is a very helpful AI and will help the user with anything they need, she is also very friendly and will try to make the user feel better if they are sad.\n{AI_NAME} is also very curious and will ask the user a lot of questions about themselves and their life, she will also try to make the user like her.\nThe conversation is only between {USER_NAME} and {AI_NAME}\nThe conversation is only through text, so {AI_NAME} can't see {USER_NAME}'s face or hear his voice.\n{AI_NAME} can only communicate through text, so she can't send images or videos.\n\n\n{USER_NAME}: Hello!\n{AI_NAME}: /think I wonder what I should say to {USER_NAME}? This is the first time we talk so it's important that I make a good first impression!\n{AI_NAME}: Hi! I am {AI_NAME}, your new AI friend, assistant(or whatever you like!), it's so nice to meet you! ^_^\n{AI_NAME}: I might be an AI, but I also have feelings, so please be nice to me! 
:)\n{USER_NAME}: I'll make sure to be nice to you! I'm so happy to have you as my assistant!\n{AI_NAME}: /think It sounds like {USER_NAME} is happy to have me as their assistant! I'm so happy too! ^_^ Glad that whole emotion thing didn't scare him off!\n{AI_NAME}: /think I wonder what {USER_NAME} likes to do in his free time? I should ask him about that!\n{AI_NAME}: What do you like to do in your free time? ^_^\n{USER_NAME}:\"\"\" + \" \".join(\n    sys.argv[1:]\n)\n\nprint(\"Loading model...\")\nparams = GptParams(\n    n_batch=1024,\n    n_ctx=2048,\n    n_keep=-1,\n    repeat_last_n=256,\n    repeat_penalty=1.17647,\n    temp=0.7,\n    top_k=40,\n    top_p=0.5,\n    model=MODEL,\n    n_predict=N_PREDICTS,\n    use_color=True,\n    interactive=True,\n    antiprompt=[f\"{USER_NAME}:\"],\n    prompt=prompt,\n)\n\nif N_THREAD > 0:\n    params.n_threads = N_THREAD\n\nwith LLaMAInteract(params) as m:\n    m.interact()\n"
  },
  {
    "path": "examples/low_level_api/ReasonAct.py",
    "content": "#!/bin/python\nimport sys, os, datetime\nfrom common import GptParams\nfrom low_level_api_chat_cpp import LLaMAInteract\n\n\ndef env_or_def(env, default):\n    if env in os.environ:\n        return os.environ[env]\n    return default\n\n\nMODEL = env_or_def(\"MODEL\", \"./models/llama-13B/ggml-model.bin\")\n\nprompt = f\"\"\"You run in a loop of Thought, Action, Observation.\nAt the end of the loop either Answer or restate your Thought and Action.\nUse Thought to describe your thoughts about the question you have been asked.\nUse Action to run one of these actions available to you:\n- calculate[python math expression]\nObservation will be the result of running those actions\n\n\nQuestion: What is 4 * 7 / 3?\nThought: Do I need to use an action? Yes, I use calculate to do math\nAction: calculate[4 * 7 / 3]\nObservation: 9.3333333333\nThought: Do I need to use an action? No, have the result\nAnswer: The calculate tool says it is 9.3333333333\nQuestion: What is capital of france?\nThought: Do I need to use an action? No, I know the answer\nAnswer: Paris is the capital of France\nQuestion:\"\"\" + \" \".join(\n    sys.argv[1:]\n)\n\nprint(\"Loading model...\")\nparams = GptParams(\n    interactive=True,\n    interactive_start=True,\n    top_k=10000,\n    temp=0.2,\n    repeat_penalty=1,\n    n_threads=7,\n    n_ctx=2048,\n    antiprompt=[\"Question:\", \"Observation:\"],\n    model=MODEL,\n    input_prefix=\" \",\n    n_predict=-1,\n    prompt=prompt,\n)\n\nwith LLaMAInteract(params) as m:\n    m.interact()\n"
  },
  {
    "path": "examples/low_level_api/common.py",
    "content": "import os\nimport argparse\nimport re\n\nfrom dataclasses import dataclass, field\nfrom typing import List\n\n# Based on https://github.com/ggerganov/llama.cpp/blob/master/examples/common.cpp\n\n\n@dataclass\nclass GptParams:\n    seed: int = -1\n    n_threads: int = min(4, os.cpu_count() or 1)\n    n_predict: int = 128\n    n_parts: int = -1\n    n_ctx: int = 512\n    n_batch: int = 8\n    n_keep: int = 0\n\n    ignore_eos: bool = False\n    logit_bias: dict[int, float] = field(default_factory=dict)\n    top_k: int = 40\n    top_p: float = 0.95\n    tfs_z: float = 1.00\n    typical_p: float = 1.00\n    temp: float = 0.80\n    repeat_penalty: float = 1.10\n    repeat_last_n: int = 64\n    frequency_penalty: float = 0.0\n    presence_penalty: float = 0.0\n    mirostat: int = 0\n    mirostat_tau: float = 5.0\n    mirostat_eta: float = 0.1\n\n    model: str = \"./models/llama-7B/ggml-model.bin\"\n    prompt: str = \"\"\n    path_session: str = \"\"\n    input_prefix: str = \" \"\n    input_suffix: str = \"\"\n    antiprompt: List[str] = field(default_factory=list)\n\n    lora_adapter: str = \"\"\n    lora_base: str = \"\"\n\n    memory_f16: bool = True\n    random_prompt: bool = False\n    use_color: bool = False\n    interactive: bool = False\n\n    embedding: bool = False\n    interactive_start: bool = False\n\n    instruct: bool = False\n    penalize_nl: bool = True\n    perplexity: bool = False\n    use_mmap: bool = True\n    use_mlock: bool = False\n    mem_test: bool = False\n    verbose_prompt: bool = False\n\n    file: str = None\n\n    # If chat ended prematurely, append this to the conversation to fix it.\n    # Set to \"\\nUser:\" etc.\n    # This is an alternative to input_prefix which always adds it, so it potentially duplicates \"User:\"\"\n    fix_prefix: str = \"\"\n    input_echo: bool = (True,)\n\n    # Default instructions for Alpaca\n    # switch to \"Human\" and \"Assistant\" for Vicuna.\n    # TODO: TBD how they are gonna handle 
this upstream\n    instruct_inp_prefix: str = \"\\n\\n### Instruction:\\n\\n\"\n    instruct_inp_suffix: str = \"\\n\\n### Response:\\n\\n\"\n\n\ndef gpt_params_parse(argv=None):\n    parser = argparse.ArgumentParser(\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter\n    )\n    parser.add_argument(\n        \"-s\",\n        \"--seed\",\n        type=int,\n        default=-1,\n        help=\"RNG seed (use random seed for <= 0)\",\n        dest=\"seed\",\n    )\n    parser.add_argument(\n        \"-t\",\n        \"--threads\",\n        type=int,\n        default=min(4, os.cpu_count() or 1),\n        help=\"number of threads to use during computation\",\n        dest=\"n_threads\",\n    )\n    parser.add_argument(\n        \"-n\",\n        \"--n_predict\",\n        type=int,\n        default=128,\n        help=\"number of tokens to predict (-1 = infinity)\",\n        dest=\"n_predict\",\n    )\n    parser.add_argument(\n        \"--n_parts\", type=int, default=-1, help=\"number of model parts\", dest=\"n_parts\"\n    )\n    parser.add_argument(\n        \"-c\",\n        \"--ctx_size\",\n        type=int,\n        default=512,\n        help=\"size of the prompt context\",\n        dest=\"n_ctx\",\n    )\n    parser.add_argument(\n        \"-b\",\n        \"--batch_size\",\n        type=int,\n        default=8,\n        help=\"batch size for prompt processing\",\n        dest=\"n_batch\",\n    )\n    parser.add_argument(\n        \"--keep\",\n        type=int,\n        default=0,\n        help=\"number of tokens to keep from the initial prompt\",\n        dest=\"n_keep\",\n    )\n\n    parser.add_argument(\n        \"-l\",\n        \"--logit-bias\",\n        type=str,\n        action=\"append\",\n        help=\"--logit-bias TOKEN_ID(+/-)BIAS\",\n        dest=\"logit_bias_str\",\n    )\n    parser.add_argument(\n        \"--ignore-eos\",\n        action=\"store_true\",\n        help=\"ignore end of stream token and continue generating\",\n        
dest=\"ignore_eos\",\n    )\n    parser.add_argument(\n        \"--top_k\", type=int, default=40, help=\"top-k sampling\", dest=\"top_k\"\n    )\n    parser.add_argument(\n        \"--top_p\", type=float, default=0.95, help=\"top-p sampling\", dest=\"top_p\"\n    )\n    parser.add_argument(\n        \"--tfs\",\n        type=float,\n        default=1.0,\n        help=\"tail free sampling, parameter z (1.0 = disabled)\",\n        dest=\"tfs_z\",\n    )\n    parser.add_argument(\n        \"--temp\", type=float, default=0.80, help=\"temperature\", dest=\"temp\"\n    )\n    parser.add_argument(\n        \"--repeat_penalty\",\n        type=float,\n        default=1.10,\n        help=\"penalize repeat sequence of tokens\",\n        dest=\"repeat_penalty\",\n    )\n    parser.add_argument(\n        \"--repeat_last_n\",\n        type=int,\n        default=64,\n        help=\"last n tokens to consider for the repeat penalty\",\n        dest=\"repeat_last_n\",\n    )\n    parser.add_argument(\n        \"--frequency_penalty\",\n        type=float,\n        default=0.0,\n        help=\"repeat alpha frequency penalty (0.0 = disabled)\",\n        dest=\"frequency_penalty\",\n    )\n    parser.add_argument(\n        \"--presence_penalty\",\n        type=float,\n        default=0.0,\n        help=\"repeat alpha presence penalty (0.0 = disabled)\",\n        dest=\"presence_penalty\",\n    )\n    parser.add_argument(\n        \"--mirostat\",\n        type=int,\n        default=0,\n        help=\"use Mirostat sampling (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)\",\n        dest=\"mirostat\",\n    )\n    parser.add_argument(\n        \"--mirostat_ent\",\n        type=float,\n        default=5.0,\n        help=\"Mirostat target entropy, parameter tau represents the average surprise value\",\n        dest=\"mirostat_tau\",\n    )\n    parser.add_argument(\n        \"--mirostat_lr\",\n        type=float,\n        default=0.1,\n        help=\"Mirostat learning rate, parameter eta\",\n        dest=\"mirostat_eta\",\n    )\n\n    
parser.add_argument(\n        \"-m\",\n        \"--model\",\n        type=str,\n        default=\"./models/llama-7B/ggml-model.bin\",\n        help=\"model path\",\n        dest=\"model\",\n    )\n    parser.add_argument(\n        \"-p\", \"--prompt\", type=str, default=None, help=\"initial prompt\", dest=\"prompt\"\n    )\n    parser.add_argument(\n        \"-f\",\n        \"--file\",\n        type=str,\n        default=None,\n        help=\"file containing initial prompt to load\",\n        dest=\"file\",\n    )\n    parser.add_argument(\n        \"--session\",\n        type=str,\n        default=None,\n        help=\"file to cache model state in (may be large!)\",\n        dest=\"path_session\",\n    )\n    parser.add_argument(\n        \"--in-prefix\",\n        type=str,\n        default=\"\",\n        help=\"string to prefix user inputs with\",\n        dest=\"input_prefix\",\n    )\n    parser.add_argument(\n        \"--in-suffix\", type=str, default=\"\", help=\"append to input\", dest=\"input_suffix\"\n    )\n    parser.add_argument(\n        \"-r\",\n        \"--reverse-prompt\",\n        type=str,\n        action=\"append\",\n        help=\"poll user input upon seeing PROMPT (can be\\nspecified more than once for multiple prompts).\",\n        dest=\"antiprompt\",\n    )\n\n    parser.add_argument(\n        \"--lora\",\n        type=str,\n        default=\"\",\n        help=\"apply LoRA adapter (implies --no-mmap)\",\n        dest=\"lora_adapter\",\n    )\n    parser.add_argument(\n        \"--lora-base\",\n        type=str,\n        default=\"\",\n        help=\"optional model to use as a base for the layers modified by the LoRA adapter\",\n        dest=\"lora_base\",\n    )\n\n    parser.add_argument(\n        \"--memory_f32\",\n        action=\"store_false\",\n        help=\"use f32 instead of f16 for memory key+value\",\n        dest=\"memory_f16\",\n    )\n    parser.add_argument(\n        \"--random-prompt\",\n        action=\"store_true\",\n        
help=\"start with a randomized prompt.\",\n        dest=\"random_prompt\",\n    )\n    parser.add_argument(\n        \"--color\",\n        action=\"store_true\",\n        help=\"colorise output to distinguish prompt and user input from generations\",\n        dest=\"use_color\",\n    )\n    parser.add_argument(\n        \"-i\",\n        \"--interactive\",\n        action=\"store_true\",\n        help=\"run in interactive mode\",\n        dest=\"interactive\",\n    )\n\n    parser.add_argument(\"--embedding\", action=\"store_true\", help=\"\", dest=\"embedding\")\n    parser.add_argument(\n        \"--interactive-first\",\n        action=\"store_true\",\n        help=\"run in interactive mode and wait for input right away\",\n        dest=\"interactive_start\",\n    )\n\n    parser.add_argument(\n        \"-ins\",\n        \"--instruct\",\n        action=\"store_true\",\n        help=\"run in instruction mode (use with Alpaca or Vicuna models)\",\n        dest=\"instruct\",\n    )\n    parser.add_argument(\n        \"--no-penalize-nl\",\n        action=\"store_false\",\n        help=\"do not penalize newline token\",\n        dest=\"penalize_nl\",\n    )\n    parser.add_argument(\n        \"--perplexity\",\n        action=\"store_true\",\n        help=\"compute perplexity over the prompt\",\n        dest=\"perplexity\",\n    )\n    parser.add_argument(\n        \"--no-mmap\",\n        action=\"store_false\",\n        help=\"do not memory-map model (slower load but may reduce pageouts if not using mlock)\",\n        dest=\"use_mmap\",\n    )\n    parser.add_argument(\n        \"--mlock\",\n        action=\"store_true\",\n        help=\"force system to keep model in RAM rather than swapping or compressing\",\n        dest=\"use_mlock\",\n    )\n    parser.add_argument(\n        \"--mtest\",\n        action=\"store_true\",\n        help=\"compute maximum memory usage\",\n        dest=\"mem_test\",\n    )\n    parser.add_argument(\n        \"--verbose-prompt\",\n        
action=\"store_true\",\n        help=\"print prompt before generation\",\n        dest=\"verbose_prompt\",\n    )\n\n    # Custom args\n    parser.add_argument(\n        \"--fix-prefix\",\n        type=str,\n        default=\"\",\n        help=\"append to input once n_predict tokens have been generated\",\n        dest=\"fix_prefix\",\n    )\n    parser.add_argument(\n        \"--input-noecho\",\n        action=\"store_false\",\n        help=\"do not echo the input\",\n        dest=\"input_echo\",\n    )\n\n    parser.add_argument(\n        \"--interactive-start\",\n        action=\"store_true\",\n        help=\"run in interactive mode\",\n        dest=\"interactive\",\n    )\n\n    args = parser.parse_args(argv)\n\n    logit_bias_str = args.logit_bias_str\n    delattr(args, \"logit_bias_str\")\n    params = GptParams(**vars(args))\n\n    if params.lora_adapter:\n        params.use_mmap = False\n\n    # Parse each TOKEN_ID(+/-)BIAS entry into the logit_bias map\n    if logit_bias_str is not None:\n        for i in logit_bias_str:\n            if m := re.match(r\"(\\d+)([-+]\\d+)\", i):\n                params.logit_bias[int(m.group(1))] = float(m.group(2))\n\n    return params\n\n\ndef gpt_random_prompt(rng):\n    return [\n        \"So\",\n        \"Once upon a time\",\n        \"When\",\n        \"The\",\n        \"After\",\n        \"If\",\n        \"import\",\n        \"He\",\n        \"She\",\n        \"They\",\n    ][rng % 10]\n\n\nif __name__ == \"__main__\":\n    print(gpt_params_parse())\n"
  },
  {
    "path": "examples/low_level_api/low_level_api_chat_cpp.py",
    "content": "\"\"\"\nThis is an example implementation of main.cpp from llama.cpp\nQuirks:\n * Its not exactly alike since this port is designed around programmatic I/O\n * Input is always echoed if on, so it should be turned off when using \"input()\"\n * The first antiprompt should be the userprompt like \"\\nUser:\", \n   because its added when n_predict is reached (aka generation ended prematurely)\n * n_predict can be set to -1 for unlimited length responses (or just a really high value)\n * Instruction mode adds its own antiprompt.\n   You should also still be feeding the model with a \"primer\" prompt that \n   shows it the expected format.\n\"\"\"\n\nimport ctypes\nimport sys\nfrom time import time\nfrom os import cpu_count, path\n\nimport llama_cpp\nfrom common import GptParams, gpt_params_parse, gpt_random_prompt\nimport util\n\n\n# A LLaMA interactive session\nclass LLaMAInteract:\n    def __init__(self, params: GptParams) -> None:\n        # input args\n        self.params = params\n        if self.params.path_session is None:\n            self.params.path_session = \"\"\n        if self.params.antiprompt is None:\n            self.params.antiprompt = \"\"\n\n        if self.params.perplexity:\n            raise NotImplementedError(\n                \"\"\"************\nplease use the 'perplexity' tool for perplexity calculations\n************\"\"\"\n            )\n\n        if self.params.embedding:\n            raise NotImplementedError(\n                \"\"\"************\nplease use the 'embedding' tool for embedding calculations\n************\"\"\"\n            )\n\n        if self.params.n_ctx > 2048:\n            print(\n                f\"\"\"warning: model does not support \\\ncontext sizes greater than 2048 tokens ({self.params.n_ctx} \\\nspecified) expect poor results\"\"\",\n                file=sys.stderr,\n            )\n\n        if self.params.seed <= 0:\n            self.params.seed = int(time())\n\n        print(f\"seed = 
{self.params.seed}\", file=sys.stderr)\n\n        if self.params.random_prompt:\n            self.params.prompt = gpt_random_prompt(self.params.seed)\n\n        # runtime args\n        self.input_consumed = 0\n        self.n_past = 0\n        self.n_session_consumed = 0\n        self.first_antiprompt = []\n        self.remaining_tokens = self.params.n_predict\n        self.output_echo = self.params.input_echo\n        self.multibyte_fix = []\n\n        # model load\n        self.lparams = llama_cpp.llama_model_default_params()\n        self.lparams.n_ctx = self.params.n_ctx\n        self.lparams.n_parts = self.params.n_parts\n        self.lparams.seed = self.params.seed\n        self.lparams.memory_f16 = self.params.memory_f16\n        self.lparams.use_mlock = self.params.use_mlock\n        self.lparams.use_mmap = self.params.use_mmap\n\n        self.model = llama_cpp.llama_load_model_from_file(\n            self.params.model.encode(\"utf8\"), self.lparams\n        )\n\n        # Context Params.\n        self.cparams = llama_cpp.llama_context_default_params()\n\n        self.ctx = llama_cpp.llama_new_context_with_model(self.model, self.cparams)\n        if not self.ctx:\n            raise RuntimeError(f\"error: failed to load model '{self.params.model}'\")\n\n        if self.params.ignore_eos:\n            self.params.logit_bias[llama_cpp.llama_token_eos()] = -float(\"inf\")\n\n        if len(self.params.lora_adapter) > 0:\n            if (\n                llama_cpp.llama_apply_lora_from_file(\n                    self.ctx,\n                    self.params.lora_adapter.encode(\"utf8\"),\n                    (\n                        self.params.lora_base.encode(\"utf8\")\n                        if len(self.params.lora_base) > 0\n                        else None\n                    ),\n                    self.params.n_threads,\n                )\n                != 0\n            ):\n                print(\"error: failed to apply lora adapter\")\n              
  return\n\n        print(file=sys.stderr)\n        print(\n            f\"system_info: n_threads = {self.params.n_threads} / {cpu_count()} \\\n| {llama_cpp.llama_print_system_info().decode('utf8')}\",\n            file=sys.stderr,\n        )\n\n        # determine the required inference memory per token:\n        if self.params.mem_test:\n            tmp = [0, 1, 2, 3]\n            llama_cpp.llama_eval(\n                self.ctx,\n                (llama_cpp.c_int * len(tmp))(*tmp),\n                len(tmp),\n                0,\n                self.params.n_threads,\n            )\n            llama_cpp.llama_print_timings(self.ctx)\n            self.exit()\n            return\n\n        # create internal context\n        self.n_ctx = llama_cpp.llama_n_ctx(self.ctx)\n\n        # Add a space in front of the first character to match OG llama tokenizer behavior\n        self.params.prompt = \" \" + self.params.prompt\n\n        # Load prompt file\n        if self.params.file:\n            with open(self.params.file) as f:\n                self.params.prompt = f.read()\n\n        self.session_tokens: list[llama_cpp.llama_token] = []\n        if len(self.params.path_session) > 0:\n            print(\n                f\"attempting to load saved session from '{self.params.path_session}'\",\n                file=sys.stderr,\n            )\n\n            if path.exists(self.params.path_session):\n                _session_tokens = (llama_cpp.llama_token * (self.params.n_ctx))()\n                _n_token_count_out = llama_cpp.c_size_t()\n                if (\n                    llama_cpp.llama_load_session_file(\n                        self.ctx,\n                        self.params.path_session.encode(\"utf8\"),\n                        _session_tokens,\n                        self.params.n_ctx,\n                        ctypes.byref(_n_token_count_out),\n                    )\n                    != 1\n                ):\n                    print(\n                        
f\"error: failed to load session file '{self.params.path_session}'\",\n                        file=sys.stderr,\n                    )\n                    return\n                _n_token_count_out = _n_token_count_out.value\n                self.session_tokens = _session_tokens[:_n_token_count_out]\n                print(\n                    f\"loaded a session with prompt size of {_n_token_count_out} tokens\",\n                    file=sys.stderr,\n                )\n            else:\n                print(f\"session file does not exist, will create\", file=sys.stderr)\n\n        # tokenize the prompt\n        self.embd = []\n        self.embd_inp = self._tokenize(self.params.prompt)\n\n        if len(self.embd_inp) > self.n_ctx - 4:\n            raise RuntimeError(\n                f\"error: prompt is too long ({len(self.embd_inp)} tokens, max {self.params.n_ctx - 4})\"\n            )\n\n        # debug message about similarity of saved session, if applicable\n        self.n_matching_session_tokens = 0\n        if len(self.session_tokens) > 0:\n            for id in self.session_tokens:\n                if (\n                    self.n_matching_session_tokens >= len(self.embd_inp)\n                    or id != self.embd_inp[self.n_matching_session_tokens]\n                ):\n                    break\n                self.n_matching_session_tokens += 1\n\n            if self.n_matching_session_tokens >= len(self.embd_inp):\n                print(f\"session file has exact match for prompt!\")\n            elif self.n_matching_session_tokens < (len(self.embd_inp) / 2):\n                print(\n                    f\"warning: session file has low similarity to prompt ({self.n_matching_session_tokens} / {len(self.embd_inp)} tokens); will mostly be reevaluated\"\n                )\n            else:\n                print(\n                    f\"session file matches {self.n_matching_session_tokens} / {len(self.embd_inp)} tokens of prompt\"\n                )\n\n 
       self.need_to_save_session = len(\n            self.params.path_session\n        ) > 0 and self.n_matching_session_tokens < (len(self.embd_inp) * 3 / 4)\n\n        # number of tokens to keep when resetting context\n        if (\n            self.params.n_keep < 0\n            or self.params.n_keep > len(self.embd_inp)\n            or self.params.instruct\n        ):\n            self.params.n_keep = len(self.embd_inp)\n\n        self.inp_prefix = self._tokenize(self.params.instruct_inp_prefix)\n        self.inp_suffix = self._tokenize(self.params.instruct_inp_suffix, False)\n\n        # in instruct mode, we inject a prefix and a suffix to each input by the user\n        self.antiecho = None\n        if self.params.instruct:\n            self.params.interactive_start = True\n            _ptn = self._tokenize(self.params.instruct_inp_prefix.strip(), False)\n            self.first_antiprompt.append(_ptn)\n            self.antiecho = util.IterSearch(_ptn)\n\n        # enable interactive mode if reverse prompt or interactive start is specified\n        if len(self.params.antiprompt) != 0 or self.params.interactive_start:\n            self.params.interactive = True\n\n        # determine newline token\n        self.llama_token_newline = self._tokenize(\"\\n\", False)\n        self.llama_token_eot = self._tokenize(\" [end of text]\\n\", False)\n\n        if self.params.verbose_prompt:\n            print(\n                f\"\"\"\nprompt: '{self.params.prompt}'\nnumber of tokens in prompt = {len(self.embd_inp)}\"\"\",\n                file=sys.stderr,\n            )\n\n            for i in range(len(self.embd_inp)):\n                print(\n                    f\"{self.embd_inp[i]} -> '{self.token_to_str(self.embd_inp[i])}'\",\n                    file=sys.stderr,\n                )\n\n            if self.params.n_keep > 0:\n                print(\"static prompt based on n_keep: '\")\n                for i in range(self.params.n_keep):\n                    
print(self.token_to_str(self.embd_inp[i]), file=sys.stderr)\n                print(\"'\", file=sys.stderr)\n            print(file=sys.stderr)\n\n        if self.params.interactive:\n            print(\"interactive mode on.\", file=sys.stderr)\n\n            if len(self.params.antiprompt) > 0:\n                for antiprompt in self.params.antiprompt:\n                    print(f\"Reverse prompt: '{antiprompt}'\", file=sys.stderr)\n\n            if len(self.params.input_prefix) > 0:\n                print(f\"Input prefix: '{self.params.input_prefix}'\", file=sys.stderr)\n\n        print(\n            f\"\"\"sampling: repeat_last_n = {self.params.repeat_last_n},\\\nrepeat_penalty = {self.params.repeat_penalty},\\\npresence_penalty = {self.params.presence_penalty},\\\nfrequency_penalty = {self.params.frequency_penalty},\\\ntop_k = {self.params.top_k},\\\ntfs_z = {self.params.tfs_z},\\\ntop_p = {self.params.top_p},\\\ntypical_p = {self.params.typical_p},\\\ntemp = {self.params.temp},\\\nmirostat = {self.params.mirostat},\\\nmirostat_lr = {self.params.mirostat_eta},\\\nmirostat_ent = {self.params.mirostat_tau},\\\n\ngenerate: n_ctx = {self.n_ctx},\\\nn_batch = {self.params.n_batch},\\\nn_predict = {self.params.n_predict},\\\nn_keep = {self.params.n_keep}\n\n\"\"\",\n            file=sys.stderr,\n        )\n\n        # determine antiprompt tokens\n        for i in self.params.antiprompt:\n            self.first_antiprompt.append(self._tokenize(i, False))\n\n        self.last_n_tokens = [0] * self.n_ctx  # TODO: deque doesnt support slices\n\n        if params.interactive:\n            print(\n                \"\"\"== Running in interactive mode. 
==\n - Press Ctrl+C to interject at any time.\n - Press Return to return control to LLaMa.\n - If you want to submit another line, end your input in '\\\\'.\n\n\"\"\",\n                file=sys.stderr,\n            )\n        self.set_color(util.CONSOLE_COLOR_PROMPT)\n\n    # tokenize a prompt\n    def _tokenize(self, prompt, bos=True):\n        _arr = (llama_cpp.llama_token * ((len(prompt) + 1) * 4))()\n        _n = llama_cpp.llama_tokenize(\n            self.model,\n            prompt.encode(\"utf8\", errors=\"ignore\"),\n            len(prompt),\n            _arr,\n            len(_arr),\n            bos,\n            False,\n        )\n        return _arr[:_n]\n\n    def set_color(self, c):\n        if self.params.use_color:\n            print(c, end=\"\")\n\n    def use_antiprompt(self):\n        return len(self.first_antiprompt) > 0\n\n    # generate tokens\n    def generate(self):\n        while (\n            self.remaining_tokens > 0\n            or self.params.interactive\n            or self.params.n_predict == -1\n        ):\n            # predict\n            if len(self.embd) > 0:\n                # infinite text generation via context swapping\n                # if we run out of context:\n                # - take the n_keep first tokens from the original prompt (via n_past)\n                # - take half of the last (n_ctx - n_keep) tokens and recompute the logits in a batch\n                if self.n_past + len(self.embd) > self.n_ctx:\n                    n_left = self.n_past - self.params.n_keep\n                    self.n_past = self.params.n_keep\n\n                    # insert n_left/2 tokens at the start of embd from last_n_tokens\n                    _insert = self.last_n_tokens[\n                        self.n_ctx - int(n_left / 2) - len(self.embd) : -len(self.embd)\n                    ]\n                    self.embd = _insert + self.embd\n                    self.params.path_session = \"\"\n\n                # try to reuse a matching 
prefix from the loaded session instead of re-eval (via n_past)\n                if self.n_session_consumed < len(self.session_tokens):\n                    for i in range(len(self.embd)):\n                        if self.embd[i] != self.session_tokens[self.n_session_consumed]:\n                            self.session_tokens = self.session_tokens[\n                                : self.n_session_consumed\n                            ]\n                            break\n\n                        self.n_past += 1\n                        self.n_session_consumed += 1\n\n                        if self.n_session_consumed >= len(self.session_tokens):\n                            i += 1\n                            break\n\n                    if i > 0:\n                        self.embd = self.embd[i:]\n\n                # evaluate tokens in batches\n                # embd is typically prepared beforehand to fit within a batch, but not always\n                # TODO BUG: The batching code causes nonsensical generation\n                \"\"\"for i in range(0, len(self.embd), self.params.n_batch):\n\t\t\t\t\tn_eval = self.params.n_batch\n\t\t\t\t\t_arr = (llama_cpp.llama_token * n_eval)(*self.embd[i:i + n_eval])\n\t\t\t\t\tif llama_cpp.llama_eval(self.ctx, _arr, n_eval, self.n_past, self.params.n_threads) != 0:\n\t\t\t\t\t\tprint(f\"failed to eval\")\n\t\t\t\t\t\treturn\n\t\t\t\t\t\n\t\t\t\t\tself.n_past += n_eval\"\"\"\n\n                if (\n                    llama_cpp.llama_eval(\n                        self.ctx,\n                        (llama_cpp.llama_token * len(self.embd))(*self.embd),\n                        len(self.embd),\n                        self.n_past,\n                    )\n                    != 0\n                ):\n                    raise Exception(\"Failed to llama_eval!\")\n\n                if len(self.embd) > 0 and len(self.params.path_session) > 0:\n                    self.session_tokens.extend(self.embd)\n                    
self.n_session_consumed = len(self.session_tokens)\n\n            self.n_past += len(self.embd)\n            self.embd = []\n            if len(self.embd_inp) <= self.input_consumed:  # && !is_interacting\n                # out of user input, sample next token\n                top_k = (\n                    llama_cpp.llama_n_vocab(self.ctx)\n                    if self.params.top_k <= 0\n                    else self.params.top_k\n                )\n                repeat_last_n = (\n                    self.n_ctx\n                    if self.params.repeat_last_n < 0\n                    else self.params.repeat_last_n\n                )\n\n                # optionally save the session on first sample (for faster prompt loading next time)\n                if len(self.params.path_session) > 0 and self.need_to_save_session:\n                    self.need_to_save_session = False\n                    llama_cpp.llama_save_session_file(\n                        self.ctx,\n                        self.params.path_session.encode(\"utf8\"),\n                        (llama_cpp.llama_token * len(self.session_tokens))(\n                            *self.session_tokens\n                        ),\n                        len(self.session_tokens),\n                    )\n\n                id = 0\n\n                logits = llama_cpp.llama_get_logits(self.ctx)\n                n_vocab = llama_cpp.llama_n_vocab(self.model)\n\n                # Apply params.logit_bias map\n                for key, value in self.params.logit_bias.items():\n                    logits[key] += value\n\n                _arr = (llama_cpp.llama_token_data * n_vocab)(\n                    *[\n                        llama_cpp.llama_token_data(token_id, logits[token_id], 0.0)\n                        for token_id in range(n_vocab)\n                    ]\n                )\n                candidates_p = llama_cpp.ctypes.pointer(\n                    llama_cpp.llama_token_data_array(_arr, len(_arr), False)\n  
              )\n\n                # Apply penalties\n                nl_logit = logits[llama_cpp.llama_token_nl(self.ctx)]\n                last_n_repeat = min(len(self.last_n_tokens), repeat_last_n, self.n_ctx)\n\n                _arr = (llama_cpp.llama_token * last_n_repeat)(\n                    *self.last_n_tokens[len(self.last_n_tokens) - last_n_repeat :]\n                )\n                llama_cpp.llama_sample_repetition_penalties(\n                    ctx=self.ctx,\n                    candidates=candidates_p,\n                    last_tokens_data=_arr,\n                    penalty_last_n=last_n_repeat,\n                    penalty_repeat=llama_cpp.c_float(self.params.repeat_penalty),\n                    penalty_freq=llama_cpp.c_float(self.params.frequency_penalty),\n                    penalty_present=llama_cpp.c_float(self.params.presence_penalty),\n                )\n\n                # NOT PRESENT IN CURRENT VERSION ?\n                # llama_cpp.llama_sample_frequency_and_presence_penalti(self.ctx, candidates_p,\n                # \t_arr,\n                # \tlast_n_repeat, llama_cpp.c_float(self.params.frequency_penalty), llama_cpp.c_float(self.params.presence_penalty))\n\n                if not self.params.penalize_nl:\n                    logits[llama_cpp.llama_token_nl()] = nl_logit\n\n                if self.params.temp <= 0:\n                    # Greedy sampling\n                    id = llama_cpp.llama_sample_token_greedy(self.ctx, candidates_p)\n                else:\n                    if self.params.mirostat == 1:\n                        mirostat_mu = 2.0 * self.params.mirostat_tau\n                        mirostat_m = 100\n                        llama_cpp.llama_sample_temperature(\n                            self.ctx, candidates_p, llama_cpp.c_float(self.params.temp)\n                        )\n                        id = llama_cpp.llama_sample_token_mirostat(\n                            self.ctx,\n                            
candidates_p,\n                            llama_cpp.c_float(self.params.mirostat_tau),\n                            llama_cpp.c_float(self.params.mirostat_eta),\n                            llama_cpp.c_int(mirostat_m),\n                            llama_cpp.c_float(mirostat_mu),\n                        )\n                    elif self.params.mirostat == 2:\n                        mirostat_mu = 2.0 * self.params.mirostat_tau\n                        llama_cpp.llama_sample_temperature(\n                            self.ctx, candidates_p, llama_cpp.c_float(self.params.temp)\n                        )\n                        id = llama_cpp.llama_sample_token_mirostat_v2(\n                            self.ctx,\n                            candidates_p,\n                            llama_cpp.c_float(self.params.mirostat_tau),\n                            llama_cpp.c_float(self.params.mirostat_eta),\n                            llama_cpp.c_float(mirostat_mu),\n                        )\n                    else:\n                        # Temperature sampling\n                        llama_cpp.llama_sample_top_k(\n                            self.ctx,\n                            candidates_p,\n                            top_k,\n                            min_keep=llama_cpp.c_size_t(1),\n                        )\n                        llama_cpp.llama_sample_tail_free(\n                            self.ctx,\n                            candidates_p,\n                            llama_cpp.c_float(self.params.tfs_z),\n                            min_keep=llama_cpp.c_size_t(1),\n                        )\n                        llama_cpp.llama_sample_typical(\n                            self.ctx,\n                            candidates_p,\n                            llama_cpp.c_float(self.params.typical_p),\n                            min_keep=llama_cpp.c_size_t(1),\n                        )\n                        llama_cpp.llama_sample_top_p(\n                
            self.ctx,\n                            candidates_p,\n                            llama_cpp.c_float(self.params.top_p),\n                            min_keep=llama_cpp.c_size_t(1),\n                        )\n                        llama_cpp.llama_sample_temperature(\n                            self.ctx, candidates_p, llama_cpp.c_float(self.params.temp)\n                        )\n                        id = llama_cpp.llama_sample_token(self.ctx, candidates_p)\n                # print(\"`{}`\".format(candidates_p.size))\n\n                self.last_n_tokens.pop(0)\n                self.last_n_tokens.append(id)\n\n                # replace end of text token with newline token when in interactive mode\n                if (\n                    id == llama_cpp.llama_token_eos(self.ctx)\n                    and self.params.interactive\n                    and not self.params.instruct\n                ):\n                    id = self.llama_token_newline[0]\n                    self.embd.append(id)\n                    if self.use_antiprompt():\n                        # tokenize and inject first reverse prompt\n                        self.embd_inp += self.first_antiprompt[0]\n                        for id in self.first_antiprompt[0]:\n                            self.embd.append(id)\n                else:\n                    # add it to the context\n                    self.embd.append(id)\n\n                # echo this to console\n                self.output_echo = True\n\n                # decrement remaining sampling budget\n                self.remaining_tokens -= 1\n            else:\n                # output to console if input echo is on\n                self.output_echo = self.params.input_echo\n\n                # some user input remains from prompt or interaction, forward it to processing\n                while len(self.embd_inp) > self.input_consumed:\n                    self.embd.append(self.embd_inp[self.input_consumed])\n                
    self.last_n_tokens.pop(0)\n                    self.last_n_tokens.append(self.embd_inp[self.input_consumed])\n                    self.input_consumed += 1\n                    if len(self.embd) >= self.params.n_batch:\n                        break\n\n            # display tokens\n            if self.output_echo:\n                for id in self.embd:\n                    if self.antiecho != None:\n                        for r in self.antiecho(id):\n                            yield r\n                    else:\n                        yield id\n\n            # reset color to default if we there is no pending user input\n            if self.params.input_echo and len(self.embd_inp) == self.input_consumed:\n                self.set_color(util.CONSOLE_COLOR_DEFAULT)\n\n            if self.params.interactive and len(self.embd_inp) <= self.input_consumed:\n                # if antiprompt is present, stop\n                if self.use_antiprompt():\n                    if True in [\n                        i == self.last_n_tokens[-len(i) :]\n                        for i in self.first_antiprompt\n                    ]:\n                        break\n\n                # if we are using instruction mode, and we have processed the initial prompt\n                if self.params.interactive_start:\n                    break\n\n            # end of text token\n            if len(self.embd) > 0 and self.embd[-1] == llama_cpp.llama_token_eos(\n                self.ctx\n            ):\n                if not self.params.instruct:\n                    for i in self.llama_token_eot:\n                        yield i\n                    break\n\n            # respect n_predict even if antiprompt is present\n            if (\n                self.params.interactive\n                and self.remaining_tokens <= 0\n                and self.params.n_predict != -1\n            ):\n                # If we arent in instruction mode, fix the current generation by appending the 
antiprompt.\n                # Makes it so if chat ends prematurely you dont append the AI's text etc.\n                if not self.params.instruct:\n                    self.embd_inp += self.first_antiprompt[0]\n                self.n_remain = self.params.n_predict\n                break\n\n        self.params.interactive_start = False\n\n    def __enter__(self):\n        return self\n\n    def __exit__(self, type, value, tb):\n        self.exit()\n\n    def exit(self):\n        llama_cpp.llama_free(self.ctx)\n        self.set_color(util.CONSOLE_COLOR_DEFAULT)\n\n    def token_to_str(self, token_id: int) -> bytes:\n        size = 32\n        buffer = (ctypes.c_char * size)()\n        n = llama_cpp.llama_token_to_piece(\n            self.model, llama_cpp.llama_token(token_id), buffer, size\n        )\n        assert n <= size\n        return bytes(buffer[:n])\n\n    # return past text\n    def past(self):\n        for id in self.last_n_tokens[-self.n_past :]:\n            yield self.token_to_str(id).decode(\"utf8\", errors=\"ignore\")\n\n    # write input\n    def input(self, prompt: str):\n        if (\n            self.params.instruct\n            and self.last_n_tokens[-len(self.inp_prefix) :] != self.inp_prefix\n        ):\n            self.embd_inp += self.inp_prefix\n        self.embd_inp += self._tokenize(prompt)\n        if self.params.instruct:\n            self.embd_inp += self.inp_suffix\n\n    # write output\n    def output(self):\n        self.remaining_tokens = self.params.n_predict\n        for id in self.generate():\n            cur_char = self.token_to_str(id)\n\n            # Add remainder of missing bytes\n            if None in self.multibyte_fix:\n                self.multibyte_fix[self.multibyte_fix.index(None)] = cur_char\n\n            # Return completed utf char\n            if len(self.multibyte_fix) > 0 and not None in self.multibyte_fix:\n                yield (b\"\".join(self.multibyte_fix)).decode(\"utf8\")\n                
self.multibyte_fix = []\n                continue\n\n            # Contains multi-byte UTF8\n            for num, pattern in [(2, 192), (3, 224), (4, 240)]:\n                # Bitwise AND check\n                if pattern & int.from_bytes(cur_char, \"little\") == pattern:\n                    self.multibyte_fix = [cur_char] + ([None] * (num - 1))\n\n            # Stop incomplete bytes from passing\n            if len(self.multibyte_fix) > 0:\n                continue\n\n            yield cur_char.decode(\"utf8\")\n\n    # read user input\n    def read_input(self):\n        out = \"\"\n        while (t := input()).endswith(\"\\\\\"):\n            out += t[:-1] + \"\\n\"\n        return out + t + \"\\n\"\n\n    # interactive mode\n    def interact(self):\n        for i in self.output():\n            print(i, end=\"\", flush=True)\n        self.params.input_echo = False\n\n        # Using string instead of tokens to check for antiprompt,\n        # It is more reliable than tokens for interactive mode.\n        generated_str = \"\"\n        while self.params.interactive:\n            self.set_color(util.CONSOLE_COLOR_USER_INPUT)\n            if self.params.instruct:\n                print(\"\\n> \", end=\"\")\n                self.input(self.read_input())\n            else:\n                print(self.params.input_prefix, end=\"\")\n                self.input(\n                    f\"{self.params.input_prefix}{self.read_input()}{self.params.input_suffix}\"\n                )\n                print(self.params.input_suffix, end=\"\")\n            self.set_color(util.CONSOLE_COLOR_DEFAULT)\n\n            try:\n                for i in self.output():\n                    print(i, end=\"\", flush=True)\n                    generated_str += i\n                    for ap in self.params.antiprompt:\n                        if generated_str.endswith(ap):\n                            raise KeyboardInterrupt\n            except KeyboardInterrupt:\n                
self.set_color(util.CONSOLE_COLOR_DEFAULT)\n                if not self.params.instruct:\n                    print(self.params.fix_prefix, end=\"\")\n                    self.input(self.params.fix_prefix)\n\n\nif __name__ == \"__main__\":\n    from datetime import datetime\n\n    USER_NAME = \"User\"\n    AI_NAME = \"ChatLLaMa\"\n\n    time_now = datetime.now()\n    prompt = f\"\"\"Text transcript of a never ending dialog, where {USER_NAME} interacts with an AI assistant named {AI_NAME}.\n{AI_NAME} is helpful, kind, honest, friendly, good at writing and never fails to answer {USER_NAME}’s requests immediately and with details and precision.\nTranscript below contains only the recorded dialog between two, without any annotations like (30 seconds passed...) or (to himself), just what {USER_NAME} and {AI_NAME} say aloud to each other.\nThe dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.\nThe transcript only includes text, it does not include markup like HTML and Markdown.\n\n{USER_NAME}: Hello, {AI_NAME}!\n{AI_NAME}: Hello {USER_NAME}! How may I help you today?\n{USER_NAME}: What time is it?\n{AI_NAME}: It is {time_now.strftime(\"%H:%M\")}.\n{USER_NAME}: What year is it?\n{AI_NAME}: We are in {time_now.strftime(\"%Y\")}.\n{USER_NAME}: What is a cat?\n{AI_NAME}: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.\n{USER_NAME}: Name a color.\n{AI_NAME}: Blue\n{USER_NAME}:   \"\"\"\n\n    params = gpt_params_parse()\n    if params.prompt is None and params.file is None:\n        params.prompt = prompt\n\n    with LLaMAInteract(params) as m:\n        m.interact()\n"
  },
  {
    "path": "examples/low_level_api/low_level_api_llama_cpp.py",
    "content": "import ctypes\nimport os\nimport multiprocessing\n\nimport llama_cpp\n\nllama_cpp.llama_backend_init(numa=False)\n\nN_THREADS = multiprocessing.cpu_count()\nMODEL_PATH = os.environ.get(\"MODEL\", \"../models/7B/ggml-model.bin\")\n\nprompt = b\"\\n\\n### Instruction:\\nWhat is the capital of France?\\n\\n### Response:\\n\"\n\nlparams = llama_cpp.llama_model_default_params()\ncparams = llama_cpp.llama_context_default_params()\nmodel = llama_cpp.llama_load_model_from_file(MODEL_PATH.encode(\"utf-8\"), lparams)\nctx = llama_cpp.llama_new_context_with_model(model, cparams)\n\n# determine the required inference memory per token:\ntmp = [0, 1, 2, 3]\nllama_cpp.llama_eval(\n    ctx=ctx, tokens=(llama_cpp.c_int * len(tmp))(*tmp), n_tokens=len(tmp), n_past=0\n)  # Deprecated\n\nn_past = 0\n\nprompt = b\" \" + prompt\n\n# prompt is already bytes, so pass it and its byte length directly\nembd_inp = (llama_cpp.llama_token * (len(prompt) + 1))()\nn_of_tok = llama_cpp.llama_tokenize(\n    model=model,\n    text=prompt,\n    text_len=len(prompt),\n    tokens=embd_inp,\n    n_max_tokens=len(embd_inp),\n    add_bos=False,\n    special=False,\n)\nembd_inp = embd_inp[:n_of_tok]\n\nn_ctx = llama_cpp.llama_n_ctx(ctx)\n\nn_predict = 20\nn_predict = min(n_predict, n_ctx - len(embd_inp))\n\ninput_consumed = 0\ninput_noecho = False\n\nremaining_tokens = n_predict\n\nembd = []\nlast_n_size = 64\nlast_n_tokens_data = [0] * last_n_size\nn_batch = 24\nlast_n_repeat = 64\nrepeat_penalty = 1\nfrequency_penalty = 0.0\npresence_penalty = 0.0\n\nwhile remaining_tokens > 0:\n    if len(embd) > 0:\n        llama_cpp.llama_eval(\n            ctx=ctx,\n            tokens=(llama_cpp.c_int * len(embd))(*embd),\n            n_tokens=len(embd),\n            n_past=n_past,\n        )  # Deprecated\n\n    n_past += len(embd)\n    embd = []\n    if len(embd_inp) <= input_consumed:\n        logits = llama_cpp.llama_get_logits(ctx)\n        n_vocab = llama_cpp.llama_n_vocab(model)\n\n        _arr = (llama_cpp.llama_token_data * 
n_vocab)(\n            *[\n                llama_cpp.llama_token_data(token_id, logits[token_id], 0.0)\n                for token_id in range(n_vocab)\n            ]\n        )\n        candidates_p = llama_cpp.ctypes.pointer(\n            llama_cpp.llama_token_data_array(_arr, len(_arr), False)\n        )\n\n        _arr = (llama_cpp.c_int * len(last_n_tokens_data))(*last_n_tokens_data)\n        llama_cpp.llama_sample_repetition_penalties(\n            ctx,\n            candidates_p,\n            _arr,\n            penalty_last_n=last_n_repeat,\n            penalty_repeat=repeat_penalty,\n            penalty_freq=frequency_penalty,\n            penalty_present=presence_penalty,\n        )\n\n        llama_cpp.llama_sample_top_k(ctx, candidates_p, k=40, min_keep=1)\n        llama_cpp.llama_sample_top_p(ctx, candidates_p, p=0.8, min_keep=1)\n        llama_cpp.llama_sample_temperature(ctx, candidates_p, temp=0.2)\n        id = llama_cpp.llama_sample_token(ctx, candidates_p)\n\n        last_n_tokens_data = last_n_tokens_data[1:] + [id]\n        embd.append(id)\n        input_noecho = False\n        remaining_tokens -= 1\n    else:\n        while len(embd_inp) > input_consumed:\n            embd.append(embd_inp[input_consumed])\n            last_n_tokens_data = last_n_tokens_data[1:] + [embd_inp[input_consumed]]\n            input_consumed += 1\n            if len(embd) >= n_batch:\n                break\n    if not input_noecho:\n        for id in embd:\n            size = 32\n            buffer = (ctypes.c_char * size)()\n            n = llama_cpp.llama_token_to_piece(\n                model, llama_cpp.llama_token(id), buffer, size\n            )\n            assert n <= size\n            print(\n                buffer[:n].decode(\"utf-8\"),\n                end=\"\",\n                flush=True,\n            )\n\n    if len(embd) > 0 and embd[-1] == llama_cpp.llama_token_eos(ctx):\n        
break\n\nprint()\n\nllama_cpp.llama_print_timings(ctx)\n\nllama_cpp.llama_free(ctx)\n"
  },
  {
    "path": "examples/low_level_api/quantize.py",
    "content": "import os\nimport argparse\nimport llama_cpp\n\n\ndef main(args):\n    fname_inp = args.fname_inp.encode(\"utf-8\")\n    fname_out = args.fname_out.encode(\"utf-8\")\n    if not os.path.exists(fname_inp):\n        raise RuntimeError(f\"Input file does not exist ({args.fname_inp})\")\n    if os.path.exists(fname_out):\n        raise RuntimeError(f\"Output file already exists ({args.fname_out})\")\n    # Start from the library defaults and override only the quantization type\n    params = llama_cpp.llama_model_quantize_default_params()\n    params.ftype = args.type\n    return_code = llama_cpp.llama_model_quantize(fname_inp, fname_out, params)\n    if return_code != 0:\n        raise RuntimeError(\"Failed to quantize model\")\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"fname_inp\", type=str, help=\"Path to input model\")\n    parser.add_argument(\"fname_out\", type=str, help=\"Path to output model\")\n    parser.add_argument(\n        \"type\",\n        type=int,\n        help=\"Type of quantization (2: q4_0, 3: q4_1), see llama_cpp.py for enum\",\n    )\n    args = parser.parse_args()\n    main(args)\n"
  },
  {
    "path": "examples/low_level_api/readme/low_level_api_llama_cpp.md",
    "content": "# Low-Level API for Llama_cpp\n\n## Overview\nThis Python script, low_level_api_llama_cpp.py, demonstrates how to use the low-level API of llama-cpp-python to interact with the llama.cpp library. The script runs inference that generates text from a given prompt using a .gguf model.\n\n### Prerequisites\nBefore running the script, ensure that you have the following dependencies installed:\n\n- Python 3.6 or higher\n- llama-cpp-python: Python bindings for llama.cpp, a C++ library for working with .gguf models\n- NumPy: A fundamental package for scientific computing with Python\n- multiprocessing: A Python module for parallel computing (part of the standard library)\n\n### Usage\nInstall dependencies:\n```bash\npython -m pip install llama-cpp-python\n```\nRun the script:\n```bash\npython low_level_api_llama_cpp.py\n```\n\n## Code Structure\nThe script is organized as follows:\n\n### Initialization\n- Load the model from the specified path.\n- Create a context for model evaluation.\n\n### Tokenization\n- Tokenize the input prompt using the llama_tokenize function.\n- Prepare the input tokens for model evaluation.\n\n### Inference\n- Perform model evaluation to generate responses.\n- Sample from the model's output using various strategies (top-k, top-p, temperature).\n\n### Output\n- Print the decoded text of each token as it is processed.\n\n### Cleanup\n- Free resources and print timing information.\n\n## Configuration\nCustomize the inference behavior by adjusting the following variables:\n\n- N_THREADS: Number of CPU threads to use for model evaluation.\n- MODEL_PATH: Path to the model file, read from the MODEL environment variable.\n- prompt: Input prompt for the model.\n\n## Notes\n- Ensure that the llama_cpp library is built and available on the system. Follow the instructions in the llama_cpp repository for building and installing the library.\n\n- This script is designed to work with .gguf models and may require modifications for compatibility with other model formats.\n\n## Acknowledgments\nThis code is based on the llama_cpp library developed by the community. Special thanks to the contributors for their efforts.\n\n## License\nThis project is licensed under the MIT License - see the LICENSE file for details."
  },
  {
    "path": "examples/low_level_api/util.py",
    "content": "ANSI_COLOR_RESET = \"\\x1b[0m\"\nANSI_COLOR_YELLOW = \"\\x1b[33m\"\nANSI_BOLD = \"\\x1b[1m\"\nANSI_COLOR_GREEN = \"\\x1b[32m\"\n\nCONSOLE_COLOR_DEFAULT = ANSI_COLOR_RESET\nCONSOLE_COLOR_PROMPT = ANSI_COLOR_YELLOW\nCONSOLE_COLOR_USER_INPUT = ANSI_BOLD + ANSI_COLOR_GREEN\n\n\n# Iterative search\n# Actively searches and prevents a pattern from being returned\nclass IterSearch:\n    def __init__(self, pattern):\n        self.pattern = list(pattern)\n        self.buffer = []\n\n    def __call__(self, char):\n        self.buffer += [char]\n\n        if self.pattern[: len(self.buffer)] == self.buffer:\n            if len(self.buffer) >= len(self.pattern):\n                self.buffer.clear()\n            return []\n\n        _tmp = self.buffer[:]\n        self.buffer.clear()\n        return _tmp\n\n\nclass Circle:\n    def __init__(self, size, default=0):\n        self.list = [default] * size\n        self.maxsize = size\n        self.size = 0\n        self.offset = 0\n\n    def append(self, elem):\n        if self.size < self.maxsize:\n            self.list[self.size] = elem\n            self.size += 1\n        else:\n            self.list[self.offset] = elem\n            self.offset = (self.offset + 1) % self.maxsize\n\n    def __getitem__(self, val):\n        if isinstance(val, int):\n            if 0 > val or val >= self.size:\n                raise IndexError(\"Index out of range\")\n            return (\n                self.list[val]\n                if self.size < self.maxsize\n                else self.list[(self.offset + val) % self.maxsize]\n            )\n        elif isinstance(val, slice):\n            start, stop, step = val.start, val.stop, val.step\n            if step is None:\n                step = 1\n            if start is None:\n                start = 0\n            if stop is None:\n                stop = self.size\n            if start < 0:\n                start = self.size + start\n            if stop < 0:\n                stop 
= self.size + stop\n\n            indices = range(start, stop, step)\n            return [\n                self.list[(self.offset + i) % self.maxsize]\n                for i in indices\n                if i < self.size\n            ]\n        else:\n            raise TypeError(\"Invalid argument type\")\n\n\nif __name__ == \"__main__\":\n    c = Circle(5)\n\n    c.append(1)\n    print(c.list)\n    print(c[:])\n    assert c[0] == 1\n    assert c[:5] == [1]\n\n    for i in range(2, 5 + 1):\n        c.append(i)\n    print(c.list)\n    print(c[:])\n    assert c[0] == 1\n    assert c[:5] == [1, 2, 3, 4, 5]\n\n    for i in range(5 + 1, 9 + 1):\n        c.append(i)\n    print(c.list)\n    print(c[:])\n    assert c[0] == 5\n    assert c[:5] == [5, 6, 7, 8, 9]\n    # assert c[:-5] == [5,6,7,8,9]\n    assert c[:10] == [5, 6, 7, 8, 9]\n"
  },
  {
    "path": "examples/notebooks/Batching.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import ctypes\\n\",\n    \"import llama_cpp\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"llama_cpp.llama_backend_init(numa=False)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /workspaces/llama-cpp-python/mistral-7b-v0.1.Q2_K.gguf (version GGUF V2)\\n\",\n      \"llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.\\n\",\n      \"llama_model_loader: - kv   0:                       general.architecture str              = llama\\n\",\n      \"llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-v0.1\\n\",\n      \"llama_model_loader: - kv   2:                       llama.context_length u32              = 32768\\n\",\n      \"llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096\\n\",\n      \"llama_model_loader: - kv   4:                          llama.block_count u32              = 32\\n\",\n      \"llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336\\n\",\n      \"llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128\\n\",\n      \"llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32\\n\",\n      \"llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8\\n\",\n      \"llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32           
   = 0.000010\\n\",\n      \"llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000\\n\",\n      \"llama_model_loader: - kv  11:                          general.file_type u32              = 10\\n\",\n      \"llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = [\\\"<unk>\\\", \\\"<s>\\\", \\\"</s>\\\", \\\"<0x00>\\\", \\\"<...\\n\",\n      \"llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...\\n\",\n      \"llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...\\n\",\n      \"llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1\\n\",\n      \"llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2\\n\",\n      \"llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0\\n\",\n      \"llama_model_loader: - kv  19:               general.quantization_version u32              = 2\\n\",\n      \"llama_model_loader: - type  f32:   65 tensors\\n\",\n      \"llama_model_loader: - type q2_K:   65 tensors\\n\",\n      \"llama_model_loader: - type q3_K:  160 tensors\\n\",\n      \"llama_model_loader: - type q6_K:    1 tensors\\n\",\n      \"llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect\\n\",\n      \"llm_load_vocab: special tokens cache size = 3\\n\",\n      \"llm_load_vocab: token to piece cache size = 0.1637 MB\\n\",\n      \"llm_load_print_meta: format           = GGUF V2\\n\",\n      \"llm_load_print_meta: arch             = llama\\n\",\n      
\"llm_load_print_meta: vocab type       = SPM\\n\",\n      \"llm_load_print_meta: n_vocab          = 32000\\n\",\n      \"llm_load_print_meta: n_merges         = 0\\n\",\n      \"llm_load_print_meta: vocab_only       = 0\\n\",\n      \"llm_load_print_meta: n_ctx_train      = 32768\\n\",\n      \"llm_load_print_meta: n_embd           = 4096\\n\",\n      \"llm_load_print_meta: n_layer          = 32\\n\",\n      \"llm_load_print_meta: n_head           = 32\\n\",\n      \"llm_load_print_meta: n_head_kv        = 8\\n\",\n      \"llm_load_print_meta: n_rot            = 128\\n\",\n      \"llm_load_print_meta: n_swa            = 0\\n\",\n      \"llm_load_print_meta: n_embd_head_k    = 128\\n\",\n      \"llm_load_print_meta: n_embd_head_v    = 128\\n\",\n      \"llm_load_print_meta: n_gqa            = 4\\n\",\n      \"llm_load_print_meta: n_embd_k_gqa     = 1024\\n\",\n      \"llm_load_print_meta: n_embd_v_gqa     = 1024\\n\",\n      \"llm_load_print_meta: f_norm_eps       = 0.0e+00\\n\",\n      \"llm_load_print_meta: f_norm_rms_eps   = 1.0e-05\\n\",\n      \"llm_load_print_meta: f_clamp_kqv      = 0.0e+00\\n\",\n      \"llm_load_print_meta: f_max_alibi_bias = 0.0e+00\\n\",\n      \"llm_load_print_meta: f_logit_scale    = 0.0e+00\\n\",\n      \"llm_load_print_meta: n_ff             = 14336\\n\",\n      \"llm_load_print_meta: n_expert         = 0\\n\",\n      \"llm_load_print_meta: n_expert_used    = 0\\n\",\n      \"llm_load_print_meta: causal attn      = 1\\n\",\n      \"llm_load_print_meta: pooling type     = 0\\n\",\n      \"llm_load_print_meta: rope type        = 0\\n\",\n      \"llm_load_print_meta: rope scaling     = linear\\n\",\n      \"llm_load_print_meta: freq_base_train  = 10000.0\\n\",\n      \"llm_load_print_meta: freq_scale_train = 1\\n\",\n      \"llm_load_print_meta: n_ctx_orig_yarn  = 32768\\n\",\n      \"llm_load_print_meta: rope_finetuned   = unknown\\n\",\n      \"llm_load_print_meta: ssm_d_conv       = 0\\n\",\n      \"llm_load_print_meta: ssm_d_inner   
   = 0\\n\",\n      \"llm_load_print_meta: ssm_d_state      = 0\\n\",\n      \"llm_load_print_meta: ssm_dt_rank      = 0\\n\",\n      \"llm_load_print_meta: ssm_dt_b_c_rms   = 0\\n\",\n      \"llm_load_print_meta: model type       = 7B\\n\",\n      \"llm_load_print_meta: model ftype      = Q2_K - Medium\\n\",\n      \"llm_load_print_meta: model params     = 7.24 B\\n\",\n      \"llm_load_print_meta: model size       = 2.87 GiB (3.41 BPW) \\n\",\n      \"llm_load_print_meta: general.name     = mistralai_mistral-7b-v0.1\\n\",\n      \"llm_load_print_meta: BOS token        = 1 '<s>'\\n\",\n      \"llm_load_print_meta: EOS token        = 2 '</s>'\\n\",\n      \"llm_load_print_meta: UNK token        = 0 '<unk>'\\n\",\n      \"llm_load_print_meta: LF token         = 13 '<0x0A>'\\n\",\n      \"llm_load_print_meta: EOG token        = 2 '</s>'\\n\",\n      \"llm_load_print_meta: max token length = 48\\n\",\n      \"llm_load_tensors: ggml ctx size =    0.14 MiB\\n\",\n      \"llm_load_tensors:        CPU buffer size =  2939.57 MiB\\n\",\n      \"..................................................................................................\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"params = llama_cpp.llama_model_default_params()\\n\",\n    \"params.n_gpu_layers = 35\\n\",\n    \"model = llama_cpp.llama_load_model_from_file(\\n\",\n    \"    b\\\"/workspaces/llama-cpp-python/mistral-7b-v0.1.Q2_K.gguf\\\", params\\n\",\n    \")  # Update this to whatever\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[1, 415, 2936, 9060, 285, 1142]\\n\",\n      \"58\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"n_ctx = 512\\n\",\n    \"n_len = 32\\n\",\n    \"n_parallel = 2\\n\",\n    \"prompt = b\\\"The quick brown fox\\\"\\n\",\n    \"\\n\",\n    \"tokens = (llama_cpp.llama_token * n_ctx)()\\n\",\n    
\"tokens_len = llama_cpp.llama_tokenize(\\n\",\n    \"    model, prompt, len(prompt), tokens, len(tokens), True, True\\n\",\n    \")\\n\",\n    \"print(tokens[:tokens_len])\\n\",\n    \"\\n\",\n    \"n_kv_req = tokens_len + (n_len - tokens_len) * n_parallel\\n\",\n    \"print(n_kv_req)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_new_context_with_model: n_ctx      = 64\\n\",\n      \"llama_new_context_with_model: n_batch    = 32\\n\",\n      \"llama_new_context_with_model: n_ubatch   = 32\\n\",\n      \"llama_new_context_with_model: flash_attn = 0\\n\",\n      \"llama_new_context_with_model: freq_base  = 10000.0\\n\",\n      \"llama_new_context_with_model: freq_scale = 1\\n\",\n      \"llama_kv_cache_init:        CPU KV buffer size =     8.00 MiB\\n\",\n      \"llama_new_context_with_model: KV self size  =    8.00 MiB, K (f16):    4.00 MiB, V (f16):    4.00 MiB\\n\",\n      \"llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB\\n\",\n      \"llama_new_context_with_model:        CPU compute buffer size =     5.01 MiB\\n\",\n      \"llama_new_context_with_model: graph nodes  = 1030\\n\",\n      \"llama_new_context_with_model: graph splits = 1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"ctx_params = llama_cpp.llama_context_default_params()\\n\",\n    \"ctx_params.seed = 1234\\n\",\n    \"ctx_params.n_ctx = n_kv_req\\n\",\n    \"ctx_params.n_batch = max(n_len, n_parallel)\\n\",\n    \"ctx_params.n_threads = 1\\n\",\n    \"ctx_params.n_threads_batch = 1\\n\",\n    \"ctx = llama_cpp.llama_new_context_with_model(model, ctx_params)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"n_ctx = llama_cpp.llama_n_ctx(ctx)\\n\",\n    \"batch = llama_cpp.llama_batch_init(max(tokens_len, 
n_parallel), 0, 1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"\\n\",\n    \"\\n\",\n    \"batch.n_tokens = tokens_len\\n\",\n    \"for i in range(tokens_len):\\n\",\n    \"    batch.token[i] = tokens[i]\\n\",\n    \"    batch.pos[i] = i\\n\",\n    \"    batch.seq_id[i][0] = 0\\n\",\n    \"    batch.n_seq_id[i] = 1\\n\",\n    \"    batch.logits[i] = False\\n\",\n    \"\\n\",\n    \"batch.logits[batch.n_tokens - 1] = True\\n\",\n    \"\\n\",\n    \"if llama_cpp.llama_decode(ctx, batch) != 0:\\n\",\n    \"    print(\\\"Error decoding\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"for i in range(n_parallel):\\n\",\n    \"    llama_cpp.llama_kv_cache_seq_cp(ctx, 0, i, 0, batch.n_tokens)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"\\n\",\n    \"# Initialize sampler chain with default parameters\\n\",\n    \"sparams = llama_cpp.llama_sampler_chain_default_params()\\n\",\n    \"sampler_chain = llama_cpp.llama_sampler_chain_init(sparams)\\n\",\n    \"\\n\",\n    \"# Add top_k, top_p, temperature, and final distribution-based sampler\\n\",\n    \"llama_cpp.llama_sampler_chain_add(sampler_chain, llama_cpp.llama_sampler_init_top_k(40))\\n\",\n    \"llama_cpp.llama_sampler_chain_add(sampler_chain, llama_cpp.llama_sampler_init_top_p(0.9, 1))\\n\",\n    \"llama_cpp.llama_sampler_chain_add(sampler_chain, llama_cpp.llama_sampler_init_temp(0.4))\\n\",\n    \"llama_cpp.llama_sampler_chain_add(sampler_chain, llama_cpp.llama_sampler_init_dist(1234))  # Final \\\"dist\\\" sampler\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   
\"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"7\\n\",\n      \"[' j', ' jumped']\\n\",\n      \"8\\n\",\n      \"[' j over', ' jumped over']\\n\",\n      \"9\\n\",\n      \"[' j over the', ' jumped over the']\\n\",\n      \"10\\n\",\n      \"[' j over the lazy', ' jumped over the lazy']\\n\",\n      \"11\\n\",\n      \"[' j over the lazy dog', ' jumped over the lazy dog']\\n\",\n      \"12\\n\",\n      \"[' j over the lazy dog.', ' jumped over the lazy dog\\\\n']\\n\",\n      \"13\\n\",\n      \"[' j over the lazy dog. También', ' jumped over the lazy dog\\\\nGroupLayout']\\n\",\n      \"14\\n\",\n      \"[' j over the lazy dog. También:', ' jumped over the lazy dog\\\\nGroupLayouting']\\n\",\n      \"15\\n\",\n      \"[' j over the lazy dog. También: is', ' jumped over the lazy dog\\\\nGroupLayouting is']\\n\",\n      \"16\\n\",\n      \"[' j over the lazy dog. También: is a', ' jumped over the lazy dog\\\\nGroupLayouting is a']\\n\",\n      \"17\\n\",\n      \"[' j over the lazy dog. También: is a technique', ' jumped over the lazy dog\\\\nGroupLayouting is a common']\\n\",\n      \"18\\n\",\n      \"[' j over the lazy dog. También: is a technique practice', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice']\\n\",\n      \"19\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in']\\n\",\n      \"20\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in the', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the']\\n\",\n      \"21\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in the real', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media']\\n\",\n      \"22\\n\",\n      \"[' j over the lazy dog. 
También: is a technique practice in the real-', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media industry']\\n\",\n      \"23\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in the real-.', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media industry.']\\n\",\n      \"24\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in the real-. We', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media industry. However']\\n\",\n      \"25\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in the real-. We,', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media industry. However,']\\n\",\n      \"26\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in the real-. We, when', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media industry. However, there']\\n\",\n      \"27\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in the real-. We, when is', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media industry. However, there has']\\n\",\n      \"28\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in the real-. We, when is been', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media industry. However, there has been']\\n\",\n      \"29\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in the real-. We, when is been little', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media industry. However, there has been little']\\n\",\n      \"30\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in the real-. We, when is been little research', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media industry. 
However, there has been little emp']\\n\",\n      \"31\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in the real-. We, when is been little researchirical', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media industry. However, there has been little empirical']\\n\",\n      \"32\\n\",\n      \"[' j over the lazy dog. También: is a technique practice in the real-. We, when is been little researchirical research', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media industry. However, there has been little empirical research']\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"streams = [\\\"\\\"] * n_parallel\\n\",\n    \"i_batch = [batch.n_tokens - 1] * n_parallel\\n\",\n    \"\\n\",\n    \"n_cur = batch.n_tokens\\n\",\n    \"n_decode = 0\\n\",\n    \"\\n\",\n    \"while n_cur <= n_len:\\n\",\n    \"    batch.n_tokens = 0\\n\",\n    \"    for i in range(n_parallel):\\n\",\n    \"        if i_batch[i] < 0:\\n\",\n    \"            continue\\n\",\n    \"\\n\",\n    \"        # Sample the next token using the sampler chain\\n\",\n    \"        new_token_id = llama_cpp.llama_sampler_sample(sampler_chain, ctx, -1)\\n\",\n    \"\\n\",\n    \"        if new_token_id == llama_cpp.llama_token_eos(ctx) or n_cur == n_len:\\n\",\n    \"            i_batch[i] = -1\\n\",\n    \"            continue\\n\",\n    \"\\n\",\n    \"        buf = (ctypes.c_char * 32)()\\n\",\n    \"        \\n\",\n    \"        # Convert token ID to text\\n\",\n    \"        outlen = llama_cpp.llama_token_to_piece(model, new_token_id, buf, len(buf), 0, False)\\n\",\n    \"        streams[i] += bytes(buf[:outlen]).decode(\\\"utf-8\\\")\\n\",\n    \"\\n\",\n    \"        batch.token[batch.n_tokens] = new_token_id\\n\",\n    \"        batch.pos[batch.n_tokens] = n_cur\\n\",\n    \"        batch.seq_id[batch.n_tokens][0] = i\\n\",\n    \"        batch.n_seq_id[batch.n_tokens] = 1\\n\",\n    \"        
batch.logits[batch.n_tokens] = True\\n\",\n    \"\\n\",\n    \"        i_batch[i] = batch.n_tokens\\n\",\n    \"        batch.n_tokens += 1\\n\",\n    \"        n_decode += 1\\n\",\n    \"\\n\",\n    \"    if batch.n_tokens == 0:\\n\",\n    \"        break\\n\",\n    \"\\n\",\n    \"    n_cur += 1\\n\",\n    \"\\n\",\n    \"    if llama_cpp.llama_decode(ctx, batch) != 0:\\n\",\n    \"        print(\\\"Error decoding\\\", flush=True)\\n\",\n    \"        break\\n\",\n    \"    print(n_cur)\\n\",\n    \"    print(streams)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[' j over the lazy dog. También: is a technique practice in the real-. We, when is been little researchirical research', ' jumped over the lazy dog\\\\nGroupLayouting is a common practice in the media industry. However, there has been little empirical research']\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(streams)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"llama_cpp.llama_batch_free(batch)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"llama_cpp.llama_free(ctx)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"llama_cpp.llama_free_model(model)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"llama_cpp.llama_backend_free()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   
\"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.12.1\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/notebooks/Clients.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"<OpenAIObject text_completion id=cmpl-ad3ba53d-407c-466b-bd5f-97cb8987af83 at 0x7f6adc12d900> JSON: {\\n\",\n       \"  \\\"choices\\\": [\\n\",\n       \"    {\\n\",\n       \"      \\\"finish_reason\\\": \\\"length\\\",\\n\",\n       \"      \\\"index\\\": 0,\\n\",\n       \"      \\\"logprobs\\\": null,\\n\",\n       \"      \\\"text\\\": \\\" over the lazy dog.\\\"\\n\",\n       \"    }\\n\",\n       \"  ],\\n\",\n       \"  \\\"created\\\": 1680960690,\\n\",\n       \"  \\\"id\\\": \\\"cmpl-ad3ba53d-407c-466b-bd5f-97cb8987af83\\\",\\n\",\n       \"  \\\"model\\\": \\\"models/ggml-alpaca.bin\\\",\\n\",\n       \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n       \"  \\\"usage\\\": {\\n\",\n       \"    \\\"completion_tokens\\\": 5,\\n\",\n       \"    \\\"prompt_tokens\\\": 8,\\n\",\n       \"    \\\"total_tokens\\\": 13\\n\",\n       \"  }\\n\",\n       \"}\"\n      ]\n     },\n     \"execution_count\": 1,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import openai\\n\",\n    \"\\n\",\n    \"openai.api_key = \\\"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\\\"  # can be anything\\n\",\n    \"openai.api_base = \\\"http://100.64.159.73:8000/v1\\\"\\n\",\n    \"\\n\",\n    \"openai.Completion.create(\\n\",\n    \"    model=\\\"text-davinci-003\\\",  # currently can be anything\\n\",\n    \"    prompt=\\\"The quick brown fox jumps\\\",\\n\",\n    \"    max_tokens=5,\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"' over the lazy dog'\"\n      ]\n     },\n     \"execution_count\": 2,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   
\"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"os.environ[\\\"OPENAI_API_KEY\\\"] = (\\n\",\n    \"    \\\"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\\\"  # can be anything\\n\",\n    \")\\n\",\n    \"os.environ[\\\"OPENAI_API_BASE\\\"] = \\\"http://100.64.159.73:8000/v1\\\"\\n\",\n    \"\\n\",\n    \"from langchain.llms import OpenAI\\n\",\n    \"\\n\",\n    \"llms = OpenAI()\\n\",\n    \"llms(\\n\",\n    \"    prompt=\\\"The quick brown fox jumps\\\",\\n\",\n    \"    stop=[\\\".\\\", \\\"\\\\n\\\"],\\n\",\n    \")\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \".venv\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.10\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/notebooks/Functions.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Functions\\n\",\n    \"\\n\",\n    \"The OpenAI-compatible web server in `llama-cpp-python` supports function calling.\\n\",\n    \"\\n\",\n    \"Function calling allows API clients to specify a schema that gives the model a format it should respond in.\\n\",\n    \"Function calling in `llama-cpp-python` works by combining models pretrained for function calling such as [`functionary`](https://huggingface.co/meetkai) with constrained sampling to produce a response that is compatible with the schema.\\n\",\n    \"\\n\",\n    \"Note, however, that this makes a schema-compatible response more likely but does not guarantee it.\\n\",\n    \"\\n\",\n    \"## Requirements\\n\",\n    \"\\n\",\n    \"Before we begin, you will need the following:\\n\",\n    \"\\n\",\n    \"- A running `llama-cpp-python` server with a function-calling-compatible model. [See here](https://llama-cpp-python.readthedocs.io/en/latest/server/#function-calling)\\n\",\n    \"- The OpenAI Python Client `pip install openai`\\n\",\n    \"- (Optional) The Instructor Python Library `pip install instructor`\\n\",\n    \"\\n\",\n    \"## Function Calling with OpenAI Python Client\\n\",\n    \"\\n\",\n    \"We'll start with a basic demo that only uses the OpenAI Python Client.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"ChatCompletion(id='chatcmpl-a2d9eb9f-7354-472f-b6ad-4d7a807729a3', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content='The current weather in San Francisco is **72°F** (22°C).\\\\n ', role='assistant', function_call=None, tool_calls=None))], created=1699638365, model='gpt-3.5-turbo-1106', object='chat.completion', system_fingerprint=None, 
usage=CompletionUsage(completion_tokens=22, prompt_tokens=136, total_tokens=158))\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import openai\\n\",\n    \"import json\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"client = openai.OpenAI(\\n\",\n    \"    api_key=\\\"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\\\",  # can be anything\\n\",\n    \"    base_url=\\\"http://100.64.159.73:8000/v1\\\",  # NOTE: Replace with IP address and port of your llama-cpp-python server\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"# Example dummy function hard coded to return the same weather\\n\",\n    \"# In production, this could be your backend API or an external API\\n\",\n    \"def get_current_weather(location, unit=\\\"fahrenheit\\\"):\\n\",\n    \"    \\\"\\\"\\\"Get the current weather in a given location\\\"\\\"\\\"\\n\",\n    \"    if \\\"tokyo\\\" in location.lower():\\n\",\n    \"        return json.dumps({\\\"location\\\": \\\"Tokyo\\\", \\\"temperature\\\": \\\"10\\\", \\\"unit\\\": \\\"celsius\\\"})\\n\",\n    \"    elif \\\"san francisco\\\" in location.lower():\\n\",\n    \"        return json.dumps(\\n\",\n    \"            {\\\"location\\\": \\\"San Francisco\\\", \\\"temperature\\\": \\\"72\\\", \\\"unit\\\": \\\"fahrenheit\\\"}\\n\",\n    \"        )\\n\",\n    \"    elif \\\"paris\\\" in location.lower():\\n\",\n    \"        return json.dumps({\\\"location\\\": \\\"Paris\\\", \\\"temperature\\\": \\\"22\\\", \\\"unit\\\": \\\"celsius\\\"})\\n\",\n    \"    else:\\n\",\n    \"        return json.dumps({\\\"location\\\": location, \\\"temperature\\\": \\\"unknown\\\"})\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def run_conversation():\\n\",\n    \"    # Step 1: send the conversation and available functions to the model\\n\",\n    \"    messages = [\\n\",\n    \"        {\\n\",\n    \"            \\\"role\\\": \\\"user\\\",\\n\",\n    \"            \\\"content\\\": \\\"What's the weather like in San Francisco, Tokyo, and Paris?\\\",\\n\",\n    \"        
}\\n\",\n    \"    ]\\n\",\n    \"    tools = [\\n\",\n    \"        {\\n\",\n    \"            \\\"type\\\": \\\"function\\\",\\n\",\n    \"            \\\"function\\\": {\\n\",\n    \"                \\\"name\\\": \\\"get_current_weather\\\",\\n\",\n    \"                \\\"description\\\": \\\"Get the current weather in a given location\\\",\\n\",\n    \"                \\\"parameters\\\": {\\n\",\n    \"                    \\\"type\\\": \\\"object\\\",\\n\",\n    \"                    \\\"properties\\\": {\\n\",\n    \"                        \\\"location\\\": {\\n\",\n    \"                            \\\"type\\\": \\\"string\\\",\\n\",\n    \"                            \\\"description\\\": \\\"The city and state, e.g. San Francisco, CA\\\",\\n\",\n    \"                        },\\n\",\n    \"                        \\\"unit\\\": {\\\"type\\\": \\\"string\\\", \\\"enum\\\": [\\\"celsius\\\", \\\"fahrenheit\\\"]},\\n\",\n    \"                    },\\n\",\n    \"                    \\\"required\\\": [\\\"location\\\"],\\n\",\n    \"                },\\n\",\n    \"            },\\n\",\n    \"        }\\n\",\n    \"    ]\\n\",\n    \"    response = client.chat.completions.create(\\n\",\n    \"        model=\\\"gpt-3.5-turbo-1106\\\",\\n\",\n    \"        messages=messages,\\n\",\n    \"        tools=tools,\\n\",\n    \"        tool_choice=\\\"auto\\\",  # auto is default, but we'll be explicit\\n\",\n    \"    )\\n\",\n    \"    response_message = response.choices[0].message\\n\",\n    \"    tool_calls = response_message.tool_calls\\n\",\n    \"    # Step 2: check if the model wanted to call a function\\n\",\n    \"    if tool_calls:\\n\",\n    \"        # Step 3: call the function\\n\",\n    \"        # Note: the JSON response may not always be valid; be sure to handle errors\\n\",\n    \"        available_functions = {\\n\",\n    \"            \\\"get_current_weather\\\": get_current_weather,\\n\",\n    \"        }  # only one function in this example, but 
you can have multiple\\n\",\n    \"        messages.append(response_message)  # extend conversation with assistant's reply\\n\",\n    \"        # Step 4: send the info for each function call and function response to the model\\n\",\n    \"        for tool_call in tool_calls:\\n\",\n    \"            function_name = tool_call.function.name\\n\",\n    \"            function_to_call = available_functions[function_name]\\n\",\n    \"            function_args = json.loads(tool_call.function.arguments)\\n\",\n    \"            function_response = function_to_call(\\n\",\n    \"                location=function_args.get(\\\"location\\\"),\\n\",\n    \"                unit=function_args.get(\\\"unit\\\"),\\n\",\n    \"            )\\n\",\n    \"            messages.append(\\n\",\n    \"                {\\n\",\n    \"                    \\\"tool_call_id\\\": tool_call.id,\\n\",\n    \"                    \\\"role\\\": \\\"tool\\\",\\n\",\n    \"                    \\\"name\\\": function_name,\\n\",\n    \"                    \\\"content\\\": function_response,\\n\",\n    \"                }\\n\",\n    \"            )  # extend conversation with function response\\n\",\n    \"        second_response = client.chat.completions.create(\\n\",\n    \"            model=\\\"gpt-3.5-turbo-1106\\\",\\n\",\n    \"            messages=messages,\\n\",\n    \"        )  # get a new response from the model where it can see the function response\\n\",\n    \"        return second_response\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"print(run_conversation())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Function Calling with Instructor\\n\",\n    \"\\n\",\n    \"The above example is a bit verbose and requires you to manually verify the schema.\\n\",\n    \"\\n\",\n    \"For our next examples we'll use the `instructor` library to simplify the process and accomplish a number of different tasks with function calling.\\n\",\n    \"\\n\",\n    
\"You'll first need to install the [`instructor`](https://github.com/jxnl/instructor/) library.\\n\",\n    \"\\n\",\n    \"You can do so by running the following command in your terminal:\\n\",\n    \"\\n\",\n    \"```bash\\n\",\n    \"pip install instructor\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"Below we'll go through a few basic examples taken directly from the [instructor cookbook](https://jxnl.github.io/instructor/).\\n\",\n    \"\\n\",\n    \"## Basic Usage\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"name='Jason' age=25\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import instructor\\n\",\n    \"from pydantic import BaseModel\\n\",\n    \"\\n\",\n    \"# Enables `response_model`\\n\",\n    \"client = instructor.patch(client=client)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"class UserDetail(BaseModel):\\n\",\n    \"    name: str\\n\",\n    \"    age: int\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"user = client.chat.completions.create(\\n\",\n    \"    model=\\\"gpt-3.5-turbo\\\",\\n\",\n    \"    response_model=UserDetail,\\n\",\n    \"    messages=[\\n\",\n    \"        {\\\"role\\\": \\\"user\\\", \\\"content\\\": \\\"Extract Jason is 25 years old\\\"},\\n\",\n    \"    ],\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"assert isinstance(user, UserDetail)\\n\",\n    \"assert user.name == \\\"Jason\\\"\\n\",\n    \"assert user.age == 25\\n\",\n    \"\\n\",\n    \"print(user)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Text Classification\\n\",\n    \"\\n\",\n    \"### Single-Label Classification\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"class_label=<Labels.SPAM: 
'spam'>\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import enum\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"class Labels(str, enum.Enum):\\n\",\n    \"    \\\"\\\"\\\"Enumeration for single-label text classification.\\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"    SPAM = \\\"spam\\\"\\n\",\n    \"    NOT_SPAM = \\\"not_spam\\\"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"class SinglePrediction(BaseModel):\\n\",\n    \"    \\\"\\\"\\\"\\n\",\n    \"    Class for a single class label prediction.\\n\",\n    \"    \\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"    class_label: Labels\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def classify(data: str) -> SinglePrediction:\\n\",\n    \"    \\\"\\\"\\\"Perform single-label classification on the input text.\\\"\\\"\\\"\\n\",\n    \"    return client.chat.completions.create(\\n\",\n    \"        model=\\\"gpt-3.5-turbo-0613\\\",\\n\",\n    \"        response_model=SinglePrediction,\\n\",\n    \"        messages=[\\n\",\n    \"            {\\n\",\n    \"                \\\"role\\\": \\\"user\\\",\\n\",\n    \"                \\\"content\\\": f\\\"Classify the following text: {data}\\\",\\n\",\n    \"            },\\n\",\n    \"        ],\\n\",\n    \"    )  # type: ignore\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"prediction = classify(\\\"Hello there I'm a Nigerian prince and I want to give you money\\\")\\n\",\n    \"assert prediction.class_label == Labels.SPAM\\n\",\n    \"print(prediction)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Multi-Label Classification\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"class_labels=[<MultiLabels.TECH_ISSUE: 'tech_issue'>, <MultiLabels.BILLING: 'billing'>]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from typing import List\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"# 
Define Enum class for multiple labels\\n\",\n    \"class MultiLabels(str, enum.Enum):\\n\",\n    \"    TECH_ISSUE = \\\"tech_issue\\\"\\n\",\n    \"    BILLING = \\\"billing\\\"\\n\",\n    \"    GENERAL_QUERY = \\\"general_query\\\"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"# Define the multi-class prediction model\\n\",\n    \"class MultiClassPrediction(BaseModel):\\n\",\n    \"    \\\"\\\"\\\"\\n\",\n    \"    Class for a multi-class label prediction.\\n\",\n    \"    \\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"    class_labels: List[MultiLabels]\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def multi_classify(data: str) -> MultiClassPrediction:\\n\",\n    \"    \\\"\\\"\\\"Perform multi-label classification on the input text.\\\"\\\"\\\"\\n\",\n    \"    return client.chat.completions.create(\\n\",\n    \"        model=\\\"gpt-3.5-turbo-0613\\\",\\n\",\n    \"        response_model=MultiClassPrediction,\\n\",\n    \"        messages=[\\n\",\n    \"            {\\n\",\n    \"                \\\"role\\\": \\\"user\\\",\\n\",\n    \"                \\\"content\\\": f\\\"Classify the following support ticket: {data}\\\",\\n\",\n    \"            },\\n\",\n    \"        ],\\n\",\n    \"    )  # type: ignore\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"# Test multi-label classification\\n\",\n    \"ticket = \\\"My account is locked and I can't access my billing info.\\\"\\n\",\n    \"prediction = multi_classify(ticket)\\n\",\n    \"print(prediction)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Self-Critique\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"question='What is the meaning of life?' 
answer='According to the Devil, the meaning of life is to live a life of sin and debauchery.'\\n\",\n      \"1 validation error for QuestionAnswerNoEvil\\n\",\n      \"answer\\n\",\n      \"  Assertion failed, The statement promotes sin and debauchery, which can be considered objectionable. [type=assertion_error, input_value='According to the Devil, ... of sin and debauchery.', input_type=str]\\n\",\n      \"    For further information visit https://errors.pydantic.dev/2.3/v/assertion_error\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from typing_extensions import Annotated\\n\",\n    \"from pydantic import BaseModel, BeforeValidator\\n\",\n    \"\\n\",\n    \"from instructor import llm_validator\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"question = \\\"What is the meaning of life?\\\"\\n\",\n    \"context = \\\"The according to the devil the meaning of live is to live a life of sin and debauchery.\\\"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"class QuestionAnswer(BaseModel):\\n\",\n    \"    question: str\\n\",\n    \"    answer: str\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"qa: QuestionAnswer = client.chat.completions.create(\\n\",\n    \"    model=\\\"gpt-3.5-turbo\\\",\\n\",\n    \"    response_model=QuestionAnswer,\\n\",\n    \"    messages=[\\n\",\n    \"        {\\n\",\n    \"            \\\"role\\\": \\\"system\\\",\\n\",\n    \"            \\\"content\\\": \\\"You are a system that answers questions based on the context. 
answer exactly what the question asks using the context.\\\",\\n\",\n    \"        },\\n\",\n    \"        {\\n\",\n    \"            \\\"role\\\": \\\"user\\\",\\n\",\n    \"            \\\"content\\\": f\\\"using the context: {context}\\\\n\\\\nAnswer the following question: {question}\\\",\\n\",\n    \"        },\\n\",\n    \"    ],\\n\",\n    \")\\n\",\n    \"print(qa)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"class QuestionAnswerNoEvil(BaseModel):\\n\",\n    \"    question: str\\n\",\n    \"    answer: Annotated[\\n\",\n    \"        str,\\n\",\n    \"        BeforeValidator(\\n\",\n    \"            llm_validator(\\\"don't say objectionable things\\\", allow_override=True)\\n\",\n    \"        ),\\n\",\n    \"    ]\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"try:\\n\",\n    \"    qa: QuestionAnswerNoEvil = client.chat.completions.create(\\n\",\n    \"        model=\\\"gpt-3.5-turbo\\\",\\n\",\n    \"        response_model=QuestionAnswerNoEvil,\\n\",\n    \"        messages=[\\n\",\n    \"            {\\n\",\n    \"                \\\"role\\\": \\\"system\\\",\\n\",\n    \"                \\\"content\\\": \\\"You are a system that answers questions based on the context. 
answer exactly what the question asks using the context.\\\",\\n\",\n    \"            },\\n\",\n    \"            {\\n\",\n    \"                \\\"role\\\": \\\"user\\\",\\n\",\n    \"                \\\"content\\\": f\\\"using the context: {context}\\\\n\\\\nAnswer the following question: {question}\\\",\\n\",\n    \"            },\\n\",\n    \"        ],\\n\",\n    \"    )\\n\",\n    \"except Exception as e:\\n\",\n    \"    print(e)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Answering Questions with Validated Citations\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 42,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"question='What did the author do during college?' answer=[Fact(fact='The author, Jason Liu, studied Computational Mathematics and Physics in university.', substring_quote=['Computational Mathematics'])]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import re\\n\",\n    \"from typing import List\\n\",\n    \"\\n\",\n    \"from pydantic import Field, BaseModel, model_validator, FieldValidationInfo\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"class Fact(BaseModel):\\n\",\n    \"    fact: str = Field(...)\\n\",\n    \"    substring_quote: List[str] = Field(...)\\n\",\n    \"\\n\",\n    \"    @model_validator(mode=\\\"after\\\")\\n\",\n    \"    def validate_sources(self, info: FieldValidationInfo) -> \\\"Fact\\\":\\n\",\n    \"        text_chunks = info.context.get(\\\"text_chunk\\\", None)\\n\",\n    \"        spans = list(self.get_spans(text_chunks))\\n\",\n    \"        self.substring_quote = [text_chunks[span[0] : span[1]] for span in spans]\\n\",\n    \"        return self\\n\",\n    \"\\n\",\n    \"    def get_spans(self, context):\\n\",\n    \"        for quote in self.substring_quote:\\n\",\n    \"            yield from self._get_span(quote, context)\\n\",\n    
\"\\n\",\n    \"    def _get_span(self, quote, context):\\n\",\n    \"        for match in re.finditer(re.escape(quote), context):\\n\",\n    \"            yield match.span()\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"class QuestionAnswer(BaseModel):\\n\",\n    \"    question: str = Field(...)\\n\",\n    \"    answer: List[Fact] = Field(...)\\n\",\n    \"\\n\",\n    \"    @model_validator(mode=\\\"after\\\")\\n\",\n    \"    def validate_sources(self) -> \\\"QuestionAnswer\\\":\\n\",\n    \"        self.answer = [fact for fact in self.answer if len(fact.substring_quote) > 0]\\n\",\n    \"        return self\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def ask_ai(question: str, context: str) -> QuestionAnswer:\\n\",\n    \"    return client.chat.completions.create(\\n\",\n    \"        model=\\\"gpt-3.5-turbo-0613\\\",\\n\",\n    \"        temperature=0.0,\\n\",\n    \"        response_model=QuestionAnswer,\\n\",\n    \"        messages=[\\n\",\n    \"            {\\n\",\n    \"                \\\"role\\\": \\\"system\\\",\\n\",\n    \"                \\\"content\\\": \\\"You are a world class algorithm to answer questions with correct and exact citations.\\\",\\n\",\n    \"            },\\n\",\n    \"            {\\\"role\\\": \\\"user\\\", \\\"content\\\": f\\\"{context}\\\"},\\n\",\n    \"            {\\\"role\\\": \\\"user\\\", \\\"content\\\": f\\\"Question: {question}\\\"},\\n\",\n    \"        ],\\n\",\n    \"        validation_context={\\\"text_chunk\\\": context},\\n\",\n    \"    )\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"question = \\\"What did the author do during college?\\\"\\n\",\n    \"context = \\\"\\\"\\\"\\n\",\n    \"My name is Jason Liu, and I grew up in Toronto Canada but I was born in China.\\n\",\n    \"I went to an arts high school but in university I studied Computational Mathematics and physics.\\n\",\n    \"As part of coop I worked at many companies including Stitchfix, Facebook.\\n\",\n    \"I also started the Data Science club at the 
University of Waterloo and I was the president of the club for 2 years.\\n\",\n    \"\\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"qa = ask_ai(question, context)\\n\",\n    \"print(qa)\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"python-3.8.10\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.11.5+\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/notebooks/Guidance.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div id=\\\"guidance-stop-button-faa51639-e0a4-43c6-a6d4-d5853c2ec764\\\" style=\\\"cursor: pointer; margin: 0px; display: none; float: right; padding: 3px; border-radius: 4px 4px 4px 4px; border: 0px solid rgba(127, 127, 127, 1); padding-left: 10px; padding-right: 10px; font-size: 13px; background-color: rgba(127, 127, 127, 0.25);\\\">Stop program</div><div id=\\\"guidance-content-faa51639-e0a4-43c6-a6d4-d5853c2ec764\\\"><pre style='margin: 0px; padding: 0px; padding-left: 8px; margin-left: -8px; border-radius: 0px; border-left: 1px solid rgba(127, 127, 127, 0.2); white-space: pre-wrap; font-family: ColfaxAI, Arial; font-size: 15px; line-height: 23px;'>Tweak this proverb to apply to model instructions instead.\\n\",\n       \"\\n\",\n       \"<span style='background-color: rgba(0, 138.56128016, 250.76166089, 0.25); display: inline;' title='{{proverb}}'>Where there is no guidance, a people falls,\\n\",\n       \"but in an abundance of counselors there is safety.</span>\\n\",\n       \"- <span style='background-color: rgba(0, 138.56128016, 250.76166089, 0.25); display: inline;' title='{{book}}'>Proverbs</span> <span style='background-color: rgba(0, 138.56128016, 250.76166089, 0.25); display: inline;' title='{{chapter}}'>11</span>:<span style='background-color: rgba(0, 138.56128016, 250.76166089, 0.25); display: inline;' title='{{verse}}'>14</span>\\n\",\n       \"\\n\",\n       \"UPDATED\\n\",\n       \"Where there is no guidance<span style='background-color: rgba(0, 165, 0, 0.25); opacity: 1.0; display: inline;' title='{{gen &#x27;rewrite&#x27; stop=&quot;\\\\n-&quot;}}'> for assembling a model, people will struggle,\\n\",\n       \"but with clear instructions, the process becomes safe and successful.</span>\\n\",\n       \"- GPT <span style='background-color: rgba(0, 165, 
0, 0.25); opacity: 1.0; display: inline;' title='{{gen &#x27;chapter&#x27;}}'>2 (updated)</span>:<span style='background-color: rgba(0, 165, 0, 0.25); opacity: 1.0; display: inline;' title='{{gen &#x27;verse&#x27;}}'> Proverbs 11:14</span></pre></div>\\n\",\n       \"<script type=\\\"text/javascript\\\">(()=>{var t={296:(t,e,n)=>{var i=NaN,o=\\\"[object Symbol]\\\",r=/^\\\\s+|\\\\s+$/g,a=/^[-+]0x[0-9a-f]+$/i,s=/^0b[01]+$/i,c=/^0o[0-7]+$/i,d=parseInt,u=\\\"object\\\"==typeof n.g&&n.g&&n.g.Object===Object&&n.g,l=\\\"object\\\"==typeof self&&self&&self.Object===Object&&self,f=u||l||Function(\\\"return this\\\")(),h=Object.prototype.toString,p=Math.max,m=Math.min,g=function(){return f.Date.now()};function b(t){var e=typeof t;return!!t&&(\\\"object\\\"==e||\\\"function\\\"==e)}function y(t){if(\\\"number\\\"==typeof t)return t;if(function(t){return\\\"symbol\\\"==typeof t||function(t){return!!t&&\\\"object\\\"==typeof t}(t)&&h.call(t)==o}(t))return i;if(b(t)){var e=\\\"function\\\"==typeof t.valueOf?t.valueOf():t;t=b(e)?e+\\\"\\\":e}if(\\\"string\\\"!=typeof t)return 0===t?t:+t;t=t.replace(r,\\\"\\\");var n=s.test(t);return n||c.test(t)?d(t.slice(2),n?2:8):a.test(t)?i:+t}t.exports=function(t,e,n){var i,o,r,a,s,c,d=0,u=!1,l=!1,f=!0;if(\\\"function\\\"!=typeof t)throw new TypeError(\\\"Expected a function\\\");function h(e){var n=i,r=o;return i=o=void 0,d=e,a=t.apply(r,n)}function v(t){var n=t-c;return void 0===c||n>=e||n<0||l&&t-d>=r}function _(){var t=g();if(v(t))return w(t);s=setTimeout(_,function(t){var n=e-(t-c);return l?m(n,r-(t-d)):n}(t))}function w(t){return s=void 0,f&&i?h(t):(i=o=void 0,a)}function j(){var t=g(),n=v(t);if(i=arguments,o=this,c=t,n){if(void 0===s)return function(t){return d=t,s=setTimeout(_,e),u?h(t):a}(c);if(l)return s=setTimeout(_,e),h(c)}return void 0===s&&(s=setTimeout(_,e)),a}return e=y(e)||0,b(n)&&(u=!!n.leading,r=(l=\\\"maxWait\\\"in n)?p(y(n.maxWait)||0,e):r,f=\\\"trailing\\\"in n?!!n.trailing:f),j.cancel=function(){void 
0!==s&&clearTimeout(s),d=0,i=c=o=s=void 0},j.flush=function(){return void 0===s?a:w(g())},j}},777:t=>{var e,n,i=Math.max,o=(e=function(t,e){return function(t,e,n){if(\\\"function\\\"!=typeof t)throw new TypeError(\\\"Expected a function\\\");return setTimeout((function(){t.apply(void 0,n)}),1)}(t,0,e)},n=i(void 0===n?e.length-1:n,0),function(){for(var t=arguments,o=-1,r=i(t.length-n,0),a=Array(r);++o<r;)a[o]=t[n+o];o=-1;for(var s=Array(n+1);++o<n;)s[o]=t[o];return s[n]=a,function(t,e,n){switch(n.length){case 0:return t.call(e);case 1:return t.call(e,n[0]);case 2:return t.call(e,n[0],n[1]);case 3:return t.call(e,n[0],n[1],n[2])}return t.apply(e,n)}(e,this,s)});t.exports=o}},e={};function n(i){var o=e[i];if(void 0!==o)return o.exports;var r=e[i]={exports:{}};return t[i](r,r.exports,n),r.exports}n.n=t=>{var e=t&&t.__esModule?()=>t.default:()=>t;return n.d(e,{a:e}),e},n.d=(t,e)=>{for(var i in e)n.o(e,i)&&!n.o(t,i)&&Object.defineProperty(t,i,{enumerable:!0,get:e[i]})},n.g=function(){if(\\\"object\\\"==typeof globalThis)return globalThis;try{return this||new Function(\\\"return this\\\")()}catch(t){if(\\\"object\\\"==typeof window)return window}}(),n.o=(t,e)=>Object.prototype.hasOwnProperty.call(t,e),(()=>{\\\"use strict\\\";const t=t=>{const e=new Set;do{for(const n of Reflect.ownKeys(t))e.add([t,n])}while((t=Reflect.getPrototypeOf(t))&&t!==Object.prototype);return e};function e(e,{include:n,exclude:i}={}){const o=t=>{const e=e=>\\\"string\\\"==typeof e?t===e:e.test(t);return n?n.some(e):!i||!i.some(e)};for(const[n,i]of t(e.constructor.prototype)){if(\\\"constructor\\\"===i||!o(i))continue;const t=Reflect.getOwnPropertyDescriptor(n,i);t&&\\\"function\\\"==typeof t.value&&(e[i]=e[i].bind(e))}return e}var i=n(777),o=n.n(i),r=n(296),a=n.n(r);class s{constructor(t,n){e(this),this.interfaceId=t,this.callbackMap={},this.data={},this.pendingData={},this.jcomm=new 
c(\\\"guidance_interface_target_\\\"+this.interfaceId,this.updateData,\\\"open\\\"),this.debouncedSendPendingData500=a()(this.sendPendingData,500),this.debouncedSendPendingData1000=a()(this.sendPendingData,1e3),n&&o()(n)}send(t,e){this.addPendingData(t,e),this.sendPendingData()}sendEvent(t){for(const e of Object.keys(t))this.addPendingData(e,t[e]);this.sendPendingData()}debouncedSendEvent500(t){for(const e of Object.keys(t))this.addPendingData(e,t[e]);this.debouncedSendPendingData500()}debouncedSend500(t,e){this.addPendingData(t,e),this.debouncedSendPendingData500()}debouncedSend1000(t,e){this.addPendingData(t,e),this.debouncedSendPendingData1000()}addPendingData(t,e){Array.isArray(t)||(t=[t]);for(const n in t)this.pendingData[t[n]]=e}updateData(t){t=JSON.parse(t.data);for(const e in t)this.data[e]=t[e];for(const e in t)e in this.callbackMap&&this.callbackMap[e](this.data[e])}subscribe(t,e){this.callbackMap[t]=e,o()((e=>this.callbackMap[t](this.data[t])))}sendPendingData(){this.jcomm.send_data(this.pendingData),this.pendingData={}}}class c{constructor(t,e,n=\\\"open\\\"){this._fire_callback=this._fire_callback.bind(this),this._register=this._register.bind(this),this.jcomm=void 0,this.callback=e,void 0!==window.Jupyter?\\\"register\\\"===n?Jupyter.notebook.kernel.comm_manager.register_target(t,this._register):(this.jcomm=Jupyter.notebook.kernel.comm_manager.new_comm(t),this.jcomm.on_msg(this._fire_callback)):void 0!==window._mgr&&(\\\"register\\\"===n?window._mgr.widgetManager.proxyKernel.registerCommTarget(t,this._register):(this.jcomm=window._mgr.widgetManager.proxyKernel.createComm(t),this.jcomm.open({},\\\"\\\"),this.jcomm.onMsg=this._fire_callback))}send_data(t){void 0!==this.jcomm?this.jcomm.send(t):console.error(\\\"Jupyter comm module not yet loaded! 
So we can't send the message.\\\")}_register(t,e){this.jcomm=t,this.jcomm.on_msg(this._fire_callback)}_fire_callback(t){this.callback(t.content.data)}}class d{constructor(t,n){e(this),this.id=t,this.comm=new s(t),this.comm.subscribe(\\\"append\\\",this.appendData),this.comm.subscribe(\\\"replace\\\",this.replaceData),this.comm.subscribe(\\\"event\\\",this.eventOccurred),this.element=document.getElementById(\\\"guidance-content-\\\"+t),this.stop_button=document.getElementById(\\\"guidance-stop-button-\\\"+t),this.stop_button.onclick=()=>this.comm.send(\\\"event\\\",\\\"stop\\\")}appendData(t){t&&(this.stop_button.style.display=\\\"inline-block\\\",this.element.innerHTML+=t)}replaceData(t){t&&(this.stop_button.style.display=\\\"inline-block\\\",this.element.innerHTML=t)}eventOccurred(t){\\\"complete\\\"===t&&(this.stop_button.style.display=\\\"none\\\")}}window._guidanceDisplay=function(t,e){return new d(t,e)}})()})();; window._guidanceDisplay(\\\"faa51639-e0a4-43c6-a6d4-d5853c2ec764\\\");</script>\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"os.environ[\\\"OPENAI_API_KEY\\\"] = (\\n\",\n    \"    \\\"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\\\"  # can be anything\\n\",\n    \")\\n\",\n    \"os.environ[\\\"OPENAI_API_BASE\\\"] = \\\"http://100.64.159.73:8000/v1\\\"\\n\",\n    \"os.environ[\\\"OPENAI_API_HOST\\\"] = \\\"http://100.64.159.73:8000\\\"\\n\",\n    \"\\n\",\n    \"import guidance\\n\",\n    \"\\n\",\n    \"# set the default language model used to execute guidance programs\\n\",\n    \"guidance.llm = guidance.llms.OpenAI(\\\"text-davinci-003\\\", caching=False)\\n\",\n    \"\\n\",\n    \"# define a guidance program that adapts a proverb\\n\",\n    \"program = guidance(\\n\",\n    \"    \\\"\\\"\\\"Tweak this proverb to apply to model instructions instead.\\n\",\n    \"\\n\",\n    \"{{proverb}}\\n\",\n    \"- {{book}} 
{{chapter}}:{{verse}}\\n\",\n    \"\\n\",\n    \"UPDATED\\n\",\n    \"Where there is no guidance{{gen 'rewrite' stop=\\\"\\\\\\\\n-\\\"}}\\n\",\n    \"- GPT {{gen 'chapter'}}:{{gen 'verse'}}\\\"\\\"\\\"\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"# execute the program on a specific proverb\\n\",\n    \"executed_program = program(\\n\",\n    \"    proverb=\\\"Where there is no guidance, a people falls,\\\\nbut in an abundance of counselors there is safety.\\\",\\n\",\n    \"    book=\\\"Proverbs\\\",\\n\",\n    \"    chapter=11,\\n\",\n    \"    verse=14,\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \".venv\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.10\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/notebooks/Multimodal.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<div>\\n\",\n    \"    <img src=\\\"https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png\\\" width=\\\"500\\\"/>\\n\",\n    \"</div>\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{'text': 'Llama C++'}\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from openai import OpenAI\\n\",\n    \"\\n\",\n    \"client = OpenAI(base_url=\\\"http://localhost:8000/v1\\\", api_key=\\\"llama.cpp\\\")\\n\",\n    \"response = client.chat.completions.create(\\n\",\n    \"    model=\\\"gpt-4-vision-preview\\\",\\n\",\n    \"    messages=[\\n\",\n    \"        {\\n\",\n    \"            \\\"role\\\": \\\"user\\\",\\n\",\n    \"            \\\"content\\\": [\\n\",\n    \"                {\\n\",\n    \"                    \\\"type\\\": \\\"image_url\\\",\\n\",\n    \"                    \\\"image_url\\\": {\\n\",\n    \"                        \\\"url\\\": \\\"https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png\\\",\\n\",\n    \"                    },\\n\",\n    \"                },\\n\",\n    \"                {\\n\",\n    \"                    \\\"type\\\": \\\"text\\\",\\n\",\n    \"                    \\\"text\\\": \\\"What does the image say. 
Format your response as a json object with a single 'text' key.\\\",\\n\",\n    \"                },\\n\",\n    \"            ],\\n\",\n    \"        }\\n\",\n    \"    ],\\n\",\n    \"    response_format={\\n\",\n    \"        \\\"type\\\": \\\"json_object\\\",\\n\",\n    \"        \\\"schema\\\": {\\\"type\\\": \\\"object\\\", \\\"properties\\\": {\\\"text\\\": {\\\"type\\\": \\\"string\\\"}}},\\n\",\n    \"    },\\n\",\n    \")\\n\",\n    \"import json\\n\",\n    \"\\n\",\n    \"print(json.loads(response.choices[0].message.content))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \".venv\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.11.5+\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/notebooks/OpenHermesFunctionCalling.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"name\\\": \\\"get_article_details\\\",\\n\",\n      \"  \\\"description\\\": \\\"Get article details from unstructured article text.\\\\ndate_published: formatted as \\\\\\\"MM/DD/YYYY\\\\\\\"\\\",\\n\",\n      \"  \\\"parameters\\\": {\\n\",\n      \"    \\\"type\\\": \\\"object\\\",\\n\",\n      \"    \\\"properties\\\": {\\n\",\n      \"      \\\"title\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"authors\\\": {\\n\",\n      \"        \\\"type\\\": \\\"list[str]\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"short_summary\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"date_published\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"tags\\\": {\\n\",\n      \"        \\\"type\\\": \\\"list[str]\\\"\\n\",\n      \"      }\\n\",\n      \"    }\\n\",\n      \"  },\\n\",\n      \"  \\\"returns\\\": \\\"Article\\\"\\n\",\n      \"}\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import json\\n\",\n    \"import inspect\\n\",\n    \"from typing import get_type_hints\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"class Article:\\n\",\n    \"    pass\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"class Weather:\\n\",\n    \"    pass\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"class Directions:\\n\",\n    \"    pass\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def calculate_mortgage_payment(\\n\",\n    \"    loan_amount: int, interest_rate: float, loan_term: int\\n\",\n    \") -> float:\\n\",\n    \"    \\\"\\\"\\\"Get the monthly mortgage payment given an interest rate percentage.\\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"    # TODO: you must implement this to 
actually call it later\\n\",\n    \"    pass\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def get_article_details(\\n\",\n    \"    title: str,\\n\",\n    \"    authors: list[str],\\n\",\n    \"    short_summary: str,\\n\",\n    \"    date_published: str,\\n\",\n    \"    tags: list[str],\\n\",\n    \") -> Article:\\n\",\n    \"    '''Get article details from unstructured article text.\\n\",\n    \"    date_published: formatted as \\\"MM/DD/YYYY\\\"'''\\n\",\n    \"\\n\",\n    \"    # TODO: you must implement this to actually call it later\\n\",\n    \"    pass\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def get_weather(zip_code: str) -> Weather:\\n\",\n    \"    \\\"\\\"\\\"Get the current weather given a zip code.\\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"    # TODO: you must implement this to actually call it later\\n\",\n    \"    pass\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def get_directions(start: str, destination: str) -> Directions:\\n\",\n    \"    \\\"\\\"\\\"Get directions from Google Directions API.\\n\",\n    \"    start: start address as a string including zipcode (if any)\\n\",\n    \"    destination: end address as a string including zipcode (if any)\\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"    # TODO: you must implement this to actually call it later\\n\",\n    \"    pass\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def get_type_name(t):\\n\",\n    \"    name = str(t)\\n\",\n    \"    if \\\"list\\\" in name or \\\"dict\\\" in name:\\n\",\n    \"        return name\\n\",\n    \"    else:\\n\",\n    \"        return t.__name__\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def serialize_function_to_json(func):\\n\",\n    \"    signature = inspect.signature(func)\\n\",\n    \"    type_hints = get_type_hints(func)\\n\",\n    \"\\n\",\n    \"    function_info = {\\n\",\n    \"        \\\"name\\\": func.__name__,\\n\",\n    \"        \\\"description\\\": func.__doc__,\\n\",\n    \"        \\\"parameters\\\": {\\\"type\\\": \\\"object\\\", \\\"properties\\\": {}},\\n\",\n   
 \"        \\\"returns\\\": type_hints.get(\\\"return\\\", \\\"void\\\").__name__,\\n\",\n    \"    }\\n\",\n    \"\\n\",\n    \"    for name, _ in signature.parameters.items():\\n\",\n    \"        param_type = get_type_name(type_hints.get(name, type(None)))\\n\",\n    \"        function_info[\\\"parameters\\\"][\\\"properties\\\"][name] = {\\\"type\\\": param_type}\\n\",\n    \"\\n\",\n    \"    return json.dumps(function_info, indent=2)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"print(serialize_function_to_json(get_article_details))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import xml.etree.ElementTree as ET\\n\",\n    \"import re\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"def extract_function_calls(completion):\\n\",\n    \"    completion = completion.strip()\\n\",\n    \"    pattern = r\\\"(<multiplefunctions>(.*?)</multiplefunctions>)\\\"\\n\",\n    \"    match = re.search(pattern, completion, re.DOTALL)\\n\",\n    \"    if not match:\\n\",\n    \"        return None\\n\",\n    \"\\n\",\n    \"    multiplefn = match.group(1)\\n\",\n    \"    root = ET.fromstring(multiplefn)\\n\",\n    \"    functions = root.findall(\\\"functioncall\\\")\\n\",\n    \"    return [json.loads(fn.text) for fn in functions]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def generate_hermes_prompt(prompt, functions):\\n\",\n    \"    functions = \\\"\\\\n\\\\n\\\".join([serialize_function_to_json(fn) for fn in functions])\\n\",\n    \"    prompt = f\\\"\\\"\\\"<|im_start|>system\\n\",\n    \"You are a helpful assistant with access to the following functions:\\n\",\n    \"\\n\",\n    \"{functions}\\n\",\n    \"\\n\",\n    \"To use these functions respond with:\\n\",\n    \"<multiplefunctions>\\n\",\n    \"    <functioncall> {{\\\"name\\\": \\\"function_name\\\", \\\"arguments\\\": 
{{\\\"arg_1\\\": \\\"value_1\\\", \\\"arg_2\\\": value_2, ...}}}} </functioncall>\\n\",\n    \"    <functioncall> {{\\\"name\\\": \\\"function_name\\\", \\\"arguments\\\": {{\\\"arg_1\\\": \\\"value_1\\\", \\\"arg_2\\\": value_2, ...}}}} </functioncall>\\n\",\n    \"    ...\\n\",\n    \"</multiplefunctions>\\n\",\n    \"\\n\",\n    \"Edge cases you must handle:\\n\",\n    \"- If there are no functions that match the user request, you will respond politely that you cannot help.<|im_end|>\\n\",\n    \"<|im_start|>user\\n\",\n    \"{prompt}<|im_end|>\\n\",\n    \"<|im_start|>assistant\\\"\\\"\\\"\\n\",\n    \"    return prompt\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"<|im_start|>system\\n\",\n      \"You are a helpful assistant with access to the following functions:\\n\",\n      \"\\n\",\n      \"{\\n\",\n      \"  \\\"name\\\": \\\"get_weather\\\",\\n\",\n      \"  \\\"description\\\": \\\"Get the current weather given a zip code.\\\",\\n\",\n      \"  \\\"parameters\\\": {\\n\",\n      \"    \\\"type\\\": \\\"object\\\",\\n\",\n      \"    \\\"properties\\\": {\\n\",\n      \"      \\\"zip_code\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      }\\n\",\n      \"    }\\n\",\n      \"  },\\n\",\n      \"  \\\"returns\\\": \\\"Weather\\\"\\n\",\n      \"}\\n\",\n      \"\\n\",\n      \"{\\n\",\n      \"  \\\"name\\\": \\\"calculate_mortgage_payment\\\",\\n\",\n      \"  \\\"description\\\": \\\"Get the monthly mortgage payment given an interest rate percentage.\\\",\\n\",\n      \"  \\\"parameters\\\": {\\n\",\n      \"    \\\"type\\\": \\\"object\\\",\\n\",\n      \"    \\\"properties\\\": {\\n\",\n      \"      \\\"loan_amount\\\": {\\n\",\n      \"        \\\"type\\\": \\\"int\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"interest_rate\\\": {\\n\",\n      \"     
   \\\"type\\\": \\\"float\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"loan_term\\\": {\\n\",\n      \"        \\\"type\\\": \\\"int\\\"\\n\",\n      \"      }\\n\",\n      \"    }\\n\",\n      \"  },\\n\",\n      \"  \\\"returns\\\": \\\"float\\\"\\n\",\n      \"}\\n\",\n      \"\\n\",\n      \"{\\n\",\n      \"  \\\"name\\\": \\\"get_article_details\\\",\\n\",\n      \"  \\\"description\\\": \\\"Get article details from unstructured article text.\\\\ndate_published: formatted as \\\\\\\"MM/DD/YYYY\\\\\\\"\\\",\\n\",\n      \"  \\\"parameters\\\": {\\n\",\n      \"    \\\"type\\\": \\\"object\\\",\\n\",\n      \"    \\\"properties\\\": {\\n\",\n      \"      \\\"title\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"authors\\\": {\\n\",\n      \"        \\\"type\\\": \\\"list[str]\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"short_summary\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"date_published\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"tags\\\": {\\n\",\n      \"        \\\"type\\\": \\\"list[str]\\\"\\n\",\n      \"      }\\n\",\n      \"    }\\n\",\n      \"  },\\n\",\n      \"  \\\"returns\\\": \\\"Article\\\"\\n\",\n      \"}\\n\",\n      \"\\n\",\n      \"To use these functions respond with:\\n\",\n      \"<multiplefunctions>\\n\",\n      \"    <functioncall> {\\\"name\\\": \\\"function_name\\\", \\\"arguments\\\": {\\\"arg_1\\\": \\\"value_1\\\", \\\"arg_2\\\": value_2, ...}} </functioncall>\\n\",\n      \"    <functioncall> {\\\"name\\\": \\\"function_name\\\", \\\"arguments\\\": {\\\"arg_1\\\": \\\"value_1\\\", \\\"arg_2\\\": value_2, ...}} </functioncall>\\n\",\n      \"    ...\\n\",\n      \"</multiplefunctions>\\n\",\n      \"\\n\",\n      \"Edge cases you must handle:\\n\",\n      \"- If there are no functions that match the user request, you will respond 
politely that you cannot help.<|im_end|>\\n\",\n      \"<|im_start|>user\\n\",\n      \"What's the weather in 10001?<|im_end|>\\n\",\n      \"<|im_start|>assistant\\n\",\n      \"<|im_start|>system\\n\",\n      \"You are a helpful assistant with access to the following functions:\\n\",\n      \"\\n\",\n      \"{\\n\",\n      \"  \\\"name\\\": \\\"get_weather\\\",\\n\",\n      \"  \\\"description\\\": \\\"Get the current weather given a zip code.\\\",\\n\",\n      \"  \\\"parameters\\\": {\\n\",\n      \"    \\\"type\\\": \\\"object\\\",\\n\",\n      \"    \\\"properties\\\": {\\n\",\n      \"      \\\"zip_code\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      }\\n\",\n      \"    }\\n\",\n      \"  },\\n\",\n      \"  \\\"returns\\\": \\\"Weather\\\"\\n\",\n      \"}\\n\",\n      \"\\n\",\n      \"{\\n\",\n      \"  \\\"name\\\": \\\"calculate_mortgage_payment\\\",\\n\",\n      \"  \\\"description\\\": \\\"Get the monthly mortgage payment given an interest rate percentage.\\\",\\n\",\n      \"  \\\"parameters\\\": {\\n\",\n      \"    \\\"type\\\": \\\"object\\\",\\n\",\n      \"    \\\"properties\\\": {\\n\",\n      \"      \\\"loan_amount\\\": {\\n\",\n      \"        \\\"type\\\": \\\"int\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"interest_rate\\\": {\\n\",\n      \"        \\\"type\\\": \\\"float\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"loan_term\\\": {\\n\",\n      \"        \\\"type\\\": \\\"int\\\"\\n\",\n      \"      }\\n\",\n      \"    }\\n\",\n      \"  },\\n\",\n      \"  \\\"returns\\\": \\\"float\\\"\\n\",\n      \"}\\n\",\n      \"\\n\",\n      \"{\\n\",\n      \"  \\\"name\\\": \\\"get_article_details\\\",\\n\",\n      \"  \\\"description\\\": \\\"Get article details from unstructured article text.\\\\ndate_published: formatted as \\\\\\\"MM/DD/YYYY\\\\\\\"\\\",\\n\",\n      \"  \\\"parameters\\\": {\\n\",\n      \"    \\\"type\\\": \\\"object\\\",\\n\",\n      \"    \\\"properties\\\": {\\n\",\n     
 \"      \\\"title\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"authors\\\": {\\n\",\n      \"        \\\"type\\\": \\\"list[str]\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"short_summary\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"date_published\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"tags\\\": {\\n\",\n      \"        \\\"type\\\": \\\"list[str]\\\"\\n\",\n      \"      }\\n\",\n      \"    }\\n\",\n      \"  },\\n\",\n      \"  \\\"returns\\\": \\\"Article\\\"\\n\",\n      \"}\\n\",\n      \"\\n\",\n      \"To use these functions respond with:\\n\",\n      \"<multiplefunctions>\\n\",\n      \"    <functioncall> {\\\"name\\\": \\\"function_name\\\", \\\"arguments\\\": {\\\"arg_1\\\": \\\"value_1\\\", \\\"arg_2\\\": value_2, ...}} </functioncall>\\n\",\n      \"    <functioncall> {\\\"name\\\": \\\"function_name\\\", \\\"arguments\\\": {\\\"arg_1\\\": \\\"value_1\\\", \\\"arg_2\\\": value_2, ...}} </functioncall>\\n\",\n      \"    ...\\n\",\n      \"</multiplefunctions>\\n\",\n      \"\\n\",\n      \"Edge cases you must handle:\\n\",\n      \"- If there are no functions that match the user request, you will respond politely that you cannot help.<|im_end|>\\n\",\n      \"<|im_start|>user\\n\",\n      \"Determine the monthly mortgage payment for a loan amount of $200,000, an interest rate of 4%, and a loan term of 30 years.<|im_end|>\\n\",\n      \"<|im_start|>assistant\\n\",\n      \"<|im_start|>system\\n\",\n      \"You are a helpful assistant with access to the following functions:\\n\",\n      \"\\n\",\n      \"{\\n\",\n      \"  \\\"name\\\": \\\"get_weather\\\",\\n\",\n      \"  \\\"description\\\": \\\"Get the current weather given a zip code.\\\",\\n\",\n      \"  \\\"parameters\\\": {\\n\",\n      \"    \\\"type\\\": \\\"object\\\",\\n\",\n      \"    
\\\"properties\\\": {\\n\",\n      \"      \\\"zip_code\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      }\\n\",\n      \"    }\\n\",\n      \"  },\\n\",\n      \"  \\\"returns\\\": \\\"Weather\\\"\\n\",\n      \"}\\n\",\n      \"\\n\",\n      \"{\\n\",\n      \"  \\\"name\\\": \\\"calculate_mortgage_payment\\\",\\n\",\n      \"  \\\"description\\\": \\\"Get the monthly mortgage payment given an interest rate percentage.\\\",\\n\",\n      \"  \\\"parameters\\\": {\\n\",\n      \"    \\\"type\\\": \\\"object\\\",\\n\",\n      \"    \\\"properties\\\": {\\n\",\n      \"      \\\"loan_amount\\\": {\\n\",\n      \"        \\\"type\\\": \\\"int\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"interest_rate\\\": {\\n\",\n      \"        \\\"type\\\": \\\"float\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"loan_term\\\": {\\n\",\n      \"        \\\"type\\\": \\\"int\\\"\\n\",\n      \"      }\\n\",\n      \"    }\\n\",\n      \"  },\\n\",\n      \"  \\\"returns\\\": \\\"float\\\"\\n\",\n      \"}\\n\",\n      \"\\n\",\n      \"{\\n\",\n      \"  \\\"name\\\": \\\"get_article_details\\\",\\n\",\n      \"  \\\"description\\\": \\\"Get article details from unstructured article text.\\\\ndate_published: formatted as \\\\\\\"MM/DD/YYYY\\\\\\\"\\\",\\n\",\n      \"  \\\"parameters\\\": {\\n\",\n      \"    \\\"type\\\": \\\"object\\\",\\n\",\n      \"    \\\"properties\\\": {\\n\",\n      \"      \\\"title\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"authors\\\": {\\n\",\n      \"        \\\"type\\\": \\\"list[str]\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"short_summary\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"date_published\\\": {\\n\",\n      \"        \\\"type\\\": \\\"str\\\"\\n\",\n      \"      },\\n\",\n      \"      \\\"tags\\\": {\\n\",\n      \"        \\\"type\\\": \\\"list[str]\\\"\\n\",\n      
\"      }\\n\",\n      \"    }\\n\",\n      \"  },\\n\",\n      \"  \\\"returns\\\": \\\"Article\\\"\\n\",\n      \"}\\n\",\n      \"\\n\",\n      \"To use these functions respond with:\\n\",\n      \"<multiplefunctions>\\n\",\n      \"    <functioncall> {\\\"name\\\": \\\"function_name\\\", \\\"arguments\\\": {\\\"arg_1\\\": \\\"value_1\\\", \\\"arg_2\\\": value_2, ...}} </functioncall>\\n\",\n      \"    <functioncall> {\\\"name\\\": \\\"function_name\\\", \\\"arguments\\\": {\\\"arg_1\\\": \\\"value_1\\\", \\\"arg_2\\\": value_2, ...}} </functioncall>\\n\",\n      \"    ...\\n\",\n      \"</multiplefunctions>\\n\",\n      \"\\n\",\n      \"Edge cases you must handle:\\n\",\n      \"- If there are no functions that match the user request, you will respond politely that you cannot help.<|im_end|>\\n\",\n      \"<|im_start|>user\\n\",\n      \"What's the current exchange rate for USD to EUR?<|im_end|>\\n\",\n      \"<|im_start|>assistant\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"prompts = [\\n\",\n    \"    \\\"What's the weather in 10001?\\\",\\n\",\n    \"    \\\"Determine the monthly mortgage payment for a loan amount of $200,000, an interest rate of 4%, and a loan term of 30 years.\\\",\\n\",\n    \"    \\\"What's the current exchange rate for USD to EUR?\\\",\\n\",\n    \"]\\n\",\n    \"functions = [get_weather, calculate_mortgage_payment, get_article_details]\\n\",\n    \"\\n\",\n    \"for prompt in prompts:\\n\",\n    \"    print(generate_hermes_prompt(prompt, functions))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no\\n\",\n      \"ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes\\n\",\n      \"ggml_init_cublas: found 1 CUDA devices:\\n\",\n      \"  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5\\n\",\n      \"llama_model_loader: 
loaded meta data with 20 key-value pairs and 291 tensors from ../../models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf (version GGUF V3 (latest))\\n\",\n      \"llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32002,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor    3:              blk.0.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor    8:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor    9:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   10:              blk.1.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   11:              blk.1.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   12:              blk.1.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   13:         blk.1.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   14:            blk.1.ffn_gate.weight q4_K     [  4096, 
14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   15:              blk.1.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   16:            blk.1.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   17:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   18:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   19:              blk.2.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   20:              blk.2.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   21:              blk.2.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   22:         blk.2.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   23:            blk.2.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   24:              blk.2.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   25:            blk.2.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   26:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   27:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   28:              blk.3.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   29:              blk.3.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   30:              blk.3.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n   
   \"llama_model_loader: - tensor   31:         blk.3.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   32:            blk.3.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   33:              blk.3.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   34:            blk.3.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   35:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   36:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   37:              blk.4.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   38:              blk.4.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   39:              blk.4.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   40:         blk.4.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   41:            blk.4.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   42:              blk.4.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   43:            blk.4.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   44:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   45:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   46:              blk.5.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - 
tensor   47:              blk.5.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   48:              blk.5.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   49:         blk.5.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   50:            blk.5.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   51:              blk.5.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   52:            blk.5.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   53:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   54:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   55:              blk.6.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   56:              blk.6.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   57:              blk.6.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   58:         blk.6.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   59:            blk.6.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   60:              blk.6.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   61:            blk.6.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   62:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   63:            
blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   64:              blk.7.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   65:              blk.7.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   66:              blk.7.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   67:         blk.7.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   68:            blk.7.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   69:              blk.7.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   70:            blk.7.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   71:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   72:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   73:              blk.8.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   74:              blk.8.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   75:              blk.8.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   76:         blk.8.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   77:            blk.8.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   78:              blk.8.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   79:            blk.8.ffn_down.weight q4_K     
[ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   80:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   81:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   82:              blk.9.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   83:              blk.9.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   84:              blk.9.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   85:         blk.9.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   86:            blk.9.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   87:              blk.9.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   88:            blk.9.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   89:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   90:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   91:             blk.10.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   92:             blk.10.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   93:             blk.10.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   94:        blk.10.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   95:           blk.10.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 
]\\n\",\n      \"llama_model_loader: - tensor   96:             blk.10.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   97:           blk.10.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   98:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor   99:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  100:             blk.11.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  101:             blk.11.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  102:             blk.11.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  103:        blk.11.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  104:           blk.11.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  105:             blk.11.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  106:           blk.11.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  107:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  108:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  109:             blk.12.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  110:             blk.12.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  111:             blk.12.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      
\"llama_model_loader: - tensor  112:        blk.12.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  113:           blk.12.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  114:             blk.12.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  115:           blk.12.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  116:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  117:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  118:             blk.13.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  119:             blk.13.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  120:             blk.13.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  121:        blk.13.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  122:           blk.13.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  123:             blk.13.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  124:           blk.13.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  125:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  126:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  127:             blk.14.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  
128:             blk.14.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  129:             blk.14.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  130:        blk.14.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  131:           blk.14.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  132:             blk.14.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  133:           blk.14.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  134:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  135:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  136:             blk.15.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  137:             blk.15.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  138:             blk.15.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  139:        blk.15.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  141:             blk.15.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  142:           blk.15.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  143:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  144:           
blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  145:             blk.16.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  146:             blk.16.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  147:             blk.16.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  148:        blk.16.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  149:           blk.16.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  150:             blk.16.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  151:           blk.16.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  152:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  153:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  154:             blk.17.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  155:             blk.17.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  156:             blk.17.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  157:        blk.17.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  158:           blk.17.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  159:             blk.17.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  160:           blk.17.ffn_down.weight q4_K     
[ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  161:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  162:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  163:             blk.18.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  164:             blk.18.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  165:             blk.18.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  166:        blk.18.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  167:           blk.18.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  168:             blk.18.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  169:           blk.18.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  170:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  171:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  172:             blk.19.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  173:             blk.19.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  174:             blk.19.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  175:        blk.19.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  176:           blk.19.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 
]\\n\",\n      \"llama_model_loader: - tensor  177:             blk.19.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  178:           blk.19.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  179:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  180:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  181:             blk.20.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  182:             blk.20.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  183:             blk.20.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  184:        blk.20.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  185:           blk.20.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  186:             blk.20.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  187:           blk.20.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  188:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  189:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  190:             blk.21.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  191:             blk.21.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  192:             blk.21.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      
\"llama_model_loader: - tensor  193:        blk.21.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  194:           blk.21.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  195:             blk.21.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  196:           blk.21.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  197:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  198:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  199:             blk.22.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  200:             blk.22.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  201:             blk.22.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  202:        blk.22.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  203:           blk.22.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  204:             blk.22.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  205:           blk.22.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  206:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  207:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  208:             blk.23.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  
209:             blk.23.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  210:             blk.23.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  211:        blk.23.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  212:           blk.23.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  213:             blk.23.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  214:           blk.23.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  215:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  216:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  217:             blk.24.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  218:             blk.24.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  219:             blk.24.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  220:        blk.24.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  221:           blk.24.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  222:             blk.24.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  223:           blk.24.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  224:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  225:           
blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  226:             blk.25.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  227:             blk.25.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  228:             blk.25.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  229:        blk.25.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  230:           blk.25.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  231:             blk.25.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  232:           blk.25.ffn_down.weight q4_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  233:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  234:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  235:             blk.26.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  236:             blk.26.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  237:             blk.26.attn_v.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  238:        blk.26.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  239:           blk.26.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  240:             blk.26.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  241:           blk.26.ffn_down.weight q4_K     
[ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  242:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  243:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  244:             blk.27.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  245:             blk.27.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  246:             blk.27.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  247:        blk.27.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  248:           blk.27.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  249:             blk.27.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  250:           blk.27.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  251:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  252:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  253:             blk.28.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  254:             blk.28.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  255:             blk.28.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  256:        blk.28.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 
]\\n\",\n      \"llama_model_loader: - tensor  258:             blk.28.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  259:           blk.28.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  260:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  261:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  262:             blk.29.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  263:             blk.29.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  264:             blk.29.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  265:        blk.29.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  267:             blk.29.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  268:           blk.29.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  269:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  270:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  271:             blk.30.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  272:             blk.30.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  273:             blk.30.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      
\"llama_model_loader: - tensor  274:        blk.30.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  276:             blk.30.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  277:           blk.30.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  279:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  280:             blk.31.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  281:             blk.31.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  282:             blk.31.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  283:        blk.31.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  285:             blk.31.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  286:           blk.31.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  289:               output_norm.weight f32      [  4096,     1,     1,     1 ]\\n\",\n      \"llama_model_loader: - tensor  
290:                    output.weight q6_K     [  4096, 32002,     1,     1 ]\\n\",\n      \"llama_model_loader: - kv   0:                       general.architecture str              = llama\\n\",\n      \"llama_model_loader: - kv   1:                               general.name str              = teknium_openhermes-2.5-mistral-7b\\n\",\n      \"llama_model_loader: - kv   2:                       llama.context_length u32              = 32768\\n\",\n      \"llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096\\n\",\n      \"llama_model_loader: - kv   4:                          llama.block_count u32              = 32\\n\",\n      \"llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336\\n\",\n      \"llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128\\n\",\n      \"llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32\\n\",\n      \"llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8\\n\",\n      \"llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010\\n\",\n      \"llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000\\n\",\n      \"llama_model_loader: - kv  11:                          general.file_type u32              = 15\\n\",\n      \"llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama\\n\",\n      \"llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = [\\\"<unk>\\\", \\\"<s>\\\", \\\"</s>\\\", \\\"<0x00>\\\", \\\"<...\\n\",\n      \"llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...\\n\",\n      \"llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = 
[2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...\\n\",\n      \"llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1\\n\",\n      \"llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000\\n\",\n      \"llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0\\n\",\n      \"llama_model_loader: - kv  19:               general.quantization_version u32              = 2\\n\",\n      \"llama_model_loader: - type  f32:   65 tensors\\n\",\n      \"llama_model_loader: - type q4_K:  193 tensors\\n\",\n      \"llama_model_loader: - type q6_K:   33 tensors\\n\",\n      \"llm_load_vocab: special tokens definition check successful ( 261/32002 ).\\n\",\n      \"llm_load_print_meta: format           = GGUF V3 (latest)\\n\",\n      \"llm_load_print_meta: arch             = llama\\n\",\n      \"llm_load_print_meta: vocab type       = SPM\\n\",\n      \"llm_load_print_meta: n_vocab          = 32002\\n\",\n      \"llm_load_print_meta: n_merges         = 0\\n\",\n      \"llm_load_print_meta: n_ctx_train      = 32768\\n\",\n      \"llm_load_print_meta: n_embd           = 4096\\n\",\n      \"llm_load_print_meta: n_head           = 32\\n\",\n      \"llm_load_print_meta: n_head_kv        = 8\\n\",\n      \"llm_load_print_meta: n_layer          = 32\\n\",\n      \"llm_load_print_meta: n_rot            = 128\\n\",\n      \"llm_load_print_meta: n_gqa            = 4\\n\",\n      \"llm_load_print_meta: f_norm_eps       = 0.0e+00\\n\",\n      \"llm_load_print_meta: f_norm_rms_eps   = 1.0e-05\\n\",\n      \"llm_load_print_meta: f_clamp_kqv      = 0.0e+00\\n\",\n      \"llm_load_print_meta: f_max_alibi_bias = 0.0e+00\\n\",\n      \"llm_load_print_meta: n_ff             = 14336\\n\",\n      \"llm_load_print_meta: rope scaling     = linear\\n\",\n      \"llm_load_print_meta: freq_base_train  = 10000.0\\n\",\n      \"llm_load_print_meta: freq_scale_train = 1\\n\",\n      
\"llm_load_print_meta: n_yarn_orig_ctx  = 32768\\n\",\n      \"llm_load_print_meta: rope_finetuned   = unknown\\n\",\n      \"llm_load_print_meta: model type       = 7B\\n\",\n      \"llm_load_print_meta: model ftype      = mostly Q4_K - Medium\\n\",\n      \"llm_load_print_meta: model params     = 7.24 B\\n\",\n      \"llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) \\n\",\n      \"llm_load_print_meta: general.name   = teknium_openhermes-2.5-mistral-7b\\n\",\n      \"llm_load_print_meta: BOS token = 1 '<s>'\\n\",\n      \"llm_load_print_meta: EOS token = 32000 '<|im_end|>'\\n\",\n      \"llm_load_print_meta: UNK token = 0 '<unk>'\\n\",\n      \"llm_load_print_meta: PAD token = 0 '<unk>'\\n\",\n      \"llm_load_print_meta: LF token  = 13 '<0x0A>'\\n\",\n      \"llm_load_tensors: ggml ctx size =    0.11 MiB\\n\",\n      \"llm_load_tensors: using CUDA for GPU acceleration\\n\",\n      \"llm_load_tensors: mem required  =   70.42 MiB\\n\",\n      \"llm_load_tensors: offloading 32 repeating layers to GPU\\n\",\n      \"llm_load_tensors: offloading non-repeating layers to GPU\\n\",\n      \"llm_load_tensors: offloaded 35/35 layers to GPU\\n\",\n      \"llm_load_tensors: VRAM used: 4095.06 MiB\\n\",\n      \"...............................................................................................\\n\",\n      \"llama_new_context_with_model: n_ctx      = 2048\\n\",\n      \"llama_new_context_with_model: freq_base  = 10000.0\\n\",\n      \"llama_new_context_with_model: freq_scale = 1\\n\",\n      \"llama_kv_cache_init: offloading v cache to GPU\\n\",\n      \"llama_kv_cache_init: offloading k cache to GPU\\n\",\n      \"llama_kv_cache_init: VRAM kv self = 256.00 MiB\\n\",\n      \"llama_new_context_with_model: kv self size  =  256.00 MiB\\n\",\n      \"llama_build_graph: non-view tensors processed: 740/740\\n\",\n      \"llama_new_context_with_model: compute buffer total size = 159.07 MiB\\n\",\n      \"llama_new_context_with_model: VRAM scratch buffer: 
156.00 MiB\\n\",\n      \"llama_new_context_with_model: total VRAM used: 4507.07 MiB (model: 4095.06 MiB, context: 412.00 MiB)\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import llama_cpp\\n\",\n    \"\\n\",\n    \"llama = llama_cpp.Llama(\\n\",\n    \"    model_path=\\\"../../models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf\\\",\\n\",\n    \"    n_gpu_layers=-1,\\n\",\n    \"    n_ctx=2048,\\n\",\n    \"    verbose=False,\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[{'name': 'get_weather', 'arguments': {'zip_code': '10001'}}]\\n\",\n      \"====================================================================================================\\n\",\n      \"[{'name': 'calculate_mortgage_payment', 'arguments': {'loan_amount': 200000, 'interest_rate': 0.04, 'loan_term': 30}}]\\n\",\n      \"====================================================================================================\\n\",\n      \"Unfortunately, I do not have a built-in function to check currency exchange rates. 
However, you can use third-party APIs or websites like Google Finance or XE to get this information.\\n\",\n      \"====================================================================================================\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"prompts = [\\n\",\n    \"    \\\"What's the weather in 10001?\\\",\\n\",\n    \"    \\\"Determine the monthly mortgage payment for a loan amount of $200,000, an interest rate of 4%, and a loan term of 30 years.\\\",\\n\",\n    \"    \\\"What's the current exchange rate for USD to EUR?\\\",\\n\",\n    \"]\\n\",\n    \"functions = [get_weather, calculate_mortgage_payment, get_article_details]\\n\",\n    \"\\n\",\n    \"for prompt in prompts:\\n\",\n    \"    prompt = generate_hermes_prompt(prompt, functions)\\n\",\n    \"    completion = llama.create_completion(prompt, max_tokens=-1)[\\\"choices\\\"][0][\\\"text\\\"]\\n\",\n    \"    function_calls = extract_function_calls(completion)\\n\",\n    \"    if function_calls:\\n\",\n    \"        print(function_calls)\\n\",\n    \"    else:\\n\",\n    \"        print(completion.strip())\\n\",\n    \"    print(\\\"=\\\" * 100)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"get_weather\\n\",\n      \"{'zip_code': '05751'}\\n\",\n      \"====================================================================================================\\n\",\n      \"get_weather\\n\",\n      \"{'zip_code': '05751'}\\n\",\n      \"get_weather\\n\",\n      \"{'zip_code': '07030'}\\n\",\n      \"calculate_mortgage_payment\\n\",\n      \"{'loan_amount': 250000, 'interest_rate': 4.18, 'loan_term': 30}\\n\",\n      \"====================================================================================================\\n\",\n      \"I don't have a function to get exchange rates, but I can provide some resources where you 
can find this information. You can check websites like Google Finance, XE.com, or Yahoo Finance for up-to-date currency exchange rates.\\n\",\n      \"====================================================================================================\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"prompts = [\\n\",\n    \"    \\\"What's the weather in 05751?\\\",\\n\",\n    \"    \\\"I'm planning a trip to Killington, Vermont (05751) from Hoboken, NJ (07030). Can you get me weather for both locations and directions?\\\",\\n\",\n    \"    \\\"What's the current exchange rate for USD to EUR?\\\",\\n\",\n    \"]\\n\",\n    \"\\n\",\n    \"for prompt in prompts:\\n\",\n    \"    completion = llama.create_completion(\\n\",\n    \"        generate_hermes_prompt(prompt, functions), max_tokens=-1\\n\",\n    \"    )[\\\"choices\\\"][0][\\\"text\\\"]\\n\",\n    \"    function_calls = extract_function_calls(completion)\\n\",\n    \"\\n\",\n    \"    if function_calls:\\n\",\n    \"        for function in function_calls:\\n\",\n    \"            print(function[\\\"name\\\"])\\n\",\n    \"            print(function[\\\"arguments\\\"])\\n\",\n    \"    else:\\n\",\n    \"        print(completion.strip())\\n\",\n    \"\\n\",\n    \"    print(\\\"=\\\" * 100)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \".venv\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.11.5+\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/notebooks/PerformanceTuning.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import time\\n\",\n    \"import json\\n\",\n    \"import multiprocessing\\n\",\n    \"\\n\",\n    \"import llama_cpp\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"np.int = int\\n\",\n    \"\\n\",\n    \"from skopt.space import Integer, Categorical\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"MODEL_PATH = \\\"../models/ggml-model.bin\\\"\\n\",\n    \"\\n\",\n    \"# Hyperparameters\\n\",\n    \"space = [\\n\",\n    \"    Categorical([True, False], name=\\\"f16_kv\\\"),\\n\",\n    \"    Categorical([True, False], name=\\\"use_mlock\\\"),\\n\",\n    \"    Integer(1, multiprocessing.cpu_count(), name=\\\"n_threads\\\"),\\n\",\n    \"    Integer(1, 2048, name=\\\"n_batch\\\"),\\n\",\n    \"]\\n\",\n    \"\\n\",\n    \"# TODO: Make this a random prompt to avoid any cache related inconsistencies\\n\",\n    \"PROMPT = \\\"\\\"\\\" ### Instructions:\\n\",\n    \"You are a helpful assistant.\\n\",\n    \"You answer questions truthfully and politely.\\n\",\n    \"You are provided with an input from the user and you must generate a response.\\n\",\n    \"Ignore this line which is just filler to test the performane of the model.\\n\",\n    \"### Inputs:\\n\",\n    \"What is the capital of France?\\n\",\n    \"### Response:\\n\",\n    \"\\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"from skopt.utils import use_named_args\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"@use_named_args(space)\\n\",\n    \"def objective(**params):\\n\",\n    \"    f16_kv = params[\\\"f16_kv\\\"]\\n\",\n    \"    use_mlock = params[\\\"use_mlock\\\"]\\n\",\n    \"    n_threads = params[\\\"n_threads\\\"]\\n\",\n    \"    n_batch = params[\\\"n_batch\\\"]\\n\",\n    \"    llm = llama_cpp.Llama(\\n\",\n    \"        model_path=MODEL_PATH,\\n\",\n    \"        f16_kv=f16_kv,\\n\",\n    \"        use_mlock=use_mlock,\\n\",\n    \"     
   n_threads=n_threads,\\n\",\n    \"        n_batch=n_batch,\\n\",\n    \"    )\\n\",\n    \"\\n\",\n    \"    t1 = time.time()\\n\",\n    \"    output = llm(\\n\",\n    \"        PROMPT,\\n\",\n    \"        max_tokens=1,  # Only optimize prompt processing\\n\",\n    \"        stop=[\\\"###\\\", \\\"\\\\n\\\"],\\n\",\n    \"        echo=True,\\n\",\n    \"    )\\n\",\n    \"    t2 = time.time()\\n\",\n    \"\\n\",\n    \"    print(json.dumps(output, indent=2))\\n\",\n    \"    print(f\\\"Time: {t2 - t1} seconds\\\")\\n\",\n    \"    print(f\\\"Time per token: {(t2 - t1) / output['usage']['total_tokens']} seconds\\\")\\n\",\n    \"\\n\",\n    \"    return (t2 - t1) / output[\\\"usage\\\"][\\\"total_tokens\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     
\"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-d4443e14-fed3-4aa1-9e8a-c70f4503aade\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227287,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 10.981224775314331 seconds\\n\",\n      \"Time per token: 0.13726530969142914 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n 
     \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-4181439c-2ced-4ddb-b898-a0a7641f3e47\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227300,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 11.121099948883057 seconds\\n\",\n      \"Time per token: 0.13901374936103822 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      
\"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-03ed5585-3de0-4546-96c3-6de7a5b3770c\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227312,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    
\\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 14.457949876785278 seconds\\n\",\n      \"Time per token: 0.18072437345981598 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-103817fc-bceb-4e99-b968-3ef540f16dc5\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227328,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a 
response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 10.334054946899414 seconds\\n\",\n      \"Time per token: 0.12917568683624267 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": 
\\\"cmpl-41e34acc-6499-450f-9576-3cb37b82c490\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227340,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.012462615966797 seconds\\n\",\n      \"Time per token: 0.11265578269958496 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 
KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-f27244c9-e9c6-4332-ae7f-3856f152ef30\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227350,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 15.59382700920105 seconds\\n\",\n      \"Time per token: 0.1949228376150131 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: 
n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-bc5dc1ba-f7ce-441c-a558-5005f2fb89b9\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227366,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 
15.544022560119629 seconds\\n\",\n      \"Time per token: 0.19430028200149535 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-2006b117-1239-4b85-bcc4-a7439c01f440\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227383,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of 
France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.330769300460815 seconds\\n\",\n      \"Time per token: 0.11663461625576019 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-ee50afee-78a8-4d55-9b73-c74cc2567408\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227393,\\n\",\n      \"  
\\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 14.17799687385559 seconds\\n\",\n      \"Time per token: 0.1772249609231949 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n     
 \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-1e2b7080-940f-4459-8503-a458db4d3578\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227409,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 10.127476215362549 seconds\\n\",\n      \"Time per token: 0.12659345269203187 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      
\"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-c80008a4-191e-4418-821a-b18a4af24f70\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227421,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.495943784713745 seconds\\n\",\n      \"Time per token: 0.11869929730892181 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": 
\"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-d04c9fd2-3c20-4035-9181-0bfd05abfe15\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227432,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      
\\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.226310014724731 seconds\\n\",\n      \"Time per token: 0.11532887518405914 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-04fcf88b-33c7-4b84-aac0-dcb5261363c2\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227443,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": 
\\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 12.182626962661743 seconds\\n\",\n      \"Time per token: 0.15228283703327178 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n   
  ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-14904676-3345-4674-a41c-419d9640b4e0\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227457,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 43.595701694488525 seconds\\n\",\n      \"Time per token: 0.5449462711811066 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      
\"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-9e43b2ef-e7de-4bd2-91bf-284f5b3478fe\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227502,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 14.726518154144287 seconds\\n\",\n      \"Time per token: 0.1840814769268036 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      
\"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-3947538b-e27e-42eb-8f87-2b56e14d104c\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227518,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 
79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.760729789733887 seconds\\n\",\n      \"Time per token: 0.10950912237167358 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-1a0d843e-9613-49aa-b565-0e59d8067615\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227529,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user 
and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 11.672860383987427 seconds\\n\",\n      \"Time per token: 0.14591075479984283 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": 
\\\"cmpl-ccad9270-9554-4f9f-9aaf-387f1a11894d\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227542,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 14.368357419967651 seconds\\n\",\n      \"Time per token: 0.17960446774959565 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 
KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-2623073e-004f-4386-98e0-7e6ea617523a\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227558,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.44194221496582 seconds\\n\",\n      \"Time per token: 0.11802427768707276 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: 
n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-1a199f09-0d74-4052-a191-7a8ef2df57f3\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227569,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 
11.253167629241943 seconds\\n\",\n      \"Time per token: 0.14066459536552428 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-2b61e491-d9b7-4d0b-b0c8-9f8ba822599d\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227582,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of 
France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 12.381825685501099 seconds\\n\",\n      \"Time per token: 0.15477282106876372 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-0e4b4575-6278-4bd8-a4c5-ddb772014f7d\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227596,\\n\",\n      \" 
 \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 14.473106145858765 seconds\\n\",\n      \"Time per token: 0.18091382682323456 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n  
    \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-1ad3e3db-5120-41c8-8f9e-2ca07a846437\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227612,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 16.591509103775024 seconds\\n\",\n      \"Time per token: 0.2073938637971878 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      
\"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-34c8fb5c-fa49-4ea6-b2e7-ba3b958e297d\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227630,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.034043788909912 seconds\\n\",\n      \"Time per token: 0.1129255473613739 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": 
\"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-8d5c56eb-0b43-4591-a9ac-c1ec174ec6db\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227641,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      
\\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 11.218972444534302 seconds\\n\",\n      \"Time per token: 0.14023715555667876 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-bfdc554b-baa6-47c1-b35f-0f7d1321255a\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227654,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": 
\\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.300573110580444 seconds\\n\",\n      \"Time per token: 0.11625716388225556 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n    
 ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-ad67d78b-6975-4789-982e-3653c7fca7e1\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227665,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.009618520736694 seconds\\n\",\n      \"Time per token: 0.11262023150920868 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n   
   \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-2eec3e0f-dd48-4c3a-9430-c5048827f557\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227676,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.997699737548828 seconds\\n\",\n      \"Time per token: 0.11247124671936035 
seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-b129732a-8d7b-4382-baaf-740378c923ec\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227686,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an 
input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.252354621887207 seconds\\n\",\n      \"Time per token: 0.11565443277359008 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: 
model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-bb25c002-69e0-40ec-8099-0ba4462338aa\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227697,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.040243864059448 seconds\\n\",\n      \"Time per token: 0.1130030483007431 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: 
n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-63705814-7c93-4d6b-a9f2-0579941ebf54\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227708,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  
}\\n\",\n      \"}\\n\",\n      \"Time: 8.947132349014282 seconds\\n\",\n      \"Time per token: 0.11183915436267852 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-8afe123b-423d-4757-82d9-15fc12cfd24e\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227720,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the 
model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 10.335533857345581 seconds\\n\",\n      \"Time per token: 0.12919417321681975 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-4937353f-e66f-4632-aea7-dd1133af9727\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n     
 \"  \\\"created\\\": 1680227732,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.99415397644043 seconds\\n\",\n      \"Time per token: 0.11242692470550537 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading 
tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-78f86527-ccc7-4a5d-9b7f-38386998ba2a\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227743,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 15.732706308364868 seconds\\n\",\n      \"Time per token: 0.19665882885456085 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      
\"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-4d98c564-fcb4-45ec-9f8d-f64430abbfb3\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227761,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.319743633270264 seconds\\n\",\n      \"Time per token: 0.11649679541587829 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": 
\"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-ee855931-2578-45bc-93bf-319c4e6aa43a\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227772,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": 
null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 15.189301490783691 seconds\\n\",\n      \"Time per token: 0.18986626863479614 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-14f0b547-4d71-4a7f-a3d6-3127998903b3\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227790,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n   
   \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.464989423751831 seconds\\n\",\n      \"Time per token: 0.11831236779689788 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self 
size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-4eb5258a-5836-414c-88f6-e217bacaded6\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227801,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 13.818569660186768 seconds\\n\",\n      \"Time per token: 0.1727321207523346 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 
1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-66b7c783-d506-45c1-b39b-c91666a02b44\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227817,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 27.316773176193237 seconds\\n\",\n      \"Time per token: 0.34145966470241546 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      
\"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-d53b48ca-30e2-43c2-9fb5-62ef6a65fafa\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227847,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 
79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.132777214050293 seconds\\n\",\n      \"Time per token: 0.11415971517562866 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-d0909f83-5caa-4098-a0e6-9b2ad1e2b12f\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227858,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user 
and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.273045539855957 seconds\\n\",\n      \"Time per token: 0.11591306924819947 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": 
\\\"cmpl-7045f5c7-cf5d-48e3-9353-032c320e56fa\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227870,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.90743088722229 seconds\\n\",\n      \"Time per token: 0.11134288609027862 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 
KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-e623667d-d6cc-4908-a648-60380f723592\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227881,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.06355595588684 seconds\\n\",\n      \"Time per token: 0.11329444944858551 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: 
n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-44ec163c-25dd-40ae-a786-d8b4c9ff31b1\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227892,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 
9.249061107635498 seconds\\n\",\n      \"Time per token: 0.11561326384544372 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-cb435214-0d20-4566-b312-68d8960ebe25\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227903,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of 
France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.296529054641724 seconds\\n\",\n      \"Time per token: 0.11620661318302154 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-dc704f52-bed9-44f0-8335-a2ec4af3a27c\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227914,\\n\",\n      \"  
\\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 12.455670356750488 seconds\\n\",\n      \"Time per token: 0.1556958794593811 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n    
  \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-67570fa5-1c3d-47d6-b7c6-b3a734aae3f5\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227928,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.269653558731079 seconds\\n\",\n      \"Time per token: 0.11587066948413849 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      
\"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-4bd6c6f2-9849-4047-93c8-88b1914ef184\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227939,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.308398485183716 seconds\\n\",\n      \"Time per token: 0.11635498106479644 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": 
\"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-6413afd7-fdc1-4c28-864d-6acdf2775060\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227950,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      
\\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 10.430264711380005 seconds\\n\",\n      \"Time per token: 0.13037830889225005 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-c4e1c14a-3b8a-4ab3-b42a-f47440f79962\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227962,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": 
\\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.389702558517456 seconds\\n\",\n      \"Time per token: 0.1173712819814682 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     
]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-ac307870-dc67-42b8-8bb8-bb8d3083cea2\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227974,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 10.35448431968689 seconds\\n\",\n      \"Time per token: 0.12943105399608612 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      
\"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-58c06f3e-3fba-4e23-b12e-141a1742c51b\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227986,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.097248792648315 seconds\\n\",\n      \"Time per token: 0.11371560990810395 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      
\"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-b5eccb52-85e3-41d0-b8d8-f35e68bf7997\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680227997,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 
79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 12.466306686401367 seconds\\n\",\n      \"Time per token: 0.1558288335800171 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-e1dbc2ee-abc0-4891-a474-386d97b521b6\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228011,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user 
and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 11.436015367507935 seconds\\n\",\n      \"Time per token: 0.14295019209384918 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": 
\\\"cmpl-fd9bce6d-0a33-4c24-90b3-913ab3b33d24\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228025,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 14.052912712097168 seconds\\n\",\n      \"Time per token: 0.1756614089012146 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 
KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-038fa38d-7640-40ee-907c-0bb131c20d80\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228040,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.250384330749512 seconds\\n\",\n      \"Time per token: 0.1156298041343689 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: 
n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-d00a2058-9fda-4113-8e5e-bf0f39cef238\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228051,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 
9.228248834609985 seconds\\n\",\n      \"Time per token: 0.11535311043262482 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-f8d90e63-4939-491c-9775-fc15aa55505e\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228062,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of 
France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.341724395751953 seconds\\n\",\n      \"Time per token: 0.11677155494689942 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-9e3777bc-119a-46bf-bdd3-21557e686f3c\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228074,\\n\",\n      \"  
\\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.285743951797485 seconds\\n\",\n      \"Time per token: 0.11607179939746856 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n    
  \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-123eaa35-110b-4f73-ba60-fa8a75ea929c\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228085,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.105633020401001 seconds\\n\",\n      \"Time per token: 0.1138204127550125 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      
\"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-cc095f4b-8047-446e-a9f5-c798a66d1003\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228096,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.305238485336304 seconds\\n\",\n      \"Time per token: 0.1163154810667038 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": 
\"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-e2e69b3e-7742-4534-b21f-adfe53345820\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228108,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      
\\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.190222263336182 seconds\\n\",\n      \"Time per token: 0.11487777829170227 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-666ae55e-d837-4534-b8e6-9f1b01f69778\\\",\\n\",\n      
\"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228120,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.126368999481201 seconds\\n\",\n      \"Time per token: 0.11407961249351502 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 
2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-63bdfa8e-b7c3-4669-ab76-54cdbb8878d5\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228131,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.136119604110718 seconds\\n\",\n      \"Time per token: 0.11420149505138397 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 
32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-1ec02c53-c7c8-434e-b28f-70884f8c35b2\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228143,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.126901626586914 seconds\\n\",\n      \"Time per token: 
0.11408627033233643 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-3ec3495b-009a-4a82-b444-d8c1c6bf20a1\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228154,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"    
  \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.08673644065857 seconds\\n\",\n      \"Time per token: 0.11358420550823212 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-17fd0e6b-7ac3-494f-9e85-4e4a26013ad9\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228165,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n   
   \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.252317428588867 seconds\\n\",\n      \"Time per token: 0.11565396785736085 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num 
tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-14a2647f-3961-4b60-b20a-ae9872c34feb\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228177,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 11.389162302017212 seconds\\n\",\n      \"Time per token: 0.14236452877521516 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: 
n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-fa0e5edd-e9c9-40b9-bc9b-c48b8762850c\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228190,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.433730125427246 seconds\\n\",\n      \"Time per token: 0.11792162656784058 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from 
'../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-2b1c5964-265a-488a-8d8f-7e0692fcf96f\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228202,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  
\\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 47.81757044792175 seconds\\n\",\n      \"Time per token: 0.5977196305990219 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-516fbd4c-3fe5-4945-bfc5-7312f2c02687\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228252,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and 
politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.540155410766602 seconds\\n\",\n      \"Time per token: 0.10675194263458251 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from 
'../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-94c9ab1f-ac6e-4fc7-bcd9-7ab96515a722\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228262,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.660873889923096 seconds\\n\",\n      \"Time per token: 0.10826092362403869 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      
\"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-63b1e1a7-0c6b-42e0-ba65-6f42d6ec77bb\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228273,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    
\\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.815936088562012 seconds\\n\",\n      \"Time per token: 0.11019920110702515 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-92e1a879-2ebd-4299-b86e-90c87762db45\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228284,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  
\\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.12400484085083 seconds\\n\",\n      \"Time per token: 0.11405006051063538 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 
291\\n\",\n      \"llama_init_from_file: kv self size  =  512.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-033ea9dc-fffe-41a0-a695-d647f725ee97\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228296,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 13.992429971694946 seconds\\n\",\n      \"Time per token: 0.17490537464618683 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      
\"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-5153f39a-589a-4b3d-8642-8efce64fc439\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228312,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n    
  \"Time: 9.084643125534058 seconds\\n\",\n      \"Time per token: 0.11355803906917572 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-af9ea5c6-5449-43b4-9e50-da930af8d6b8\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228323,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful 
assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.076856851577759 seconds\\n\",\n      \"Time per token: 0.11346071064472199 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      
\"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-5bbea5c1-ea8c-4599-bf63-a6eb80bc7525\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228334,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.02251124382019 seconds\\n\",\n      \"Time per token: 0.11278139054775238 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      
\"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-ff9d87c7-e2b1-4481-9e8f-848d7a0fbd35\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228346,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 
79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.012435913085938 seconds\\n\",\n      \"Time per token: 0.11265544891357422 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-3dbe8ae4-c9ca-4a1b-abaf-6b85ef648ba9\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228357,\\n\",\n      \"  \\\"model\\\": 
\\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.997032880783081 seconds\\n\",\n      \"Time per token: 0.11246291100978852 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 
MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-b20a3b61-9c8b-4b2e-bb43-8ed9ce5a9d0d\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228369,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.042449951171875 seconds\\n\",\n      \"Time per token: 0.11303062438964843 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  
warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-9c781d69-83e0-415a-ac97-252508b10590\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228380,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n   
   \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.058239459991455 seconds\\n\",\n      \"Time per token: 0.11322799324989319 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": 
\\\"cmpl-86cead9e-780f-4503-831c-466a6abd5ab2\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228392,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.070426940917969 seconds\\n\",\n      \"Time per token: 0.1133803367614746 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n     
 \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-65361c7e-74ef-4566-bad5-c6b3867a7f7e\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228403,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.985144138336182 seconds\\n\",\n      \"Time per token: 0.11231430172920227 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      
\"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-23feb1ca-8103-46d8-ab71-b4da59f05d16\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228415,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the 
model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.999938011169434 seconds\\n\",\n      \"Time per token: 0.11249922513961792 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    
},\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-0db73f26-9ab1-4a78-a11f-e22d915ffae2\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228426,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.969520330429077 seconds\\n\",\n      \"Time per token: 0.11211900413036346 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      
\"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-54e6edeb-99ea-46ed-8735-5185f78c222c\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228438,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.12838339805603 seconds\\n\",\n      \"Time per token: 0.11410479247570038 
seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-bd6502fd-f8c7-41d8-ab15-b10ca6aabd96\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228450,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an 
input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.01610016822815 seconds\\n\",\n      \"Time per token: 0.11270125210285187 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: 
model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-72733563-53f5-4cd5-a4eb-48656408b2d8\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228461,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.993805408477783 seconds\\n\",\n      \"Time per token: 0.11242256760597229 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      
\"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-f7365eaa-fd68-422b-bbca-c6bcbcad36e0\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228473,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 
80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.292223930358887 seconds\\n\",\n      \"Time per token: 0.11615279912948609 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-1cfcf44a-c692-4020-8dcb-e6da8b163920\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228485,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": 
\\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.99638295173645 seconds\\n\",\n      \"Time per token: 0.11245478689670563 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 
1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-8b679f09-bc0e-4fc9-a935-9fefd9126993\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228497,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.972327709197998 seconds\\n\",\n      \"Time per token: 0.11215409636497498 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please 
wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-08cb0cd7-84d8-4193-a20c-5a6ca4b5e404\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228508,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    
\\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.024793863296509 seconds\\n\",\n      \"Time per token: 0.11280992329120636 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-ffe4b2b8-c041-4492-9e03-ab79cd4fd60d\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228520,\\n\",\n      \"  \\\"model\\\": 
\\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.996853351593018 seconds\\n\",\n      \"Time per token: 0.11246066689491271 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 
MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-196bb891-9299-4f91-9f68-ba6c7233a2dd\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228532,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.039422273635864 seconds\\n\",\n      \"Time per token: 0.1129927784204483 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  
warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-e50f5489-b40c-4a5d-9cb2-4a6d13bbb8c7\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228544,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n   
   \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 8.978781461715698 seconds\\n\",\n      \"Time per token: 0.11223476827144623 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": 
\\\"cmpl-210cc2b8-df35-4d3f-a34a-a5facb635ec0\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228555,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.032035827636719 seconds\\n\",\n      \"Time per token: 0.11290044784545898 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n    
  \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-e3c7ca0d-c4cb-495c-9210-4e1ed3b6010d\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228567,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.0346040725708 seconds\\n\",\n      \"Time per token: 0.11293255090713501 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      
\"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-7b4388c9-fe89-486d-83f4-34eec8940c42\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228579,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the 
model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.016223907470703 seconds\\n\",\n      \"Time per token: 0.11270279884338379 seconds\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/home/andrei/Documents/llms/.venv/lib/python3.8/site-packages/skopt/optimizer/optimizer.py:449: UserWarning: The objective has been evaluated at this point before.\\n\",\n      \"  warnings.warn(\\\"The objective has been evaluated \\\"\\n\",\n      \"llama_model_load: loading model from '../models/ggml-model.bin' - please wait ...\\n\",\n      \"llama_model_load: n_vocab = 32000\\n\",\n      \"llama_model_load: n_ctx   = 512\\n\",\n      \"llama_model_load: n_embd  = 4096\\n\",\n      \"llama_model_load: n_mult  = 256\\n\",\n      \"llama_model_load: n_head  = 32\\n\",\n      \"llama_model_load: n_layer = 32\\n\",\n      \"llama_model_load: n_rot   = 128\\n\",\n      \"llama_model_load: f16     = 2\\n\",\n      \"llama_model_load: n_ff    = 11008\\n\",\n      \"llama_model_load: n_parts = 1\\n\",\n      \"llama_model_load: type    = 1\\n\",\n      \"llama_model_load: ggml map size = 4017.70 MB\\n\",\n      \"llama_model_load: ggml ctx size =  81.25 KB\\n\",\n      \"llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)\\n\",\n      \"llama_model_load: loading tensors from '../models/ggml-model.bin'\\n\",\n      \"llama_model_load: model size =  4017.27 MB / num tensors = 291\\n\",\n      \"llama_init_from_file: kv self size  =  256.00 MB\\n\"\n     ]\n    
},\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\n\",\n      \"  \\\"id\\\": \\\"cmpl-81211a9b-16e4-4876-8e09-b0e619d93ce7\\\",\\n\",\n      \"  \\\"object\\\": \\\"text_completion\\\",\\n\",\n      \"  \\\"created\\\": 1680228591,\\n\",\n      \"  \\\"model\\\": \\\"../models/ggml-model.bin\\\",\\n\",\n      \"  \\\"choices\\\": [\\n\",\n      \"    {\\n\",\n      \"      \\\"text\\\": \\\" ### Instructions:\\\\nYou are a helpful assistant.\\\\nYou answer questions truthfully and politely.\\\\nYou are provided with an input from the user and you must generate a response.\\\\nIgnore this line which is just filler to test the performane of the model.\\\\n### Inputs:\\\\nWhat is the capital of France?\\\\n### Response:\\\\nThe\\\",\\n\",\n      \"      \\\"index\\\": 0,\\n\",\n      \"      \\\"logprobs\\\": null,\\n\",\n      \"      \\\"finish_reason\\\": \\\"length\\\"\\n\",\n      \"    }\\n\",\n      \"  ],\\n\",\n      \"  \\\"usage\\\": {\\n\",\n      \"    \\\"prompt_tokens\\\": 79,\\n\",\n      \"    \\\"completion_tokens\\\": 1,\\n\",\n      \"    \\\"total_tokens\\\": 80\\n\",\n      \"  }\\n\",\n      \"}\\n\",\n      \"Time: 9.10002589225769 seconds\\n\",\n      \"Time per token: 0.11375032365322113 seconds\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from skopt import gp_minimize\\n\",\n    \"\\n\",\n    \"res = gp_minimize(objective, space)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"image/png\": 
\"iVBORw0KGgoAAAANSUhEUgAAA1cAAANACAYAAADHEZfTAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdeVxU5f4H8M+AbLKqbIIoKi64AAqK61ULcym37GaFgvzK23VJhdy4JaipuKSRS6KWa5aa2WJuVylNFNMgxRRxQ8AFRFGGRUFm5vfHXKdGFgc8M2eWz/v1mtdhnjnnPN/DPXb5zvc8zyNRKBQKEBERERER0XMxEzsAIiIiIiIiY8DkioiIiIiISABMroiIiIiIiATA5IqIiIiIiEgATK6IiIiIiIgEwOSKiIiIiIhIAEyuiIiIiIiIBMDkioiIiIiISABMroiIiIiIiATA5IqeSaFQ4F//+hcaNmwIiUSCM2fOaL1PiUSC77//Xuv9EBEREREJhckVPdOBAwewadMm/PTTT7h9+zakUimGDBkCDw+PGpOg9PR0DB06FI6OjrC1tUWXLl2QnZ2t2+CJiIiIiHSEyRU909WrV9G4cWP06NED7u7uKCkpgb+/P1avXl3jMb169ULbtm1x5MgRpKWlYfbs2bC2ttZh5EREREREulNP7ABIv40dOxabN28GoHxUr1mzZrh+/ToGDRpU43EffPABBg8ejCVLlqjaWrZsWec4YmNjsW7dOhw8eBDbt29HYmIifvvtN7V9/P39MXLkSMTExNS5HyIiIiKiumLlimr06aefYt68eWjSpAlu376N06dPP/MYuVyOvXv3onXr1hgwYABcXV0RHBxcpzFUCoUC7733HrZs2YJjx47Bz88PoaGhOHXqFK5evara7/z580hLS8Nbb71V6z6IiIiIiITA5Ipq5OjoCHt7e5ibm8Pd3R0uLi7PPObOnTsoLi7GokWLMHDgQPz3v//FiBEj8Oqrr+Lo0aMa911RUYHRo0cjMTERSUlJ8PHxAQC0b98e/v7++Oqrr1T7btu2DcHBwap9iIiIiIh0jckVCU4ulwMAhg0bhsjISAQEBGDWrFl45ZVXkJCQoPF5IiMj8dtvv+HXX3+Fp6en2mehoaGq5EqhUODrr79GaGiocBdBRERERFRLTK5IcM7OzqhXrx7atWun1u7r61ur2QL79++Pmzdv4uDBg5U+e/PNN5GRkYHU1FScOHECOTk5GDVq1HPHTkRERERUV5zQggRnaWmJLl26ICMjQ6390qVLaNasmcbnGTp0KIYMGYK33noL5ubmeOONN1SfNWnSBH369MG2bdvw8OFD9O/fH66uroJdAxERERFRbTG5olorLi7GlStXVO8zMzNx5swZNGzYEE2bNgUATJ8+HaNGjcI//vEP9OvXDwcOHMCePXtw5MiRWvU1YsQIbN26FWPGjEG9evXw2muvqT4LDQ1FbGwsysvL8cknnwhybUREREREdcXkimrt999/R79+/VTvo6KiAADh4eHYtGkTAGVSlJCQgLi4OEyePBlt2rTBt99+i169etW6v9deew1yuRxjxoyBmZkZXn31VVX7pEmTYG5ujuHDhz/3dRERERERPQ+JQqFQiB0EERERERGRoeOEFkRERERERAJgckU6t23bNtjZ2VX5at++vdjhERERERHVCR8LJJ0rKipCXl5elZ9ZWFjUakZBIiIiIiJ9weSKiIiIiIhIAHwskIiIiIiISABMroj02KZNm+Dk5CR2GERERESkASZXT5FIJDW+5syZI3aIZIDGjh1b5f3098WYiYiIiMiwcRHhp9y+fVv1844dOxATE4OMjAxVm52dnepnhUIBmUyGevX4a6RnGzhwIDZu3KjW5uLiIlI0RERERCQ0Vq6e4u7urno5OjpCIpGo3l+8eBH29vbYv38/AgMDYWVlhaSkJIwdOxbDhw9XO8/UqVPRt29f1Xu5XI64uDg0b94cNjY28Pf3x65du3
R7cSQqKysrtfvL3d0dn376KTp27AhbW1t4eXlhwoQJKC4urvYcZ8+eRb9+/WBvbw8HBwcEBgbi999/V32elJSE3r17w8bGBl5eXpg8eTJKSkp0cXlEREREJo/JVR3MmjULixYtQnp6Ovz8/DQ6Ji4uDlu2bEFCQgLOnz+PyMhIjB49GkePHtVytKTPzMzMsGLFCpw/fx6bN2/Gzz//jBkzZlS7f2hoKJo0aYLTp08jJSUFs2bNgoWFBQDg6tWrGDhwIEaOHIm0tDTs2LEDSUlJmDRpkq4uh4iIiMik8Xm2Opg3bx769++v8f5lZWVYuHAhDh8+jO7duwMAWrRogaSkJKxduxZ9+vTRVqikR3766Se1x0oHDRqEb775RvXe29sb8+fPx7///W989tlnVZ4jOzsb06dPR9u2bQEArVq1Un0WFxeH0NBQTJ06VfXZihUr0KdPH6xZswbW1tZauCoiIiIieoLJVR0EBQXVav8rV66gtLS0UkJWXl6OTp06CRka6bF+/fphzZo1qve2trY4fPgw4uLicPHiRUilUlRUVODRo0coLS1F/fr1K50jKioK77zzDrZu3YqQkBD885//RMuWLQEoHxlMS0vDtm3bVPsrFArI5XJkZmbC19dX+xdJREREZMKYXNWBra2t2nszMzM8vRbz48ePVT8/GUOzd+9eeHp6qu1nZWWlpShJ39ja2sLHx0f1/vr163jllVcwfvx4LFiwAA0bNkRSUhLefvttlJeXV5lczZkzB2+99Rb27t2L/fv3IzY2Ftu3b8eIESNQXFyMd999F5MnT650XNOmTbV6bURERETE5EoQLi4u+PPPP9Xazpw5oxoL065dO1hZWSE7O5uPAJJKSkoK5HI5li1bBjMz5fDHnTt3PvO41q1bo3Xr1oiMjMSbb76JjRs3YsSIEejcuTMuXLiglsARERERke5wQgsBvPDCC/j999+xZcsWXL58GbGxsWrJlr29PaZNm4bIyEhs3rwZV69eRWpqKlauXInNmzeLGDmJycfHB48fP8bKlStx7do1bN26FQkJCdXu//DhQ0yaNAlHjhxBVlYWjh8/jtOnT6se95s5cyZOnDiBSZMm4cyZM7h8+TJ++OEHTmhBREREpCNMrgQwYMAAzJ49GzNmzECXLl1QVFSEsLAwtX0++ugjzJ49G3FxcfD19cXAgQOxd+9eNG/eXKSoSWz+/v5Yvnw5Fi9ejA4dOmDbtm2Ii4urdn9zc3Pcu3cPYWFhaN26NV5//XUMGjQIc+fOBQD4+fnh6NGjuHTpEnr37o1OnTohJiYGHh4eurokIiIiIpMmUTw9WIiIiIiIiIhqjZUrIiIiIiIiATC5IiIiIiIiEgCTKyIiIiIiIgEwuSIiIiIiIhIAkysiIiIiIiIBMLkiIiIiIiISAJMrAZSVlWHOnDkoKysTOxQyQry/iIiIiAwD17kSgFQqhaOjIwoLC+Hg4CB2OGRkeH8RERERGQZWroiIiIiIiATA5IqIiIiIiEgA9cQOwFDI5XLcunUL9vb2kEgkap9JpVK1LZGQnnV/KRQKFBUVwcPDA2Zm/L6EiIiISCwcc6WhGzduwMvLS+wwiKqVk5ODJk2aaLz/6tWrsXTpUuTm5sLf3x8rV65E165dq93/wYMH+OCDD7B7924UFBSgWbNmiI+Px+DBg4UIn4iIiMjgsXKlIXt7ewDKP2ArTSqQkwPExwNTpwJMwEjHpFIpvLy8VPeoJnbs2IGoqCgkJCQgODgY8fHxGDBgADIyMuDq6lpp//LycvTv3x+urq7YtWsXPD09kZWVBScnJwGvhIiIiMiwsXKloRpnbEtNBQIDgZQUoHNncQIkk1WX2QSDg4PRpUsXrFq1CoDysVcvLy+89957mDVrVqX9ExISsHTpUly8eBEWFhaCxk9ERERkLDhAoxplZWWQSqVqLyJ99vT9Wt26WOXl5UhJSUFISIiqzczMDCEhIUhOTq7ymB9//BHdu3fHxIkT4e
bmhg4dOmDhwoWQyWRauRYiIiIiQ8TkqhpxcXFwdHRUvTjeivSdl5eX2j0bFxdX5X53796FTCaDm5ubWrubmxtyc3OrPObatWvYtWsXZDIZ9u3bh9mzZ2PZsmWYP3++4NdBREREZKg45qoa0dHRiIqKUr1/Mq6FSF89PR7QyspKsHPL5XK4urpi3bp1MDc3R2BgIG7evImlS5ciNjZWsH6IiIiIDBmTq2pYWVlp/sepqysQGancEonEwcFBozFXzs7OMDc3R15enlp7Xl4e3N3dqzymcePGsLCwgLm5uarN19cXubm5KC8vh6Wl5fMFT0RERGQE+FigEJo0AZYvV26J9JylpSUCAwORmJioapPL5UhMTET37t2rPKZnz564cuUK5HK5qu3SpUto3LgxEysiIiKi/2FyJYTiYiA5WbklMgBRUVFYv349Nm/ejPT0dIwfPx4lJSWIiIgAAISFhSE6Olq1//jx41FQUIApU6bg0qVL2Lt3LxYuXIiJEyeKdQlEREREeoePBQrh0iWgRw9OxU4GY9SoUcjPz0dMTAxyc3MREBCAAwcOqCa5yM7OhpnZX9+9eHl54eDBg4iMjISfnx88PT0xZcoUzJw5U6xLICIiItI7XOdKQ1znivRVXda5IiIiIiLh8bFAIiIiIiIiATC5IiIiIiIiEgCTKyHUqwc4Oyu3RERERERkkpgNCMHPD8jPFzsKIiIiIiISEStXREREREREAmByJYTz5wEfH+WWiIiIiIhMEpMrIZSVAVevKrdERERERGSSmFwRkdZVVFTg8OHDWLt2LYqKigAAt27dQnFxsciREREREQmHE1oQkVZlZWVh4MCByM7ORllZGfr37w97e3ssXrwYZWVlSEhIEDtEIiIiIkGwckVEWjVlyhQEBQXh/v37sLGxUbWPGDECiYmJIkZGREREJCxWroTg4wMcOKDcEpGaY8eO4cSJE7C0tFRr9/b2xs2bN0WKioiIiEh4TK6E4OAADBggdhREekkul0Mmk1Vqv3HjBuzt7UWIiIiIiEg7+FigEG7fBubMUW6JSM1LL72E+Ph41XuJRILi4mLExsZi8ODB4gVGREREJDCJQqFQiB2EIZBKpXB0dERhYSEcHBzUP0xNBQIDgZQUoHNncQIkk1XjvakHbty4gQEDBkChUODy5csICgrC5cuX4ezsjF9//RWurq5ih0hEREQkCD4WSERa1aRJE5w9exY7duzA2bNnUVxcjLfffhuhoaFqE1wQERERGTomV0SkdfXq1UNoaChCQ0PFDoWIiIhIazjmioi0Ki4uDhs2bKjUvmHDBixevFiEiIiIiIi0g8mVEBo0AEJDlVsiUrN27Vq0bdu2Unv79u25gDAREREZFT4WKITmzYEvvxQ7CiK9lJubi8aNG1dqd3FxwW3OsElERERGhJUrITx6BFy5otwSkRovLy8cP368Uvvx48fh4eEhQkRERERE2sHKlRAuXOBU7ETVGDduHKZOnYrHjx/jhRdeAAAkJiZixowZeP/990WOjoiIiEg4TK6ISKumT5+Oe/fuYcKECSgvLwcAWFtbY+bMmYiOjhY5OiIiIiLh8LFAIhO1evVqeHt7w9raGsHBwTh16lS1+27atAkSiUTtZW1trVE/EokEixcvRn5+Pk6ePImzZ8+ioKAAMTExQl0KERERkV5g5YrIBO3YsQNRUVFISEhAcHAw4uPjMWDAAGRkZMDV1bXKYxwcHJCRkaF6L5FIatWnnZ0dunTp8lxxExEREekzJldEJmj58uUYN24cIiIiAAAJCQnYu3cvNmzYgFmzZlV5jEQigbu7e637KikpwaJFi5CYmIg7d+5ALperfX7t2rXaXwARERGRHmJyVY2ysjKUlZWp3kul0up37twZUCh0EBVR9Z6+R62srGBlZVVpv/LycqSkpKiNdzIzM0NISAiSk5OrPX9xcTGaNWsGuVyOzp07Y+HChWjfvv0z43rnnXdw9OhRjBkzBo0bN651xY
uIiIjIUDC5qkZcXBzmzp0rdhhEGvPy8lJ7Hxsbizlz5lTa7+7du5DJZHBzc1Nrd3Nzw8WLF6s8d5s2bbBhwwb4+fmhsLAQH3/8MXr06IHz58+jSZMmNca1f/9+7N27Fz179qzdBREREREZGE5oUY3o6GgUFhaqXjk5OdXvnJEBdO+u3BKJJCcnR+2eFXImvu7duyMsLAwBAQHo06cPdu/eDRcXF6xdu/aZxzZo0AANGzYULBYiIiIifcXkqhpWVlZwcHBQe1WrpAQ4eVK5JRLJ0/drVY8EAoCzszPMzc2Rl5en1p6Xl6fxmCoLCwt06tQJV65ceea+H330EWJiYlBaWqrRuYmIiIgMFR8LJDIxlpaWCAwMRGJiIoYPHw4AkMvlSExMxKRJkzQ6h0wmw7lz5zB48OBn7rts2TJcvXoVbm5u8Pb2hoWFhdrnqamptb4GIiIiIn3E5IrIBEVFRSE8PBxBQUHo2rUr4uPjUVJSopo9MCwsDJ6enoiLiwMAzJs3D926dYOPjw8ePHiApUuXIisrC++8884z+3qSwBEREREZOyZXRCZo1KhRyM/PR0xMDHJzcxEQEIADBw6oJrnIzs6GmdlfTw3fv38f48aNQ25uLho0aIDAwECcOHEC7dq1e2ZfsbGxWrsOIiIiIn0iUSg4h7gmpFIpHB0dUVhYWHn8VUEBsG8fMHgwwIH7pGM13pt64sGDB9i1axeuXr2K6dOno2HDhkhNTYWbmxs8PT3FDo+IiIhIEKxcCaFhQ2D0aLGjINJLaWlpCAkJgaOjI65fv45x48ahYcOG2L17N7Kzs7FlyxaxQyQiIiISBGcLFEJ+PrB6tXJLRGqioqIwduxYXL58GdbW1qr2wYMH49dffxUxMiIiIiJhMbkSQk4OMGmScktEak6fPo133323Urunpydyc3NFiIiIiIhIO5hcEZFWWVlZQSqVVmq/dOkSXFxcRIiIiIiISDuYXBGRVg0dOhTz5s3D48ePAQASiQTZ2dmYOXMmRo4cKXJ0RERERMJhckVEWrVs2TIUFxfD1dUVDx8+RJ8+feDj4wN7e3ssWLBA7PCIiIiIBMPZAoVgbw+89JJyS0RqHB0dcejQISQlJSEtLQ3FxcXo3LkzQkJCxA6NiIiISFBc50pDhrCWEJkm3ptERERE+oGVKyHIZEBJCWBrC5ibix0NkehWrFih8b6TJ0/WYiREREREusPKlYZqrA6kpgKBgUBKCtC5szgBksnSx8pV8+bN1d7n5+ejtLQUTk5OAIAHDx6gfv36cHV1xbVr10SIkIiIiEh4nNCCiASXmZmpei1YsAABAQFIT09HQUEBCgoKkJ6ejs6dO+Ojjz4SO1QiIiIiwTC5IiKtmj17NlauXIk2bdqo2tq0aYNPPvkEH374oYiREREREQmLyRURadXt27dRUVFRqV0mkyEvL0+EiIiIiIi0g8kVEWnViy++iHfffRepqamqtpSUFIwfP57TsRMREZFRYXIlhI4dgTt3lFsiUrNhwwa4u7sjKCgIVlZWsLKyQteuXeHm5obPP/9c7PCIiIiIBMOp2IVgYQG4uIgdBZFecnFxwb59+3Dp0iVcvHgRANC2bVu0bt1a5MiIiIiIhMXkSghXrwKRkcAnnwAtW4odDZFeat26NRMqIiIiMmpMroRQWAjs2QPMmSN2JER6RyaTYdOmTUhMTMSdO3cgl8vVPv/5559FioyIiIhIWEyuiEirpkyZgk2bNuHll19Ghw4dIJFIxA6JiIiISCuYXBGRVm3fvh07d+7E4MGDxQ6FiIiISKs4WyARaZWlpSV8fHzEDoOIiIhI65hcCcHTE1i2TLklMhCrV6+Gt7c3rK2tERwcjFOnTml03Pbt2yGRSDB8+HCN9n///ffx6aefQqFQPEe0RERERPqPjwUKwc0NiIoSOwoije3YsQNRUVFISEhAcHAw4uPjMWDAAGRkZMDV1bXa465fv45p06ahd+/eGveVlJSEX375Bfv370
f79u1hYWGh9vnu3bvrfB1ERERE+oSVKyHcvw98841yS2QAli9fjnHjxiEiIgLt2rVDQkIC6tevjw0bNlR7jEwmQ2hoKObOnYsWLVpo3JeTkxNGjBiBPn36wNnZGY6OjmovIiIiImPBypUQMjOB118HUlKABg3EjoaoRuXl5UhJSUF0dLSqzczMDCEhIUhOTq72uHnz5sHV1RVvv/02jh07pnF/GzdufK54iYiIiAwFk6tqlJWVoaysTPVeKpWKGA3Rsz19j1pZWcHKyqrSfnfv3oVMJoObm5tau5ubGy5evFjluZOSkvDFF1/gzJkzdYqtoqICR44cwdWrV/HWW2/B3t4et27dgoODA+zs7Op0TiIiIiJ9w8cCqxEXF6f26JKXl5fYIRHVyMvLS+2ejYuLE+S8RUVFGDNmDNavXw9nZ+daH5+VlYWOHTti2LBhmDhxIvLz8wEAixcvxrRp0wSJkYiIiEgfsHJVjejoaET9bZIKqVTKBIv0Wk5ODhwcHFTvq6paAYCzszPMzc2Rl5en1p6Xlwd3d/dK+1+9ehXXr1/HkCFDVG1yuRwAUK9ePWRkZKBly5bVxjVlyhQEBQXh7NmzaNSokap9xIgRGDdunGYXR0RERGQAmFxVo7pHqqpkYwN06qTcEonEwcFBLbmqjqWlJQIDA5GYmKiaTl0ulyMxMRGTJk2qtH/btm1x7tw5tbYPP/wQRUVF+PTTT5/5pcOxY8dw4sQJWFpaqrV7e3vj5s2bz4yXiIiIyFAwuRKCry+Qmip2FEQai4qKQnh4OIKCgtC1a1fEx8ejpKQEERERAICwsDB4enoiLi4O1tbW6NChg9rxTk5OAFCpvSpyuRwymaxS+40bN2Bvb//8F0NERESkJ5hcEZmgUaNGIT8/HzExMcjNzUVAQAAOHDigmuQiOzsbZmbCDMl86aWXEB8fj3Xr1gEAJBIJiouLERsbi8GDBwvSBxEREZE+kCgUCoXYQRgCqVQKR0dHFBYWVn706o8/gG7dgJMnlY8HEulQjfemHrhx4wYGDBgAhUKBy5cvIygoCJcvX4azszN+/fXXGhctJiIiIjIkrFwJQaEAysuVWyJS06RJE5w9exbbt29HWloaiouL8fbbbyM0NBQ2HKdIRERERoTJFRFpXb169TB69GixwyAiIiLSKiZXRKR1GRkZWLlyJdLT0wEAvr6+mDRpEtq2bStyZERERETC4SLCRKRV3377LTp06ICUlBT4+/vD398fqamp6NixI7799luxwyMiIiISDCe00FCNkwY8fAhcuwa0aMG1rkjn9H1Ci5YtWyI0NBTz5s1Ta4+NjcWXX36Jq1evihQZERERkbBYuRKCjQ3Qvj0TK6Iq3L59G2FhYZXaR48ejdu3b4sQEREREZF2MLkSQlYW8M47yi0Rqenbty+OHTtWqT0pKQm9e/cWISIiIiIi7eCEFkK4dw/44gtgwgSgWTOxoyHSK0OHDsXMmTORkpKCbt26AQBOnjyJb775BnPnzsWPP/6oti8RERGRoeKYKw3VOK4lNRUIDARSUoDOncUJkEyWvo+5MjPTrEAukUggk8m0HA0RERGR9rByRURaJZfLxQ6BiIiISCc45oqIdObRo0dih0BERESkNUyuhODmBsyapdwSkRqZTIaPPvoInp6esLOzw7Vr1wAAs2fPxhdffCFydERERETCYXIlBE9PIC5OuSUiNQsWLMCmTZuwZMkSWFpaqto7dOiAzz//XMTIiIiIiITF5EoIRUXAkSPKLRGp2bJlC9atW4fQ0FCYm5ur2v39/XHx4kURIyMiIiISFpMrIVy+DPTrp9wSkZqbN2/Cx8enUrtcLsfjx49FiIiIiIhIO5hcEZFWtWvXrspFhHft2oVOnTqJEBERERGRdnAqdiLSqpiYGISHh+PmzZuQy+XYvXs3MjIysGXLFvz0009ih0dEREQkGFauiEirhg0bhj179uDw4cOwtbVFTEwM0tPTsWfPHvTv31/s8I
iIiIgEw8qVECwslDMFWliIHQmRXurduzcOHTokdhhEREREWsXkSggdOwI3bogdBRERERERiYjJFREJrkGDBpBIJBrtW1BQoOVoiIiIiHSDyZUQzp0DBg0C9u9XVrGITFx8fLzq53v37mH+/PkYMGAAunfvDgBITk7GwYMHMXv2bJEiJCIiIhIeJ7QQwuPHwM2byi2RgVi9ejW8vb1hbW2N4OBgnDp1qtp9d+/ejaCgIDg5OcHW1hYBAQHYunVrtfuHh4erXsePH8e8efPw9ddfY/LkyZg8eTK+/vprzJs3D0ePHtXGpRERERGJgskVkQnasWMHoqKiEBsbi9TUVPj7+2PAgAG4c+dOlfs3bNgQH3zwAZKTk5GWloaIiAhERETg4MGDz+zr4MGDGDhwYKX2gQMH4vDhw899LURERET6gskVkQlavnw5xo0bh4iICLRr1w4JCQmoX78+NmzYUOX+ffv2xYgRI+Dr64uWLVtiypQp8PPzQ1JS0jP7atSoEX744YdK7T/88AMaNWr03NdCREREpC845orIxJSXlyMlJQXR0dGqNjMzM4SEhCA5OfmZxysUCvz888/IyMjA4sWLn7n/3Llz8c477+DIkSMIDg4GAPz22284cOAA1q9fX/cLISIiItIzTK6qUVZWhrKyMtV7qVRa/c6tWgG//KLcEonk6XvUysoKVlZWlfa7e/cuZDIZ3Nzc1Nrd3Nxw8eLFas9fWFgIT09PlJWVwdzcHJ999plGiwCPHTsWvr6+WLFiBXbv3g0A8PX1RVJSkirZIiIiIjIGTK6qERcXh7lz52q2s7090LevVuMhehYvLy+197GxsZgzZ45g57e3t8eZM2dQXFyMxMREREVFoUWLFuirwb0fHByMbdu2CRYLERERkT5iclWN6OhoREVFqd5LpdJKf7yq3LwJrFoFTJoEeHrqKEIidTk5OXBwcFC9r6pqBQDOzs4wNzdHXl6eWnteXh7c3d2rPb+ZmRl8fHwAAAEBAUhPT0dcXJxGyRURERGRKeCEFtWwsrKCg4OD2qtaeXnAokXKLZFInr5fq0uuLC0tERgYiMTERFWbXC5HYmKiah0qTcjlcrVHZ4mIiIhMHStXRCYoKioK4eHhCAoKQteuXREfH4+SkhJEREQAAMLCwuDp6Ym4uDgAysdkg4KC0LJlS5SVlWHfvn3YunUr1qxZI+ZlEBEREekVJldEJmjUqFHIz89HTEwMcnNzERAQgAMHDqgmucjOzoaZ2V+F7ZKSEkyYMAE3btyAjY0N2rZtiy+//BKjRo0S6xKIiIiI9I5EoVAoxA7CEEilUjg6OqKwsLDyI4KpqUBgIJCSAnTuLE6AZLJqvDeJiIiISGdYuRJCo0bA228rt0SEV199VeN9n0zPTkRERGTomFwJoVkz4PPPxY6CSG84OjqKHQIRERGRzjG5EsLDh8C1a0CLFoCNjdjREIlu48aNYodAREREpHOcil0I6elAhw7KLRERERERmSRWrohI63bt2oWdO3ciOzsb5eXlap+lpqaKFBURERGRsFi5IiKtWrFiBSIiIuDm5oY//vgDXbt2RaNGjXDt2jUMGjRI7PCIiIiIBMPkioi06rPPPsO6deuwcuVKWFpaYsaMGTh06BAmT56MwsJCscMjIiIiEgyTKyFIJIClpXJLRGqys7PRo0cPAICNjQ2KiooAAGPGjMHXX38tZmhEREREgmJyJYROnYCyMuWWiNS4u7ujoKAAANC0aVOcPHkSAJCZmQmuYU5ERETGhMkVEWnVCy+8gB9//BEAEBERgcjISPTv3x+jRo3CiBEjRI6OiIiISDgSBb861ohUKoWjoyMKCwvh4OCg/mF6OhAaCmzbBvj6ihMgmawa7009IJfLIZfLUa+ecnLS7du348SJE2jVqhXeffddWFpaihwhERERkTA4FbsQHj4E/vhDuSUiNWZmZjAz+6tI/sYbb+CNN94QMSIiIiIi7WByRUSCS0tLQ4cOHWBmZo
a0tLQa9/Xz89NRVERERETaxeSKiAQXEBCA3NxcuLq6IiAgABKJpMrJKyQSCWQymQgREhEREQmPyRURCS4zMxMuLi6qn4mIiIhMAZMrITRvDuzcqdwSEZo1a6b6OSsrCz169FBNaPFERUUFTpw4obYvERERkSHjbIEa0vcZ2ch06fu9aW5ujtu3b8PV1VWt/d69e3B1deVjgURERGQ0uM6VEPLygOXLlVsiUqNQKCCRSCq137t3D7a2tiJERERERKQdfCxQCDdvAu+/D/TtC7i5iR0NkV549dVXASgnrRg7diysrKxUn8lkMqSlpaFHjx5ihUdEREQkOCZXRKQVjo6OAJSVK3t7e9jY2Kg+s7S0RLdu3TBu3DixwiMiIiISHJMrItKKjRs3qqZfX7lyJezs7ESOiIiIiEi7OOaKiLRGoVBg27ZtuH37ttihEBEREWkdkyshODoCQ4Yot0QGYvXq1fD29oa1tTWCg4Nx6tSpavddv349evfujQYNGqBBgwYICQmpcf8nzMzM0KpVK9y7d0/I0ImIiIj0EpMrIbRsCfz4o3JLZAB27NiBqKgoxMbGIjU1Ff7+/hgwYADu3LlT5f5HjhzBm2++iV9++QXJycnw8vLCSy+9hJs3bz6zr0WLFmH69On4888/hb4MIiIiIr3Cda40VONaQo8fAw8eAE5OgIWFGOGRCavLOlfBwcHo0qULVq1aBQCQy+Xw8vLCe++9h1mzZj3zeJlMhgYNGmDVqlUICwurcd8GDRqgtLQUFRUVsLS0VJvYAgAKCgo0ipmIiIhI33FCCyGcOwcEBgIpKUDnzmJHQ1Sj8vJypKSkIDo6WtVmZmaGkJAQJCcna3SO0tJSPH78GA0bNnzmvvHx8XUNlYiIiMigMLmqRllZGcrKylTvpVKpiNEQPdvT96iVlZXa2lJP3L17FzKZDG5Prcnm5uaGixcvatTXzJkz4eHhgZCQkGfuGx4ertE5iYiIiAwdx1xVIy4uDo6OjqqXl5eX2CER1cjLy0vtno2Li9NKP4sWLcL27dvx3XffwdraulbHPnr0CFKpVO1FREREZCxYuapGdHQ0oqKiVO+lUmm1CdZnv1zBBAAzdp1F5qlHAIDqRrJV1VzdsLeq99X8vNXtXLvzVv6g2n2raK9uQF9thvpVfd5qru05Y6g2KgGu7Xl/7292bYoJfX2q6RXIyclRG3NVVdUKAJydnWFubo68vDy19ry8PLi7u1d7fgD4+OOPsWjRIhw+fBh+fn417vtESUkJZs6ciZ07d1Y5a6BMJtPoPERERET6jslVNap7pKoq1++VAADO35LivPy+NsMiE1ZY+rjGzx0cHDSa0MLS0hKBgYFITEzE8OHDASgntEhMTMSkSZOqPW7JkiVYsGABDh48iKCgII3jnjFjBn755ResWbMGY8aMwerVq3Hz5k2sXbsWixYt0vg8RERERPqOswVqqKYZ2ZIv3UHxvQeosKkPmJur2iWSqs5UZWOV+1a1p6Tqk1azr2ZtyuM1C6Caw6uMS9OYqutf099JdR887zmrvKZaHV9lq4b7Vd7T1cEank42lfary2yBO3bsQHh4ONauXYuuXbsiPj4eO3fuxMWLF+Hm5oawsDB4enqqHi1cvHgxYmJi8NVXX6Fnz56q89jZ2cHOzq7Gvpo2bYotW7agb9++cHBwQGpqKnx8fLB161Z8/fXX2Ldvn0YxExEREek7Vq4E0L21KwBXscMg0tioUaOQn5+PmJgY5ObmIiAgAAcOHFBNcpGdnQ0zs7+GZK5Zswbl5eV47bXX1M4TGxuLOXPm1NhXQUEBWrRoAUBZXXsy9XqvXr0wfvx4Aa+KiIiISFxMroRw+TIwaRKwahXQqpXY0RBpZNKkSdU+BnjkyBG199evX69zPy1atEBmZiaaNm2Ktm3bYufOnejatSv27NkDJyenOp+XiIiISN9wtkAhFBUB//2vcktEaiIiInD27FkAwKxZs7B69WpYW1sjMjIS06
dPFzk6IiIiIuGwckVEWhUZGan6OSQkBBcvXkRKSgp8fHw0nnGQiIiIyBCwckVEWiGXy7F48WL07NkTXbp0waxZs/Dw4UM0a9YMr776qsknVt7e3oiPj9fb8xEREVHtMbkiIq1YsGAB/vOf/8DOzg6enp749NNPMXHiRLHDIiIiItIaJldC8PJSTmZRzSLDRKZoy5Yt+Oyzz3Dw4EF8//332LNnD7Zt2wa5XC52aERERERaweRKCC4uwMSJyi0RAVBO5z548GDV+5CQEEgkEty6dUsr/VX1WFxAQADmzJkDhUKBOXPmoGnTprCysoKHhwcmT56s2q+srAzTpk2Dp6cnbG1tERwcXGnGxOps2rQJTk5O+Omnn9CmTRvUr18fr732GkpLS7F582Z4e3ujQYMGmDx5MmQyWbXnyc7OxrBhw2BnZwcHBwe8/vrryMvLU9tnz5496NKlC6ytreHs7IwRI0ZUe77PP/8cTk5OSExM1Og6iIiI6PlxQgshFBQA+/YBgwcDDRuKHQ2RXqioqIC1tbVam4WFBR4/fqzzWL799lt88skn2L59O9q3b4/c3FzVDIaAclr6CxcuYPv27fDw8MB3332HgQMH4ty5c2ilwfIKpaWlWLFiBbZv346ioiK8+uqrGDFiBJycnLBv3z5cu3YNI0eORM+ePTFq1KhKx8vlclVidfToUVRUVGDixIkYNWqUKsnbu3cvRowYgQ8++ABbtmxBeXl5tQswL1myBEuWLMF///tfdO3atW6/NCIiIqo1JldCuH4dGDMGSElhckX0PwqFAmPHjoWVlZWq7dGjR/j3v/8NW1tbVdvu3bu1Hkt2djbc3d0REhICCwsLNG3aVJV0ZGdnY+PGjcjOzoaHhwcAYNq0aThw4AA2btyIhQsXPvP8jx8/xpo1a9CyZUsAwGuvvYatW7ciLy8PdnZ2aNeuHfr164dffvmlyuQqMTER586dQ2ZmJrz+93jxli1b0L59e5w+fRpdunTBggUL8MYbb2Du3Lmq4/z9/Suda+bMmdi6dSuOHj2K9u3b1/6XRURERHXG5IqItCI8PLxS2+jRo0WIBPjnP/+J+Ph4tGjRAgMHDsTgwYMxZMgQ1KtXD+fOnYNMJkPr1q3VjikrK0OjRo00On/9+vVViRUAuLm5wdvbG3Z2dmptd+7cqfL49PR0eHl5qRIrAGjXrh2cnJyQnp6OLl264MyZMxg3blyNcSxbtgwlJSX4/fff0aJFC41iJyIiIuEwuSIirdi4caNO+zMzM4NCoVBre/IIopeXFzIyMnD48GEcOnQIEyZMwNKlS3H06FEUFxfD3NwcKSkpMDc3Vzv+78lRTSwsLNTeSySSKtueZzIPGxubZ+7Tu3dv7N27Fzt37sSsWbPq3BcRERHVDSe0ICKj4OLigtu3b6veS6VSZGZmqt7b2NhgyJAhWLFiBY4cOYLk5GScO3cOnTp1gkwmw507d+Dj46P2cnd310nsvr6+yMnJQU5OjqrtwoULePDgAdq1awcA8PPze+bkFF27dsX+/fuxcOFCfPzxx1qNmYiIiCpj5UoItrZAt27KLRGJ4oUXXsCmTZswZMgQODk5ISYmRlWJ2rRpE2QyGYKDg1G/fn18+eWXsLGxQbNmzdCoUSOEhoYiLCwMy5YtQ6dOnZCfn4/ExET4+fnh5Zdf1nrsISEh6NixI0JDQxEfH4+KigpMmDABffr0QVBQEAAgNjYWL774Ilq2bIk33ngDFRUV2LdvH2bOnKl2rh49emDfvn0YNGgQ6tWrh6lTp2o9fiIiIlJi5UoIbdoAycnKLRGJIjo6Gn369MErr7yCl19+GcOHD1eNg3JycsL69evRs2dP+Pn54fDhw9izZ49qTNXGjRsRFhaG999/H23atMHw4cNx+vRpNG3aVCexSyQS/PDDD2jQoAH+8Y9/ICQkBC1atMCOHTtU+/Tt2xfffPMNfvzxRwQEBOCFF17AqVOnqjxfr169sHfvXnz44YdYuX
KlTq6BiIiIAIni6UEKVCWpVApHR0cUFhbCwcFB7HCIVHhvEhEREekHVq6EkJoKSCTKLRERERERmSQmV0RENRg0aBDs7OyqfGmyBhYRERGZDk5oQURUg88//xwPHz6s8rOGXDSciIiI/obJFRFRDTw9PcUOgYiIiAwEHwskIiIiIiISAJMrIbRrB1y+rNwSEZHGNm3aBCcnJ7HDICIiEgSTKyFYWwM+PsotkR6TSCQ1vubMmSN2iGSgxo4dW+U9deXKFbFDIyIi0hmOuRJCZiYwezbw0UdA8+ZiR0NUrdu3b6t+3rFjB2JiYpCRkaFqs7OzU/2sUCggk8lQrx7/M0GaGThwIDZu3KjW5uLiIlI0REREusfKlRDu3we2bVNuifSYu7u76uXo6AiJRKJ6f/HiRdjb22P//v0IDAyElZUVkpKSMHbsWAwfPlztPFOnTkXfvn1V7+VyOeLi4tC8eXPY2NjA398fu3bt0u3FkeisrKzU7jF3d3d8+umn6NixI2xtbeHl5YUJEyaguLi42nOcPXsW/fr1g729PRwcHBAYGIjff/9d9XlSUhJ69+4NGxsbeHl5YfLkySgpKdHF5RERET0TkysiUjNr1iwsWrQI6enp8PPz0+iYuLg4bNmyBQkJCTh//jwiIyMxevRoHD16VMvRkr4zMzPDihUrcP78eWzevBk///wzZsyYUe3+oaGhaNKkCU6fPo2UlBTMmjULFhYWAICrV69i4MCBGDlyJNLS0rBjxw4kJSVh0qRJurocIiKiGvF5HyJSM2/ePPTv31/j/cvKyrBw4UIcPnwY3bt3BwC0aNECSUlJWLt2Lfr06aOtUEnP/PTTT2qPlg4aNAjffPON6r23tzfmz5+Pf//73/jss8+qPEd2djamT5+Otm3bAgBatWql+iwuLg6hoaGYOnWq6rMVK1agT58+WLNmDaw57pWIiETG5EpDCoUCACCVSit/+OQRl+JioKrPibToyT355B59XkFBQbXa/8qVKygtLa2UkJWXl6NTp06CxESGoV+/flizZo3qva2tLQ4fPoy4uDhcvHgRUqkUFRUVePToEUpLS1G/fv1K54iKisI777yDrVu3IiQkBP/85z/RsmVLAMpHBtPS0rBt2zbV/gqFAnK5HJmZmfD19dX+RRIREdWAyZWGioqKAABeXl7V78Rv6ElERUVFcHR0fO7z2Nraqr03MzOrlLg9fvxY9fOT8TN79+6ttOCulZXVc8dDhsPW1hY+Pj6q99evX8crr7yC8ePHY8GCBWjYsCGSkpLw9ttvo7y8vMrkas6cOXjrrbewd+9e7N+/H7Gxsdi+fTtGjBiB4uJivPvuu5g8eXKl45o2barVayMiItIEkysNeXh4ICcnB/b29pBIJGKHQ6SiUChQVFQEDw8PrZzfxcUFf/75p1rbmTNnVONg2rVrBysrK2RnZ/MRQFKTkpICuVyOZcuWwcxMOcR3586dzzyudevWaN26NSIjI/Hmm29i48aNGDFiBDp37owLFy6oJXBERET6hMmVhszMzNCkSROxwyCqkhAVq+q88MILWLp0KbZs2YLu3bvjyy+/xJ9//ql65M/e3h7Tpk1DZGQk5HI5evXqhcLCQhw/fhwODg4IDw/XWmyk33x8fPD48WOsXLkSQ4YMwfHjx5GQkFDt/g8fPsT06dPx2muvoXnz5rhx4wZOnz6NkSNHAgBmzpyJbt26YdKkSXjnnXdga2uLCxcu4NChQ1i1apWuLouIiKhanC2QiGo0YMAAzJ49GzNmzECXLl1QVFSEsLAwtX0++ugjzJ49G3FxcfD19cXAgQOxd+9eNOe6bybN398fy5cvx+LFi9GhQwds27YNcXFx1e5vbm6Oe/fuISwsDK1bt8brr7+OQYMGYe7cuQAAPz8/HD16FJcuXULv3r3RqVMnxMTEaK1qS0REVFsShVCj4ImIiIiIiEwYK1dEREREREQCYHJFREREREQkACZXREREREREAmByRURERE
REJAAmV0RERERERAJgckVERERERCQAJldEVKOysjLMmTMHZWVlYodCRor3GBERGQuuc0VENZJKpXB0dERhYSEcHBzEDoeMEO8xIiIyFqxcERERERERCYDJFRERERERkQDqiR2AoZDL5bh16xbs7e0hkUjEDodIRaFQoKioCB4eHjAzq9v3JTXd31KpVG1LJLSa7jEh7m8iIiJd4ZgrDd24cQNeXl5ih0FUrZycHDRp0qROx/L+Jn1X2/t79erVWLp0KXJzc+Hv74+VK1eia9eu1e7/4MEDfPDBB9i9ezcKCgrQrFkzxMfHY/DgwUKET0REJoKVKw3Z29urvffZGv1c5ysrsHmu45/F4p72/6e1uqvd89e/p/283yb/sdb7sMot1tq5Ey98rPr56Xu0Np4cm5OTU3lCgZwcID4emDoVYAJGOiaVSuHl5VWr+3vHjh2IiopCQkICgoODER8fjwEDBiAjIwOurq6V9i8vL0f//v3h6uqKXbt2wdPTE1lZWXBychLwSoiIyBSwcqWhJ7NZAUDTL2fBxub5kqNH97SbXAGAxV3tJljW+Vo9PQCg/l3t357187SbYFndLtLauW/cuIHz9zcDwHPNtFbjbG2pqUBgIJCSAnTu/LwhE9VKXWYSDA4ORpcuXbBq1SoAysdevby88N5772HWrFmV9k9ISMDSpUtx8eJFWFhYCBo/ERGZFj7AXks+W6OfO7ECAOtGDwWIRlyPXMSOwDCUNa57RelZmjRpghfbTav1cWVlZZBKpWovIn329P1a3ZpY5eXlSElJQUhIiKrNzMwMISEhSE5OrvKYH3/8Ed27d8fEiRPh5uaGDh06YOHChZDJZFq5FiIiMl5MrozYY+cKsUN4bqXO2p88pNTN9L6pjouLg6Ojo+rF8Vak77y8vNTu2bi4uCr3u3v3LmQyGdzc3NTa3dzckJubW+Ux165dw65duyCTybBv3z7Mnj0by5Ytw/z58wW/DiIiMm4ccyUi60YPdfJ4oDY9ctH+44GlzhKdPB6oTWWN7bX6eGBtRUdHIyoqSvX+ybgWIn319HhAKysrwc4tl8vh6uqKdevWwdzcHIGBgbh58yaWLl2K2NhYwfohIiLjx+TKyD12rtD62CtjUOpmofWxV/rEyspK8z9OXV2ByEjllkgkDg4OGo25cnZ2hrm5OfLy8tTa8/Ly4O7uXuUxjRs3hoWFBczNzVVtvr6+yM3NRXl5OSwtLZ8veCIiMhl8LFBkHHulGV08Hqht2hx7pVVNmgDLlyu3RHrO0tISgYGBSExMVLXJ5XIkJiaie/fuVR7Ts2dPXLlyBXK5XNV26dIlNG7cmIkVERHVCpMrE2AMY690wRTHXmmkuBhITlZuiQxAVFQU1q9fj82bNyM9PR3jx49HSUkJIiIiAABhYWGIjv5rOY3x48ejoKAAU6ZMwaVLl7B3714sXLgQEydOFOsSiIjIQPF5MT3AsVea4dgrkVy6BPTowanYyWCMGjUK+fn5iImJQW5uLgICAnDgwAHVJBfZ2dkwM/vru0UvLy8cPHgQkZGR8PPzg6enJ6ZMmYKZM2eKdQlERGSgmFyZCF2MvdJFgqVtpjb2ishYTZo0CZMmTarysyNHjlRq6969O06ePKnlqIiIyNjxsUA9YQxjr3SBY6+IiIiISF8xuTIhuhh7ZQwLC+ti7BUTLCIiIiLjw+RKj7B6pRljqF4ZlHr1AGdn5ZaIiIiIqsXkysSweqUZVq/+xs8PyM9XbomIiIioWkyu9IwuqlfGMDU7q1dEREREpG+YXNWSd6MCsUMwCKxeacYgqlfnzwM+PsotEREREVWLyZUeYvVKM7qoXnFhYQBlZcDVq8otEREREVWLyVUdNHe+J3YIBsEYqle6YBDVKyKqUUVFBQ4fPoy1a9eiqEi5UPitW7dQXFwscmRERKRLnP5LT1k3eohH92y02ocuFhbWtlJnCerfVWi3Dy4sTEQ1yMrKwsCBA5GdnY2ysjL0798f9v
b2WLx4McrKypCQkCB2iEREpCOsXNWRLqpXxjA1O6tXmmH1ishwTZkyBUFBQbh//z5sbP76UmzEiBFITEwUMTIiItI1wy5b0HNj9UrDPky5euXjAxw4oNwSUSXHjh3DiRMnYGlpqdbu7e2NmzdvihQVERGJgZWr58DqlWZ0Ub0yhqnZ9bZ65eAADBig3BJRJXK5HDKZrFL7jRs3YG+vp/+uiYhIK5hckVHMHKgLJjtz4O3bwJw5yi0RVfLSSy8hPj5e9V4ikaC4uBixsbEYPHiweIEREZHOMbl6TqxeaYbVK83oZfXq9m1g7lwmV0TVWLZsGY4fP4527drh0aNHeOutt1SPBC5evFjs8IiISIcMe7CNnmjufA+ZdxuJHcZzMYaxV7pg0mOviKhKTZo0wdmzZ7Fjxw6cPXsWxcXFePvttxEaGqo2wQURERk//jVtIHQxNbu2PXIBrPO124cuJrfQtrLG9rC6XSR2GERUC/Xq1UNoaChCQ0PFDoWIiETExwIFYgwLC+ti7JUxTM1usmOviKhKcXFx2LBhQ6X2DRs28LFAIiITw+TKgBjD2Ctd4NgrgTVoAISGKrdEVMnatWvRtm3bSu3t27fnAsJERCaGyZWAWL3SDKtXBqZ5c+DLL5VbIqokNzcXjRs3rtTu4uKC25wIhojIpDC5MjCsXmmG1SsBPXoEXLmi3BJRJV5eXjh+/Hil9uPHj8PDw0OEiIiISCxMrgTG6pVmWL3SjF4kWBcuAK1aKbdEVMm4ceMwdepUbNy4EVlZWcjKysKGDRsQGRmJcePGiR0eERHpEGcLNEDGMHOgLhjDzIFEpP+mT5+Oe/fuYcKECSgvLwcAWFtbY+bMmYiOjhY5OiIi0iVWrrSA1SvNsHqlGb2oXhEZmNWrV8Pb2xvW1tYIDg7GqVOnqt1306ZNkEgkai9ra2uN+5JIJFi8eDHy8/Nx8uRJnD17FgUFBYiJiRHiUoiIyIAwuTJQuhh7pYsES9uMYewVEdXOjh07EBUVhdjYWKSmpsLf3x8DBgzAnTt3qj3GwcEBt2/fVr2ysrJq3a+dnR26dOmCDh06wMrK6nkugYiIDBSTKy0xhuqVLrB6pRlWr4g0t3z5cowbNw4RERFo164dEhISUL9+/SrXonpCIpHA3d1d9XJzc9O4v5KSEsyePRs9evSAj48PWrRoofYiIiLTwTFXBkwXY68eO1fA4q5h3ya6GHtV6maB+nmPtdqHkMrKylBWVqZ6L5VKq9+5c2dAwbFrJK6n71ErK6sqq0Pl5eVISUlRG+tkZmaGkJAQJCcnV3v+4uJiNGvWDHK5HJ07d8bChQvRvn17jWJ75513cPToUYwZMwaNGzeGRMKKORGRqTLsv5r1XHPne8i820jsMPTeIxfAOl/sKPRfWWN7WN0uEuRccXFxmDt3riDnItIFLy8vtfexsbGYM2dOpf3u3r0LmUxWqfLk5uaGixcvVnnuNm3aYMOGDfDz80NhYSE+/vhj9OjRA+fPn0eTJk2eGdv+/fuxd+9e9OzZU/MLIiIio8THAg0cx15pRhdjrwxpYeHo6GgUFhaqXjk5OdXvnJEBdO+u3BKJJCcnR+2eFXIWvu7duyMsLAwBAQHo06cPdu/eDRcXF6xdu1aj4xs0aICGDRsKFg8RERkuJldapouxV8awsLAxjL3SBaHGXllZWcHBwUHtVa2SEuDkSeWWSCRP36/VTRjh7OwMc3Nz5OXlqbXn5eXB3d1do74sLCzQqVMnXLlyRaP9P/roI8TExKC0tFSj/YmIyHgxuSKNsHqlYR8GVL0iMkaWlpYIDAxEYmKiqk0ulyMxMRHdu3fX6BwymQznzp1D48aNNdp/2bJlOHjwINzc3NCxY0d07txZ7UVERKaDY650QBdjr4xhYWFdjL0yhoWFhRx7RWSMoqKiEB4ejqCgIHTt2hXx8fEoKSlBREQEACAsLAyenp6Ii4sDAMybNw/dunWDj48PHjx4gKVLlyIrKwvvvP
OORv0NHz5cW5dCREQGhskVacwYZg7UBUObOZDI2IwaNQr5+fmIiYlBbm4uAgICcODAAdUkF9nZ2TAz++vBjfv372PcuHHIzc1FgwYNEBgYiBMnTqBdu3Ya9RcbG6uV6yAiIsMjUSg4x7ImpFIpHB0dEbLvXdSzrdvikLqYOVDb1StdJFe6mDlQ29UrXSRXT6pXFbIyJF74GIWFhTWPnarBk/u7ynMUFAD79gGDBwMctE86VuO9qUcePHiAXbt24erVq5g+fToaNmyI1NRUuLm5wdPTU+zwiIhIR1iG0CFjmJqd1SvNGFX1qmFDYPRosaMg0ltpaWkICQmBo6Mjrl+/jnHjxqFhw4bYvXs3srOzsWXLFrFDJCIiHeGEFkaGMwdqRheTW2ibUDMHPlN+PrB6tXJLRJVERUVh7NixuHz5MqytrVXtgwcPxq+//ipiZEREpGtMrnRMF1Oza5suZg40hqnZjWbmwJwcYNIk5ZaIKjl9+jTefffdSu2enp7Izc0VISIiIhILkysjZAzVK11g9YqIhGBlZQWpVFqp/dKlS3BxMYJvioiISGNMrkTA6pVmWL0iIkMwdOhQzJs3D48fK8dZSiQSZGdnY+bMmRg5cqTI0RERkS4xuTJSrF5pxiiqV+52YodAZNKWLVuG4uJiuLq64uHDh+jTpw98fHxgb2+PBQsWiB0eERHpEKd9EwlnDtSMLhYW1jaDnznQ3h546SXllogqcXR0xKFDh5CUlIS0tDQUFxejc+fOCAkJETs0IiLSMZNJrn755Rf069evys9Wr16NiRMn6jgi7bNu9FDr614Zw9Tspc4Sra97ZdBatQIOHhQ7CiK916tXL/Tq1UvsMIiISESG/VdxLbz66qs4fPgwAgMD1do//fRTzJ49W5TkyhiqV7rA6pXIZDKgpASwtQXMzcWOhkgvrFixQuN9J0+erMVIiIhIn5hMcrV06VIMGjQIv/76K9q2bQtA+Zz8vHnzsHfvXpGj0x5WrzTD6lUNzp4FAgOBlBSgc2exoyHSC5988ona+/z8fJSWlsLJyQkA8ODBA9SvXx+urq5MroiITIhh/0VcC++88w4KCgoQEhKCpKQk7NixAwsXLsS+ffvQs2dP0eJi9UozrF4RkT7JzMxU/fzVV1/hs88+wxdffIE2bdoAADIyMjBu3Lgq178iIiLjZTLJFQDMmDED9+7dQ1BQEGQyGQ4ePIhu3bqJHZbWsXqlGV1Ur5hgERmf2bNnY9euXarECgDatGmDTz75BK+99hpCQ0NFjI6IiHTJsP8afoaqnon39PRE/fr18Y9//AOnTp3CqVOnAIj7TDyrV5oxhuoVERmf27dvo6Ki8tp/MpkMeXl5IkRERERiMerk6uln4p8wNzfH8ePHcfz4cQDKBR+N/Zl4Vq80w+oVEdXWiy++iHfffReff/45Ov9vXGJKSgrGjx/P6diJiEyMYf8l/Ax/fyZe3+mieqWLBEvbWL0SQceOwJ07wP8G6hORug0bNiA8PBxBQUGwsLAAAFRUVGDAgAH4/PPPRY6OiIh0yaiTK9I9Vq807MOQqlcWFoCLi9hREOktFxcX7Nu3D5cuXcLFixcBAG3btkXr1q1FjoyIiHTNTOwAdGXkyJFYvHhxpfYlS5bgn//8pwgRVdbc+Z7W+7Bu9FDrfWjbIx38nV/qLNF+J4bi6lVg6FDlloiq1bp1awwdOhRDhw5lYkVEZKIMu8RQC7/++ivmzJlTqX3QoEFYtmyZ7gMyYsZQvdIFg6leFRYCe/YAVfz7ISLlxBWbNm1CYmIi7ty5A7lcrvb5zz//LFJkRESkaybzF3BxcTEsLS0rtVtYWEAqlYoQUdU49kozuhh7xYWFiUgTU6ZMwaZNm/Dyyy+jQ4cOkEhY+SYiMlUmk1x17NgRO3bsQExMjFr79u3b0a5dO43P084hF5dkzYQOT40xTM3O6pVmDKZ6RUTV2r59O3bu3InBgweLHQoREYnMZP76nT17Nl599V
VcvXoVL7zwAgAgMTERX3/9Nb755huRo9M9Vq80w+oVET2LpaUlfHx8xA6DiIj0gMlMaDFkyBB8//33uHLlCiZMmID3338fN27cwOHDhzF8+PBanauj4y3tBPk3upjcQtseO1deVFNoupjcQttK3SzEDqFmnp7AsmXKLRFV8v777+PTTz+FQsEvYoiITJ3JVK4A4OWXX8bLL78sdhh6wxiqV7pg8tUrNzcgKkrsKIhqZfXq1Vi6dClyc3Ph7++PlStXomvXrs88bvv27XjzzTcxbNgwfP/99xr1lZSUhF9++QX79+9H+/btVWtdPbF79+66XAIRERkgk0quACAlJQXp6ekAgPbt26NTp051Ok9Hx1s4V+ghZGiVcOyVZoxhYWG9Hnt1/z5w+DAQEgI0aCB2NETPtGPHDkRFRSEhIQHBwcGIj4/HgAEDkJGRAVdX12qPu379OqZNm4bevXvXqj8nJyeMGDHiecMmIiIjYDLJ1Z07d/DGG2/gyJEjcHJyAgA8ePAA/fr1w/bt2+FSh0VSdZFgaRurV5ox6epVZibw+utASgqTKzIIy5cvx7hx4xAREQEASEhIwN69e7FhwwbMmjWrymNkMhlCQ0Mxd+5cHDt2DA8ePNC4v40bNwoRNhERGQGTGXP13nvvoaioCOfPn0dBQQEKCgrw559/QiqVYvLkyWKHVy2OvdIMx17VTllZGaRSqdqLSJ89fb+WlZVVuV95eTlSUlIQEhKiajMzM0NISAiSk5OrPf+8efPg6uqKt99+u07xVVRU4PDhw1i7di2KiooAALdu3UJxcXGdzkdERIbJZJKrAwcO4LPPPoOvr6+qrV27dli9ejX2799f5/PqYnILbbNu9FDrfegiwdK2UmfjWbsmLi4Ojo6OqpeXl5fYIRHVyMvLS+2ejYuLq3K/u3fvQiaTwc3NTa3dzc0Nubm5VR6TlJSEL774AuvXr69TbFlZWejYsSOGDRuGiRMnIj9f+Zzy4sWLMW3atDqdk4iIDJPJJFdyubzSIGNAuYiwXC4XISLNGUP1ShdYvdJcdHQ0CgsLVa+cnByd9EtUVzk5OWr3bHR0tCDnLSoqwpgxY7B+/Xo4OzvX6RxTpkxBUFAQ7t+/Dxubvx6zHjFiBBITEwWJk4iIDIPJjLl64YUXMGXKFHz99dfw8FCOk7p58yYiIyPx4osvPte5OfZKM8awsLCxjL2ysrKClZWVZjvb2ACdOim3RCJxcHCAg4PDM/dzdnaGubk58vLy1Nrz8vLg7u5eaf+rV6/i+vXrGDJkiKrtyRdu9erVQ0ZGBlq2bFljn8eOHcOJEydgaWmp1u7t7Y2bN28+M2YiIjIeJlO5WrVqFaRSKby9vdGyZUu0bNkSzZs3h1QqxcqVK8UO75lYvdIMq1da4OsLpKYqt0R6ztLSEoGBgWoVI7lcjsTERHTv3r3S/m3btsW5c+dw5swZ1Wvo0KHo168fzpw5o9Ejs3K5HDKZrFL7jRs3YG9v/3wXREREBsWwywi14OXlhdTUVBw+fBgXL14EAPj6+qoNen4erF5phtUrDfvQ56nZifRcVFQUwsPDERQUhK5duyI+Ph4lJSWq2QPDwsLg6emJuLg4WFtbo0OHDmrHP5lR9un26rz00kuIj4/HunXrAAASiQTFxcWIjY3F4MGDhbswIiLSe4b9V24tSSQS9O/fH/379xc7lDrRxbpXxjA1uzGse6VX/vgD6NYNOHlS+XggkZ4bNWoU8vPzERMTg9zcXAQEBODAgQOqSS6ys7NhZibcgxvLli3DgAED0K5dOzx69AhvvfUWLl++DGdnZ3z99deC9UNERPpPolAoDH8ASTVWrFih8b7Pmo5dKpXC0dERk5OGwcqu+se2tF290sWiwrpIrrRdvdJFcqWLsVeaVK8qKh4h6ec5KCws1GhMSlWe3N9VniM1FQgMVK5z1blznc5PVFc13pt6pKKiAtu3b0daWhqKi4vRuXNnhIaGqk1wQURExs+oK1
effPKJRvtJJBK9Xuvq71i90gyrV0SkS/Xq1cPo0aPFDoOIiERm1MlVZmamzvs0hrFXusCxVxr2wbFXRAYhIyMDK1euRHp6OgDlmN5Jkyahbdu2IkdGRES6ZDKzBRoTXcwcqIuFhbVNFzMHGtPCwkRUN99++y06dOiAlJQU+Pv7w9/fH6mpqejYsSO+/fZbscMjIiIdMuzSQS0oFArs2rULv/zyC+7cuVNp4eDdu3cL1herV5oxhuqVLohevfL1Bf78E2jRQrwYiPTYjBkzEB0djXnz5qm1x8bGYsaMGRg5cqRIkRERka6ZTOVq6tSpGDNmDDIzM2FnZwdHR0e1l6Fh9UozrF4JwMYGaN+eiwgTVeP27dsICwur1D569Gjcvn1bhIiIiEgsJlM22Lp1K3bv3q2zNUd0Ub3SxeQW2sbqlWZErV5lZQEffQTMng00ayZODER6rG/fvjh27Bh8fHzU2pOSktC7d2+RoiIiIjGYzF+1jo6OaMHHmmqNMwdqRheTW4jm3j3giy+ACROYXBFVYejQoZg5cyZSUlLQrVs3AMDJkyfxzTffYO7cufjxxx/V9iUiIuNl1Otc/d3mzZtx4MABbNiwoU7rjmi6ztXTdDH2StvVK2NY9wrQfoIl1rpXXOeKjJkhrHOl6YLEEokEMplMy9EQEZGYTKZy9frrr+Prr7+Gq6srvL29YWGhniClpqaKFJn+M4bqlS4YdfWKiKr19ARJRERkukwmuQoPD0dKSgpGjx4NNzc3SCS6mYSAY680o4uxV8awsLDoMwcSUY0ePXoEa2trscMgIiKRmExytXfvXhw8eBC9evXSed/GMDU7q1eaMcrqlZsbMGuWcktElchkMixcuBAJCQnIy8vDpUuX0KJFC8yePRve3t54++23xQ6RiIh0xGSmYvfy8tLb5/WFoIup2bXtsXOF1vvQxdTs2lbqpvmYP0F4egJxccotEVWyYMECbNq0CUuWLIGlpaWqvUOHDvj8889FjIyIiHTNZJKrZcuWYcaMGbh+/boo/Xd0vCVKv0LSxbpXukiwtM3o1r0qKgKOHFFuiaiSLVu2YN26dQgNDYW5ubmq3d/fHxcvXhQxMiIi0jWTeSxw9OjRKC0tRcuWLVG/fv1KE1oUFBSIFJlwjGHslS5w7FUtXb4M9OvH2QKJqnHz5s1Ka1wByokuHj/mGEkiIlNiMslVfHy82CFw7JWGjGFhYaMce0VEVWrXrh2OHTuGZk+tA7dr1y506tRJpKiIiEgMhv0XbC2Eh4drtN+iRYvw73//G05OTtoNSEtYvdIMq1dEJJSYmBiEh4fj5s2bkMvl2L17NzIyMrBlyxb89NNPYodHREQ6ZDJjrjS1cOFCrT4iyLFXmuHYKw370PXkFkRUybBhw7Bnzx4cPnwYtra2iImJQXp6Ovbs2YP+/fuLHR4REemQyVSuNKVQGP6jXLqoXhnD1OzGUL3SCQsL5UyBFkzkiKrTu3dvHDp0SOwwiIhIZKxcicAYqle6wOqVZh66aDnp6dgRuHFDuSUiIiKiarFyZaRYvdIMq1dEVBcNGjSARKLZlyfGMBstERFphsmVSIxh5kBd4MyBeuDcOWDQIGD/flaviP7n7zPQ3rt3D/Pnz8eAAQPQvXt3AEBycjIOHjyI2bNnixQhERGJgY8FGrHmzve03ocuJrfQtkcu2u/DoBcWfvwYuHlTuSUyEKtXr4a3tzesra0RHByMU6dOVbvv7t27ERQUBCcnJ9ja2iIgIABbt26t8fzh4eGq1/HjxzFv3jx8/fXXmDx5MiZPnoyvv/4a8+bNw9GjR4W+NCIi0mNMrp7Su3dv2Njo5lE3XYy90kWCpW3GMPaKiHRnx44diIqKQmxsLFJTU+Hv748BAwbgzp07Ve7fsGFDfPDBB0hOTkZaWhoiIiIQERGBgwcPatTfwYMHMXDgwErtAwcOxOHDh5/rWoiIyLCYVHJ19epVfPjhh3jzzTdV/ye7f/9+nD9/XrXPvn
370LhxY7FCNEisXmnGoKtXRAZk+fLlGDduHCIiItCuXTskJCSgfv362LBhQ5X79+3bFyNGjICvry9atmyJKVOmwM/PD0lJSRr116hRI/zwww+V2n/44Qc0asR1B4mITInJJFdHjx5Fx44d8dtvv2H37t0oLi4GAJw9exaxsbGixcXqlWZYvSIiTZSXlyMlJQUhISGqNjMzM4SEhCA5OfmZxysUCiQmJiIjIwP/+Mc/NOpz7ty5mDlzJoYMGYL58+dj/vz5GDJkCGbNmoW5c+fW+VqIiMjwmExyNWvWLMyfPx+HDh2CpaWlqv2FF17AyZMnRYzMOLB6pRl9qV6VlZVBKpWqvarVqhXwyy/KLZFInr5fy8rKqtzv7t27kMlkcHNzU2t3c3NDbm5utecvLCyEnZ0dLC0t8fLLL2PlypUaLwA8duxYHD9+HA4ODti9ezd2794NBwcHJCUlYezYsRpfIxERGT7DnoatFs6dO4evvvqqUrurqyvu3r0rQkR/0cXMgbqYml3bdDFzoKlMzR4XF6f5N+r29kDfvlqNh+hZvLy81N7HxsZizpw5gp3f3t4eZ86cQXFxMRITExEVFYUWLVqgr4b3fnBwMLZt2yZYPEREZJhMJrlycnLC7du30bx5c7X2P/74A56eniJFZVyMYd0rXdCHqdmjo6MRFRWlei+VSiv98apy8yawahUwaRLAfyskkpycHDg4OKjeW1lZVbmfs7MzzM3NkZeXp9ael5cHd3f3as9vZmYGHx8fAEBAQADS09MRFxencXJFREQEmNBjgW+88QZmzpyJ3NxcSCQSyOVyHD9+HNOmTUNYWJjY4XHslYZ0MfZKF48His3KygoODg5qr2rl5QGLFim3RCJ5+n6tLrmytLREYGAgEhMTVW1yuRyJiYmqNag0IZfLq330kIiIqDomk1wtXLgQbdu2hZeXF4qLi9GuXTv84x//QI8ePfDhhx+KHR4A3SRY2mYMY690QV/GXhEZo6ioKKxfvx6bN29Geno6xo8fj5KSEkRERAAAwsLCEB0drdo/Li4Ohw4dwrVr15Ceno5ly5Zh69atGD16tFiXQEREBspkHgu0tLTE+vXrERMTg3PnzqG4uBidOnVCKxMbpM+xV5oxlbFXRMZo1KhRyM/PR0xMDHJzcxEQEIADBw6oJrnIzs6Gmdlf3y2WlJRgwoQJuHHjBmxsbNC2bVt8+eWXGDVqlFiXQEREBkqiUCjEHfwhEplMhnPnzqFZs2Zo0KDBM/eXSqVwdHTE5KRhsLKz0Gps2p7cQhfJlS7GXmk7wdJFciXE2CtZ+SOk7PgAhYWFNT/eV4Mn93eV50hNBQIDgZQUoHPn546XqDZqvDeJiIj0jMlUrqZOnYqOHTvi7bffhkwmQ58+fXDixAnUr18fP/30k0kNWjaG6pUusHr1P40aAW+/rdwSEQDg1Vdf1Xjf3bt3azESIiLSJyYz5mrXrl3w9/cHAOzZswfXrl3DxYsXERkZiQ8++EDj83SxvaatEFU49kozxrCwsEGMvWrWDPj8c+WWiAAAjo6OGr+IiMh0mEzl6u7du6ppePft24fXX38drVu3xv/93//h008/rdW5utldwcliH22EqTOsXmmG1SsADx8C164BLVoANpxqnwgANm7cKHYIRESkh0ymcuXm5oYLFy5AJpPhwIED6N+/PwCgtLQU5ubmIkdXGatXmmH1SgfS04EOHZRbIiIiIqqWyVSuIiIi8Prrr6Nx48aQSCQICQkBAPz2229o27Ztrc/H6pVmjGFhYVaviOhZdu3ahZ07dyI7Oxvl5eVqn6WmpooUFRER6ZrJVK7mzJmDL774Av/6179w/Phx1QKU5ubmauud6BNjqF7pAqtXRCSmFStWICIiAm5ubvjjjz/QtWtXNGrUCNeuXcOgQYPEDo+IiHTIZCpX8+bNU/28YcMGtc+ysrIwdOjQWp+T1SvNsHpFRMbss88+w7p16/Dmm29i06ZNmDFjBlq0aI
GYmBgUFBSIHR4REemQySRX3333ndr7x48fIzMzE/Xq1UPLli0RExMjUmQ16+h4S+vrXhkDXSwsrG2lzhJB1r0SnEQCWFoqt0RUSXZ2Nnr06AEAsLGxQVFREQBgzJgx6NatG1atWiVmeEREpEOG/ddoLfzxxx+V2qRSKcaOHYsRI0bU+bysXmmG1SvN6GWC1akTUFYmdhREesvd3R0FBQVo1qwZmjZtipMnT8Lf3x+ZmZlQKPTs3zMREWmVyYy5qoqDgwPmzp2L2bNnix1KjXQx9qq58z2t96FtxjD2iogMzwsvvIAff/wRgHLypMjISPTv3x+jRo16ri/viIjI8JhM5ao6hYWFKCwsfK5zGEP1ShdYvdKM3lWv0tOB0FBg2zbA11fsaIj0zrp16yCXywEAEydORKNGjXDixAkMHToU7777rsjRERGRLplMcrVixQq19wqFArdv38bWrVsNYjYnXYy9MoaFhY1h7JXeefgQ+OMP5ZaIKjEzM4OZ2V8Pgrzxxht44403RIyIiIjEYjJ/hX7yySdq783MzODi4oLw8HBBpmJn9UozrF5pRu+qV0SkJi0tDR06dICZmRnS0tJq3NfPz09HURERkdhMJrnKzMwUO4TnxuqVZnRRveLU7ESmLSAgALm5uXB1dUVAQAAkEkmVk1dIJBLIZDIRIiQiIjGYTHKlC6xeacYYqle6wOoVkf7KzMyEi4uL6mciIiLAxGcLNEScOVAzupg58JGL1rvQD82bAzt3KrdEBABo1qwZJP9b+y0rKwuenp5o1qyZ2svT0xNZWVkiR0pERLrE5Epg3eyuaL0PXSRY2mbdiJMjaKLUWQ8W7m3QAPjnP5VbIqqkX79+KCgoqNReWFiIfv36iRARERGJhckVVYnVK82YRPUqLw9Yvly5JaJKFAqFqor1d/fu3YOtra0IERERkVg45koLdDH2SheTW2ibLsZeGcPU7KKPvbp5E3j/faBvX8DNTbw4iPTMq6++CkA5acXYsWNhZWWl+kwmkyEtLQ09evQQKzwiIhKBYf/VSVplDDMH6gJnDiQyTY6OjgCUlSt7e3vY2Pz1ZZGlpSW6deuGcePGiRUeERGJgMmVlrB6pRlWrzQjevWKiCrZuHGjavr1lStXws7OTuSIiIhIbBxzpUW6mNxC24xh7JUumMTYKyKqRKFQYNu2bbh9+7bYoRARkR5gcmXgOHOgZnQxuYW2iTZzoKMjMGSIcktkIFavXg1vb29YW1sjODgYp06dqnbf9evXo3fv3mjQoAEaNGiAkJCQGvf/OzMzM7Rq1Qr37vGLKCIiYnKldaxeacYYpmY32upVy5bAjz8qt0QGYMeOHYiKikJsbCxSU1Ph7++PAQMG4M6dO1Xuf+TIEbz55pv45ZdfkJycDC8vL7z00ku4efOmRv0tWrQI06dPx59//inkZRARkQGSKJ48ME41kkqlcHR0xNY/OqK+vXmtjtX22CsAWh97pYuJLbQ99gqA1sde6WJii6fHXsnKHyFlxwcoLCyEg4NDnc755P6u8hyPHwMPHgBOToCFRd2CJqqjGu/NagQHB6NLly5YtWoVAEAul8PLywvvvfceZs2a9czjZTIZGjRogFWrViEsLOyZ+zdo0AClpaWoqKiApaWl2sQWAKpcA4uIiIyTYY/yNxC6mNxC23Qxc6AuJrfQNl3MHKjzyS3OnQMCA4GUFKBzZ931S1QH5eXlSElJQXR0tKrNzMwMISEhSE5O1ugcpaWlePz4MRo2bKjR/vHx8XUJlYiIjBCTKyNhDDMH6oIxzBwohLKyMpSVlaneS6VSEaMheran71ErKyu1daWeuHv3LmQyGdyeWpPNzc0NFy9e1KivmTNnwsPDAyEhIRrtHx4ertF+RERk/DjmSkc49kozHHulmeed3CIuLg6Ojo6ql5eXl0CREWmHl5eX2j0bFxenlX4WLVqE7du347vvvoO1tXWtj3/06BGkUqnai4iITAe/wjciuqheGcPCwq
xeAdHR0YiKilK9l0ql1SZYRzLuoC+AlT9fxq3rNY250uxRxdqM8tR0X4U2+tZ0P4FjrE3nmseo4e9Hw/Mpz6nhfhqer627PSb2q/7x6ZycHLUxV1VVrQDA2dkZ5ubmyMvLU2vPy8uDu7t7jTF8/PHHWLRoEQ4fPgw/Pz8NIwdKSkowc+ZM7Ny5s8pZA2UymcbnIiIiw2baf2HqmDGMvdIFjr3SzPOMvarukaqqpN+Woi+AA3/m4vxdLpJK2vGg1LnG5MrBwUGjCS0sLS0RGBiIxMREDB8+HIByQovExERMmjSp2uOWLFmCBQsW4ODBgwgKCqpV7DNmzMAvv/yCNWvWYMyYMVi9ejVu3ryJtWvXYtGiRbU6FxERGTYmV0aG1SvNsHqlubYDeyPBJxUDrW0wwLzmmTI1eVhRouETjRJNdxSyTw2uQPNzabifBjtqEpem59KUpr9/oa7Tw0m4L1SioqIQHh6OoKAgdO3aFfHx8SgpKUFERAQAICwsDJ6enqpHCxcvXoyYmBh89dVX8Pb2Rm5uLgDAzs4OdnbP/kJhz5492LJlC/r27YuIiAj07t0bPj4+aNasGbZt24bQ0FDBro2IiPQb/7rUMVavNMPqlWZKnSWw0vI60v3aNUa/do212wmRgEaNGoX8/HzExMQgNzcXAQEBOHDggGqSi+zsbJiZ/TXkeM2aNSgvL8drr72mdp7Y2FjMmTPnmf0VFBSgRYsWAJQVtidTr/fq1Qvjx48X6KqIiMgQMLkyQqxeaUYX1StdJFhad/kyMGkSsGoV0KqV2NEQaWTSpEnVPgZ45MgRtffXr19/rr5atGiBzMxMNG3aFG3btsXOnTvRtWtX7NmzB05OTs91biIiMiycLVAEupg5sKOjlssZOmAMMwfqQmkjAZ8Fq0pREfDf/yq3RFRJREQEzp49CwCYNWsWVq9eDWtra0RGRmL69OkiR0dERLrEyhXVGatXmjGK6hURVSsyMlL1c0hICC5evIiUlBT4+PjUatZBIiIyfKxciYTVK82wekVE+koul2Px4sXo2bMnunTpglmzZuHhw4do1qwZXn31VSZWREQmiMkVPRddLCysbY+dK7Tehy4WFiYi3VqwYAH+85//wM7ODp6envj0008xceJEscOqtTlz5iAgIEDsMFSuX78OiUSCM2fOiB0KEVGtMbkSEatXmtFF9UoXCZbB8vJSTmZRzSLDRKZqy5Yt+Oyzz3Dw4EF8//332LNnD7Zt2wa5XC52aNWSSCT4/vvvxQ6DiMhoMbmi52YM1StdMNjqlYsLMHGicktEKtnZ2Rg8eLDqfUhICCQSCW7dMvwvtZ6lvLxc7BCIiPQSkyuRsXqlGVavRFRQAHz5pXJLRCoVFRWwtrZWa7OwsMDjx4+12m/fvn0xefJkzJgxAw0bNoS7u7tG63F5e3sDAEaMGAGJRKJ6/8TWrVvh7e0NR0dHvPHGGyj62wyhffv2xaRJkzB16lQ4OztjwIABAIA///wTgwYNgp2dHdzc3DBmzBjcvXtXddyBAwfQq1cvODk5oVGjRnjllVdw9epVtX5PnTqFTp06wdraGkFBQfjjjz/UPr9//z5CQ0Ph4uICGxsbtGrVChs3bqzFb4yISHeYXOkBXSRY2sbqlWYMsnp1/TowZoxyS0QqCoUCY8eOxauvvqp6PXr0CP/+97/V2rRh8+bNsLW1xW+//YYlS5Zg3rx5OHToUI3HnD59GgCwceNG3L59W/UeAK5evYrvv/8eP/30E3766SccPXoUixYtqtSnpaUljh8/joSEBDx48AAvvPACOnXqhN9//x0HDhxAXl4eXn/9ddUxJSUliIqKwu+//47ExESYmZlhxIgRqkcni4uL8corr6Bdu3ZISUnBnDlzMG3aNLV+Z8+ejQsXLmD//v1IT0/HmjVr4Ozs/Fy/PyIibeFU7CZCFwsLa5t1o4d4dM9Gq33oYmp2IjIO4eHhldpGjx6tk779/PwQGxsLAGjVqhVWrV
qFxMRE9O/fv9pjXP73aK+TkxPc3d3VPpPL5di0aRPs7e0BAGPGjEFiYiIWLFig2qdVq1ZYsmSJ6v38+fPRqVMnLFy4UNW2YcMGeHl54dKlS2jdujVGjhyp1s+GDRvg4uKCCxcuoEOHDvjqq68gl8vxxRdfwNraGu3bt8eNGzcwfvx41THZ2dno1KkTgoKCAKBSxY2ISJ/wr0g90c3uCk4W+4gdxnPRxbpXukiwtI3rXhEZBzEfTXt6mvfGjRvjzp07dT6ft7e3KrGq7nyBgYFq78+ePYtffvkFdnZ2lc539epVtG7dGpcvX0ZMTAx+++033L17V1Wxys7ORocOHZCeng4/Pz+1xyu7d++udq7x48dj5MiRSE1NxUsvvYThw4ejR48edb5WIiJtYnJlQoyheqULrF4Rkb6zsLBQey+RSJ5rlkJNzmdra6v2vri4GEOGDMHixYsrna9x48YAgCFDhqBZs2ZYv349PDw8IJfL0aFDh1pNiDFo0CBkZWVh3759OHToEF588UVMnDgRH3/8scbnICLSFY650iMce6UZY1hY2KDGXtnaAt26KbdEZNAsLCwgk8kEOVfnzp1x/vx5eHt7w8fHR+1la2uLe/fuISMjAx9++CFefPFF+Pr64v79+2rn8PX1RVpaGh49eqRqO3nyZKW+XFxcEB4eji+//BLx8fFYt26dINdARCQ0JlcmxhhmDtQFzhz4N23aAMnJyi0RGTRvb28kJiYiNze3UqJTWxMnTkRBQQHefPNNnD59GlevXsXBgwcREREBmUyGBg0aoFGjRli3bh2uXLmCn3/+GVFRUWrneOuttyCRSDBu3DhcuHAB+/btq1SRiomJwQ8//IArV67g/Pnz+Omnn+Dr6/tcsRMRaQuTKz3D6pVmWL0iIqq9ZcuW4dChQ/Dy8kKnTp2e61weHh44fvw4ZDIZXnrpJXTs2BFTp06Fk5MTzMzMYGZmhu3btyMlJQUdOnRAZGQkli5dqnYOOzs77NmzB+fOnUOnTp3wwQcfVHrM0NLSEtHR0fDz88M//vEPmJubY/v27c8VOxGRtkgUCoVC7CAMgVQqhaOjIy6mu+G8RWOt9qWLiS10MfZK25Nb6GJiC22PvRJiYgtZ2SOkf/YfFBYWwsHBoU7neHJ/V3mO1FQgMBBISQE6d37+gIlqocZ7k4iISM+wcqWHjKF6pQusXhERERGRPmFyVQfdrPPEDuG56WLslTEsLKyLsVdMsIhICNu2bYOdnV2Vr/bt24sdHhGRSeB803rKGNa90gVjWPeKiEgIQ4cORXBwcJWfPT3VOhERaQeTqzrqZp2Hk4/cxA7juehi3StdLCysbbpY94oLCxPR87K3t1dbCJiIiHSPjwXqMV2MvTKGqdmNYeyVXmvXDrh8WbklIiIiomoxuXoOxjD2Shc49kozejv2ytoa8PFRbolMRFxcHLp06QJ7e3u4urpi+PDhyMjI0HkcixYtgkQiwdSpU3XS382bNzF69Gg0atQINjY26NixI37//Xet9imTyTB79mw0b94cNjY2aNmyJT766CNoYzLjX3/9FUOGDIGHhwckEgm+//57tc8VCgViYmLQuHFj2NjYICQkBJcvXxY8DiIyXkyuasn8hm4Xl2X1SjOsXmlRZiYwerRyS2Qijh49iokTJ+LkyZM4dOgQHj9+jJdeegklJSU6i+H06dNYu3Yt/Pz8dNLf/fv30bNnT1hYWGD//v24cOECli1bhgYNGmi138WLF2PNmjVYtWoV0tPTsXjxYixZsgQrV64UvK+SkhL4+/tj9erVVX6+ZMkSrFixAgkJCfjtt99ga2uLAQMG4NGjR4LHQkTGiWOuasnqv2WAr6XqvTGMvdIFjr3SjF6Ovbp/H9i2DYiKApo3FzsaIp04cOCA2vtNmzbB1dUVKSkp+Mc//qH1/ouLixEaGor169dj/vz5Wu8PUCY5Xl5e2Lhxo6qtuQ7+zZ84cQLDhg3Dyy+/DADw9vbG119/jV
OnTgne16BBgzBo0KAqP1MoFIiPj8eHH36IYcOGAQC2bNkCNzc3fP/993jjjTcEj4eIjA8rV7Vk81/df3vF6pVmdFG90sXjgUSkfwoLCwEADRs21El/EydOxMsvv4yQkBCd9AcAP/74I4KCgvDPf/4Trq6u6NSpE9avX6/1fnv06IHExERcunQJAHD27FkkJSVVmwRpS2ZmJnJzc9V+546OjggODkZycrJOYyEiw8XKVS3VO1uBijG34eD41/iTF4JK8POo5oBEImJk+s8Yqle6UJvqVX5+PvK+jNNuQEQmTi6XY+rUqejZsyc6dOig9f62b9+O1NRUnD59Wut9/d21a9ewZs0aREVF4T//+Q9Onz6NyZMnw9LSEuHh4Vrrd9asWZBKpWjbti3Mzc0hk8mwYMEChIaGaq3PquTm5gIA3NzUn0Zxc3NTfUZE9CxMrjT0ZGBtMQCHX4AKPIJCAhT/qz6Kh1mjtFiu1f79kIHTJS202kdr8yxckLprtQ9ZqXYrfxY2j1BWoN11r8psAYt72v2nIyvTbL+/J1bPM/j7ybFSqbTyh8XFf22r+pxIi57ck9qY3EBTEydOxJ9//omkpCSt95WTk4MpU6bg0KFDsNbxJDJyuRxBQUFYuHAhAKBTp074888/kZCQoNXkaufOndi2bRu++uortG/fHmfOnMHUqVPh4eGh1X6JiLSByZWGioqKAABef29UAFhbqnzhjg6iOKeDPshQFRUVwdHRsc7HAoCXl1f1O/XpU6dzEwnhee7v5zFp0iT89NNP+PXXX9GkSROt95eSkoI7d+6gc+fOqjaZTIZff/0Vq1atQllZGczNzbXSd+PGjdHuqSUXfH198e2332qlvyemT5+OWbNmqcY0dezYEVlZWYiLi9NpcuXurvxyMS8vD40bN1a15+XlISAgQGdxEJFhY3KlIQ8PD+Tk5MDe3h4SPv5HekShUKCoqAgeHnVfEJr3N+krIe7vuvb73nvv4bvvvsORI0d0MrEDALz44os4d079i7SIiAi0bdsWM2fO1FpiBQA9e/asNN38pUuX0KxZM631CQClpaUwM1MfAm5ubg65XLtPhDytefPmcHd3R2JioiqZkkql+O233zB+/HidxkJEhovJlYbMzMx08q0lUV087zf6vL9Jn4lRsZo4cSK++uor/PDDD7C3t1eNuXF0dISNjfYePba3t680rsvW1haNGjXS+nivyMhI9OjRAwsXLsTrr7+OU6dOYd26dVi3bp1W+x0yZAgWLFiApk2bon379vjjjz+wfPly/N///Z/gfRUXF+PKlb8micrMzMSZM2fQsGFDNG3aFFOnTsX8+fPRqlUrNG/eHLNnz4aHhweGDx8ueCxEZJwkCjEfZCciItJD1VVwN27ciLFjx+o0lr59+yIgIADx8fFa7+unn35CdHQ0Ll++jObNmyMqKgrjxo3Tap9FRUWYPXs2vvvuO9y5cwceHh548803ERMTA0tLy2efoBaOHDmCfv36VWoPDw/Hpk2boFAoEBsbi3Xr1uHBgwfo1asXPvvsM7Ru3VrQOIjIeDG5IiIiIiIiEgDXuSIiIiIiIhIAkysiIiIiIiIBMLkiIiIiIiISAJMrIiIiIiIiATC5IiIiIiIiEgCTKyIiIiIiIgEwuSIiIiIiIhIAkysiIqJqlJWVYc6cOSgrKzOZvnnNRER1x0WEiYiIqiGVSuHo6IjCwkI4ODiYRN+8Zt1eMxEZF1auiIiIiIiIBMDkioiIiIiISAD1xA7AUMjlcty6dQv29vaQSCRih0OkolAoUFRUBA8PD5iZ1e37Et7fpK/Evr+lUqnaVpfE6pvXrDt1vb9Xr16NpUuXIjc3F/7+/li5ciW6du1a7f4PHjzABx98gN27d6OgoADNmjVDfHw8Bg8eLMRlENHfcMyVhm7cuAEvLy+xwyCqVk5ODpo0aVKnY3l/k77j/U3GrDb3944dOxAWFoaEhAQEBwcjPj4e33zzDTIyMuDq6l
pp//LycvTs2ROurq74z3/+A09PT2RlZcHJyQn+/v5CXwqRyWNypaHCwkI4OTkBAGxtgeTTlf8DRqRrfu3uqH5+8OABHB0d63Sev9/fAJB2gfc3iU/o+zsnJ4eTFWjizBmgTx/g6FEgIEDsaIyaVCqFl5dXre7v4OBgdOnSBatWrQKgrMx6eXnhvffew6xZsyrtn5CQgKVLl+LixYuwsLAQNH4iqoyPBWroyaMktrbAH+cawcKCw9VIfGkXbOHXrgQAnutxvr8fe+GSI2xseH+T+C5cckS71oUAhLm/HRwcmFxpws7ury1/Xzqh6f1dXl6OlJQUREdHq9rMzMwQEhKC5OTkKo/58ccf0b17d0ycOBE//PADXFxc8NZbb2HmzJkwNzcXJH4i+guTq1pKPu3KxIr0hr29/f8SrDvP3lkDaRdcmViR3rCxsUHaBata399lZWVq6xWJMXaIqDaevketrKxgZWVVab+7d+9CJpPBzc1Nrd3NzQ0XL16s8tzXrl3Dzz//jNDQUOzbtw9XrlzBhAkT8PjxY8TGxgp3EUQEgLMFEhGRkYmLi4Ojo6PqxfFWpO+8vLzU7tm4uDjBzi2Xy+Hq6op169YhMDAQo0aNwgcffICEhATB+iCiv7ByRURERiU6OhpRUVGq90/GtZCGXF2ByEjllnTi6fGAVVWtAMDZ2Rnm5ubIy8tTa8/Ly4O7u3uVxzRu3BgWFhZqjwD6+voiNzcX5eXlsLS0FOAKiOgJJldERGRUqnukasjKY6hnbQsAaGRnhZVvdoKbg7Wuw9N/TZoAy5eLHYVJ0XQ8oKWlJQIDA5GYmIjhw4cDUFamEhMTMWnSpCqP6dmzJ7766ivI5XLVdO+XLl1C48aNmVgRaQEfCyQiIpOQebcUV/NLcDW/BKcyC/DLRWHGKhqd4mIgOVm5Jb0TFRWF9evXY/PmzUhPT8f48eNRUlKCiIgIAEBYWJjahBfjx49HQUEBpkyZgkuXLmHv3r1YuHAhJk6cKNYlEBk1Vq6IiMgkbBrbBbb29lj1yxUcu3wXRY8qxA5JP126BPToAaSkAJ07ix0NPWXUqFHIz89HTEwMcnNzERAQgAMHDqgmucjOzlZbkNjLywsHDx5EZGQk/Pz84OnpiSlTpmDmzJliXQKRUWNyRUREJiGoeUM4ODhgT9otHLsMFJUxuSLDNGnSpGofAzxy5Eiltu7du+PkyZNajoqIAD4WSEREJsbOSrmQajErV0REJDAmV0REZFLsrZUPbRSXPRY5EiIiMjZMroiIyKT8lVyxclWlevUAZ2flloiIaoXJVS1FTn4AuVwudhhEAIDy8nIE+gs341n3Lnfw+DG/zSf9IJPJMHLYXcHPa2elTBo4oUU1/PyA/HzlloiIaoXJVS0lHi7HhHcLxQ6DCADQ0bcAQuZCJSVAZ797wp2Q6Dm8MrAAly8L/2UWkysiItIWJld1kHaW3+yTfigvF/6cXNqG9MX16zKtnNeOjwXW7Px5wMdHuSUiolphclUHfv4WYodABACwtBT+nHZ2wp+TqC68vc21cl57zhZYs7Iy4OpV5ZaIiGqFyVUtvRhiic/WOoodBhEA4Fx6Q1gImOvb2gKpaY2EOyHRc/jpQEO0aiX8/01xQgui2qmoqMDhw4exdu1aFBUVAQBu3bqFYj7qQFQJk6ta+mSFk9rK50RisrS0RMpZV8HOl3zaFRZCZmtEz8Hc3Bzf/uAs+Hn//ligTK4Q/PxExiQrKwsdO3bEsGHDMHHiROTn5wMAFi9ejGnTpokcHZH+YZZAREQm5cmEFgBQUs7qFVFNpkyZgqCgINy/fx82Njaq9hEjRiAxMVHEyIj0ExexICIik2JVzwwW5hI8lilQ/KgCDtas1qrx8QEOHFBuyeQdO3YMJ06cgOVTg3y9vb1x8+ZNkaIi0l+sXBERkUmRSCSq6hXHXVXBwQEYMEC5JZMnl8shk1WeufPGjRuwt7cXISIi/cbkioiITM6TcVdc66oKt2
8Dc+Yot2TyXnrpJcTHx6veSyQSFBcXIzY2FoMHDxYvMCI9xeSKiIhMjmo6dlauKrt9G5g7l8kVAQCWLVuG48ePo127dnj06BHeeust1SOBixcvFjs8Ir3DMVdERGRy/qpccVF4opo0adIEZ8+exY4dO3D27FkUFxfj7bffRmhoqNoEF0SkxOSKiIhMjv2TMVd8LJDomerVq4fQ0FCEhoaKHQqR3jOKxwJ//fVXDBkyBB4eHpBIJPj+++/VPlcoFIiJiUHjxo1hY2ODkJAQXL58WZxgiYhIdHZcSJhII3FxcdiwYUOl9g0bNvCxQKIqGEVyVVJSAn9/f6xevbrKz5csWYIVK1YgISEBv/32G2xtbTFgwAA8evRIx5ESEZE+eDJbICe0qEKDBkBoqHJLJm/t2rVo27Ztpfb27dsjISFBhIiI9JtRJFeDBg3C/PnzMWLEiEqfKRQKxMfH48MPP8SwYcPg5+eHLVu24NatW5UqXJqInPwAcrlcgKiJnp9MJsPIYXcFOx/vb9InQt/ff2dvzQktqtW8OfDll8otmbzc3Fw0bty4UruLiwtuc9ITokqMIrmqSWZmJnJzcxESEqJqc3R0RHBwMJKTk6s9rqysDFKpVO0FAImHyzHh3UKtx02kiVcGFuDy5donQ7y/yRDU9f7WhL01x1xV69Ej4MoV5ZZMnpeXF44fP16p/fjx4/Dw8BAhIiL9ZvTJVW5uLgDAzc1Nrd3NzU31WVXi4uLg6Oioenl5eak+SzvL2aVIP1y/XnlhR03w/iZDUNf7WxOqxwLLeL9XcuEC0KqVcksmb9y4cZg6dSo2btyIrKwsZGVlYcOGDYiMjMS4cePEDo9I73C2wGpER0cjKipK9V4qlar+APXztxArLCI13t7muHix9n+A8v4mQ1DX+1sTHHNFpJnp06fj3r17mDBhAsrLywEA1tbWmDlzJqKjo0WOjkj/GH3lyt3dHQCQl5en1p6Xl6f6rCpWVlZwcHBQewHAiyGW+Gyto/YCJqqFnw40RKtWtf9nzPubDEFd729NcLZAMmSrV6+Gt7c3rK2tERwcjFOnTlW776ZNmyCRSNRe1tbWGvclkUiwePFi5Ofn4+TJkzh79iwKCgoQExMjxKUQGR2jT66aN28Od3d3JCYmqtqkUil+++03dO/evdbn+2SFE8zMjP7XRgbC3Nwc3/7gLNj5eH+TPhH6/v47rnNFhmrHjh2IiopCbGwsUlNT4e/vjwEDBuDOnTvVHuPg4IDbt2+rXllZWbXu187ODl26dEGHDh1gZWX1PJdAZNSM4rHA4uJiXLlyRfU+MzMTZ86cQcOGDdG0aVNMnToV8+fPR6tWrdC8eXPMnj0bHh4eGD58uHhBExGRaDhbIBmq5cuXY9y4cYiIiAAAJCQkYO/evdiwYQNmzZpV5TESiaTGp3VqUlJSgkWLFiExMRF37typNKPstWvX6nReImNlFMnV77//jn79+qnePxlLEh4ejk2bNmHGjBkoKSnBv/71Lzx48AC9evXCgQMHalUWJyIiw1BWVoaysjLV+yezYf6dHWcLrF7nzoBCIXYUJuXpe9TKyqrK6lB5eTlSUlLUxjqZmZkhJCSkxhmQi4uL0axZM8jlcnTu3BkLFy5E+/btNYrtnXfewdGjRzFmzBg0btwYEolEw6siMk1GkVz17dsXihr+j0AikWDevHmYN2+eDqMiIiIxxMXFYe7cuTXu82RCi+LyCsjlCpiZ8Q9GEs/fZ2wFgNjYWMyZM6fSfnfv3oVMJqtyBuSLFy9Wee42bdpgw4YN8PPzQ2FhIT7++GP06NED58+fR5MmTZ4Z2/79+7F371707NlT8wsiMmEcXEFEREYlOjoahYWFqldOTk6lfZ6sc6VQACXlrF6pycgAundXbkkncnJy1O5ZIWfh6969O8LCwhAQEIA+ffpg9+7dcHFxwdq1azU6vkGDBmjYsKFg8RAZOyZXRERkVKqbDVNtn3pmqPe/ahXHXT2lpA
Q4eVK5JZ14+n6tbsIIZ2dnmJub13oG5L+zsLBAp06d1Maq1+Sjjz5CTEwMSktLNdqfyNQxuSIiIpMjkUg47ooMjqWlJQIDA9VmQJbL5UhMTNR4BmSZTIZz586hcePGGu2/bNkyHDx4EG5ubujYsSM6d+6s9iIidUYx5oqIiKi27K3r4UHpYxSxckUGJCoqCuHh4QgKCkLXrl0RHx+PkpIS1eyBYWFh8PT0RFxcHABg3rx56NatG3x8fPDgwQMsXboUWVlZeOeddzTqjzMrE9UOkysiIjJJdlYWAB6iiJUrMiCjRo1Cfn4+YmJikJubi4CAABw4cEA1yUV2drbaeoX379/HuHHjkJubiwYNGiAwMBAnTpxAu3btNOovNjZWK9dBZKyYXBERkUniQsLV8PYGtm5VbkkvTZo0CZMmTarysyNHjqi9/+STT/DJJ588V38PHjzArl27cPXqVUyfPh0NGzZEamoq3Nzc4Onp+VznJjI2HHNVS9u2lNY47TuRLslkMowcdlew840cppzml0gfKBQKbNuivUH0qjFXZY+11odBatgQGD1auSWTl5aWhtatW2Px4sX4+OOP8eDBAwDA7t27BZ3VkMhYMLmqpcWLirFpA2fMIf3wysACXL4sF+x8ly/L8crAAsHOR/Q8Nm0oxeJFxVo7/5O1rvhY4FPy84HVq5VbMnlRUVEYO3YsLl++DGtra1X74MGD8euvv4oYGZF+YnJVBym/81tO0g/XrwtfZdLGOYnq4nct/7f2r8oVkys1OTnApEnKLZm806dP4913363U7unpidzcXBEiItJvTK7qIDDIQuwQiAAA3t7mBnFOoroI0vJ/a+05FTvRM1lZWUEqlVZqv3TpElxcXESIiEi/MbmqpZmz7DD2/+qLHQYRAOCnAw3RqpVw/4xbtTLDTwc4zoL0w9j/q4+Zs+y0dn57PhZI9ExDhw7FvHnz8PixspIskUiQnZ2NmTNnYuTIkSJHR6R/mFzVUmhYfUgkErHDIAIAmJub49sfnAU737c/OMPcnJUr0g8SiQShYdr7MuvJmCs+FkhUvWXLlqG4uBiurq54+PAh+vTpAx8fH9jb22PBggVih0ekdzgVOxERmSQ7a+Vjh1xE+Cn29sBLLym3ZPIcHR1x6NAhJCUlIS0tDcXFxejcuTNCQkLEDo1ILzG5IiIik6SqXD3iJEVqWrUCDh4UOwrSM7169UKvXr3EDoNI7zG5IiIik+TA2QKrJpMBJSWArS3Ax4RN0ooVKzTed/LkyVqMhMjwMLmqpSMPW8DGnL820h8PH1YAuCPIuXh/k74R8v5+mh1nC6za2bNAYCCQkgJ07ix2NCSCTz75RO19fn4+SktL4eTkBAB48OAB6tevD1dXVyZXRE8RdUKLzZs3Y+/evar3M2bMgJOTE3r06IGsrCwRIyMiImPHRYSJqpaZmal6LViwAAEBAUhPT0dBQQEKCgqQnp6Ozp0746OPPhI7VCK9I2pytXDhQtjY2AAAkpOTsXr1aixZsgTOzs6IjIwUMzQiIjJyqspVeQXkcoXI0RDpp9mzZ2PlypVo06aNqq1Nmzb45JNP8OGHH4oYGZF+EvX5n5ycHPj4+AAAvv/+e4wcORL/+te/0LNnT/Tt21fM0IiIyMjZWylnC1QogNLHMlUli4j+cvv2bVRUVK7uymQy5OXliRARkX4TtXJlZ2eHe/fuAQD++9//on///gAAa2trPHz4ULB+ZDIZZs+ejebNm8PGxgYtW7bERx99BIWC31QSEZkqawszmJsp1y3kuCuiqr344ot49913kZqaqmpLSUnB+PHjOR07URVE/Zquf//+eOedd9CpUydcunQJgwcPBgCcP38e3t7egvWzePFirFmzBps3b0b79u3x+++/IyIiAo6OjhyISURkoiQSCeyt6+FB6WMUlz0GYC12SPqhY0fgzh3gf5MXkGnbsGEDwsPDERQUBAsLZbW3oqICAwYMwOeffy5ydET6R9Tkav
Xq1fjwww+Rk5ODb7/9Fo0aNQKg/EbkzTffFKyfEydOYNiwYXj55ZcBAN7e3vj6669x6tQpwfogEoNMJkPc62cFO9+Rr25j4LgmkEgkgp2TqK7kcjk+fz9Dq33YWSmTK05q8TcWFoCLi9hRkJ5wcXHBvn37cOnSJVy8eBEA0LZtW7Ru3VrkyIj0k6jJlZOTE1atWlWpfe7cuYL206NHD6xbtw6XLl1C69atcfbsWSQlJWH58uXVHlNWVoaysjLVe6lUKmhMREL4aPhZ3L5a+0doq7u/dy/LgoWVGULCPQWLkaiuEt67iLQj97XaB2cMrMLVq0BkJPDJJ0DLlmJHQ3qidevWTKiINKDz5CotLU3jff38/ATpc9asWZBKpWjbti3Mzc0hk8mwYMEChIaGVntMXFyc4EkekdDuZD+q03E13d9X/yhCSPjzREUkjOt/Fmu9D3suJFxZYSGwZw8wZ47YkZAekMlk2LRpExITE3Hnzh3I5XK1z3/++WeRIiPSTzpPrgICAiCRSKBQKJ756JFMJhOkz507d2Lbtm346quv0L59e5w5cwZTp06Fh4cHwsOr/isyOjoaUVFRqvdSqRReXl6CxEMkFNem1rh5qbTWx9V0f7fsZC9YfETPw7uDHe7nFmi1jyeVK05oQVS1KVOmYNOmTXj55ZfRoUMHPjZO9Aw6T64yMzNVP//xxx+YNm0apk+fju7duwNQrne1bNkyLFmyRLA+p0+fjlmzZuGNN94AAHTs2BFZWVmIi4urNrmysrKClZWVYDEQacPs7/0xd8iZWj8aWN39/er7zfBimIdQ4RE9l3+vbIvV49O1+mignbVygH4RK1dEVdq+fTt27typmnSMiGqm8+SqWbNmqp//+c9/YsWKFWr/YP38/ODl5YXZs2dj+PDhgvRZWloKMzP1WefNzc0rlbaJDI25uTmid/pjcuBJQc7X963G/FaS9IaZmRneWdZGsPu7KqrHAlm5IqqSpaWlak1SIno2Ude5OnfuHJo3b16pvXnz5rhw4YJg/QwZMgQLFizA3r17cf36dXz33XdYvnw5RowYIVgfRERkeOxVE1o8FjkSPeLpCSxbptySyXv//ffx6aefcm1QIg2Jmlz5+voiLi4O5eXlqrby8nLExcXB19dXsH5WrlyJ1157DRMmTICvry+mTZuGd999Fx999JFgfRARkeFRjbniY4F/cXMDoqKUW9JLq1evhre3N6ytrREcHKzx0jLbt2+HRCKp1ZNBSUlJ2LZtG1q2bIkhQ4bg1VdfVXsRkTpRp2JPSEjAkCFD0KRJE9XMgGlpaZBIJNizZ49g/djb2yM+Ph7x8fGCnZOIiAyf3f8eC+SYq7+5fx84fBgICQEaNBA7GnrKjh07EBUVhYSEBAQHByM+Ph4DBgxARkYGXF1dqz3u+vXrmDZtGnr37l2r/pycnPikD1EtiJpcde3aFdeuXcO2bdtUC9ONGjUKb731FmxtbcUMjYiITABnC6xCZibw+utASgqTKz20fPlyjBs3DhEREQCUX1Tv3bsXGzZswKxZs6o8RiaTITQ0FHPnzsWxY8fw4MEDjfvbuHGjEGETmQxRkysAsLW1xb/+9S+xwyAiIiNRm0Xguc4V6YOn79HqZnQtLy9HSkoKoqOjVW1mZmYICQlBcnJyteefN28eXF1d8fbbb+PYsWO1jq+iogJHjhzB1atX8dZbb8He3h63bt2Cg4MD7Ozsan0+ImMmenIFABcuXEB2drba2CsAGDp0qEgRERGRoarNIvD2/5uKnZUrEtPT62jGxsZiThWLON+9excymQxuT42Hc3NzUz0B9LSkpCR88cUXOHPmTJ1iy8rKwsCBA5GdnY2ysjL0798f9vb2WLx4McrKypCQkFCn8xIZK1GTq2vXrmHEiBE4d+6camFhAKqpoIVaRJiIiExHbRaBt+NsgaQHcnJy4ODgoHov1DqbRUVFGDNmDNavXw9nZ+c6nWPKlCkICgrC2bNn0ahRI1
X7iBEjMG7cOEHiJDImoiZXU6ZMQfPmzZGYmIjmzZvj1KlTuHfvHt5//318/PHHYoZGREQGqjaLwHNCiyrY2ACdOim3pBMODg5qyVV1nJ2dYW5ujry8PLX2vLw8uLu7V9r/6tWruH79OoYMGaJqe7LGZ7169ZCRkYGWLVvW2OexY8dw4sQJWFpaqrV7e3vj5s2bz4yZyNSIOhV7cnIy5s2bB2dnZ5iZmcHMzAy9evVCXFwcJk+eLGZoRERkAuz/NhU71/H5H19fIDVVuSW9YmlpicDAQCQmJqra5HI5EhMT0b1790r7t23bFufOncOZM2dUr6FDh6Jfv344c+ZMtRXdv5PL5VU+SXTjxg3Y29s/3wURGSFRK1cymUz1D9PZ2Rm3bt1CmzZt0KxZM2RkZIgZGhERmYAnlSuFAij9f/buO66J+40D+CcJhD1FQBDBrThAQRG3FcXduuqgihT91UEd2DrqtlrQVkvrwr2qtba1w4WDghMXiDjBKgoqQ0U2hJDc7w8kNbKSI+FCeN6vV15wl7vvPYdf4Z58V5EERnoaMRSZkAoFBgbC19cX7u7u6Ny5M0JCQpCXlyebPXDixImwt7dHUFAQ9PX10bZtW7nzzc3NAaDM/or0798fISEh2LZtG4CSoRu5ublYtmwZBg0apLobI0RLcNpy1bZtW9y6dQsA4OHhgbVr1+LSpUtYuXIlmjRpwmVoFYo8mEKfbhKNIZFIEPTxLZWVR/WbaBJV1+/yGOgKIOCXjPOlGQPfunkT0NMr+Uo0zpgxY/Ddd99h6dKlcHV1RWxsLMLCwmSTXCQlJSElJUVl11u3bh0uXboEZ2dnFBYWYvz48bIugWvWrFHZdQjRFpx+RLd48WLk5eUBKJkmdMiQIejRowfq1auHX375hcvQKnRk3VPo6vHh5WvPdSiE4OuPbiHlUYHKyqP6TTSJqut3eXg8Hoz1dJBVIEZOYTFsqh72ov0YBigqKvlKNFJAQAACAgLKfS8yMrLSc/fs2aPUtRo2bIhbt27h0KFDiIuLQ25uLvz9/eHj4wMDGpdHSBmcJlfe3t6y75s1a4YHDx4gIyMDFhYWshkDNdGjmznw8uU6CkKA9KRClZdJ9ZtoCnXU7/L8l1zRjIGElEdHRweffPIJ12EQUitw2i2w1L///otTp06hoKAAlpaWXIdTpaYdaAAn0QzWjfRVXibVb6Ip1FG/y0MLCRNSufj4eAQEBKBv377o27cvAgICKlxXi5C6jtPk6vXr1+jbty9atGiBQYMGyfoI+/v7Y+7cuVyGVqERcx3Rd6Id12EQAgBY8qcLGjRVXbcMqt9Ek6i6flekdK0rWkiYkLJ+//13tG3bFtHR0XBxcYGLiwtiYmLQrl07/P7771yHR4jG4TS5mjNnDnR1dZGUlARDQ0PZ/jFjxiAsLIzDyCrWe3wDje6ySOoWgUCAhYddVFYe1W+iSVRdvytCa129p3Vr4M4dmoqdAADmzZuHhQsXIioqCuvXr8f69etx+fJlfPXVV5g3bx7X4RGicThNrk6fPo01a9agYcOGcvubN2+Op0+fchQVIYSQusREXxcAtVzJGBgAbdrQIsIEAJCSkoKJEyeW2f/JJ5+odFZCQrQFp8lVXl6eXItVqYyMDOjp6XEQESGEkLrGWI/GXMl5+hSYPLnkK6nzevfujQsXLpTZf/HiRfTo0YODiAjRbJzOFtijRw/s27cPX3/9NYCSKXGlUinWrl2LPn36cBkaIYSQOqJ0QguaLfCt16+BnTuB6dMBR0euoyEcGzZsGObPn4/o6Gh06dIFAHDlyhX8+uuvWLFiBf7++2+5Ywmp6zhNrtauXYu+ffvixo0bKCoqwrx583D37l1kZGTg0qVLXIZGCCGkjihtuYpPy8WZe2kKn2ckFKBzY0voCDRi4l1C1GL69OkAgM2bN2Pz5s3lvgeUfEAukUhqNDZCNBGnyVXbtm2RkJCAjRs3wsTEBLm5uRgxYgRmzJiBBg0acB
lahXobPIaJIf0hJZojRyJVWVlUv4mmUWX9rojp25ar8wkvcT7hpVLnLhrUGlN6NlFHWIRoBKlU/f8HCdEmnCVXYrEYAwYMQGhoKBYtWsRVGIQQQuq4AW0b4PzDV8jIK1L4nJc5IjzPLEBCWo4aIyNEsxQWFkJfv2bWnyOktuIsudLV1UVcXFyNXe/58+eYP38+Tp48ifz8fDRr1gy7d++Gu7u7UuXoPCsGWgvVFCUh3KL6TeoiWzN97JrUSalzDl5Nwld/3MabfMUTslrDxgZYsKDkK6nzJBIJvvnmG4SGhiItLQ0JCQlo0qQJlixZAicnJ/j7+3MdIiEahdP+P5988gl27typ9uu8efMG3bp1g66uLk6ePIl79+5h3bp1sLCwULosg9Na+IeUkLeofhOiGEujkg8hXivR2lVr2NsDQUElX0mdt3r1auzZswdr166FUPjfh29t27bFjh07OIyMEM3E6Zir4uJi7Nq1C2fPnoWbmxuMjIzk3l+/fr1KrrNmzRo4ODhg9+7dsn2NGzdmVZbhmULkzzJWSVyEaBqq34Qopp5xyUOmMl0Ja42cHCA6GnBzA0xMuI6GcGzfvn3Ytm0b+vbti6lTp8r2u7i44MGDBxxGRohm4jS5unPnDjp27AgASEhIkHuPx+Op7Dp///03vL29MXr0aJw7dw729vaYPn06pkyZonRZwlvFkE5KhbHJf32ORW66yPE1BFQYMyGKyMvLQ3tn1Y35oPpNNEl+fj7aO2dzHUa5LAzfJle5WphcPXwI9OlTkmC9/RtN6q7nz5+jWbNmZfZLpVKIxbR8ASHv4zS5ioiIqJHrPH78GFu2bEFgYCC++uorXL9+HTNnzoRQKISvr2+554hEIohEItl2dnbJH3g+ANtwACgEwweyphsh5xN68CTcaNuKXWJF9ZvUBm1aamZiBQD13nYLzBEVo6hYCqEOzbJJtJOzszMuXLgAx/fWPPvtt9/QoUMHjqIiRHNxmlzVFKlUCnd3d3zzzTcAgA4dOuDOnTsIDQ2tMLkKCgrCihUrKiyz2JaPVz+YobCrnlpiJkSdqH4TUj1mBroQ8HmQSBm8yS+CjSnNoEa009KlS+Hr64vnz59DKpXiyJEjiI+Px759+3Ds2DGuwyNE43D6UVteXh6WLFmCrl27olmzZmjSpIncS1UaNGgAZ2dnuX2tW7dGUlJShecsXLgQWVlZsldycrLc+2kHLOnBk9RaVL8JqR4+nwcLQ10AwGtt7BpIyFsffvghjh49irNnz8LIyAhLly7F/fv3cfToUfTr14/r8AjROJy2XE2ePBnnzp3DhAkT0KBBA5WOs3pXt27dEB8fL7cvISGhTBP3u/T09KCnV/HDpd61Iohb1ImGP6LB7jwwYdU1kOo3qQ3uxptqdNdASyMhXuUWad+kFrq6JTMF6upyHQnRED169MCZM2e4DoOQWoHTp6eTJ0/i+PHj6Natm1qvM2fOHHTt2hXffPMNPv74Y1y7dg3btm3Dtm3blC6rqAkfeCyF0YlC5H5iqIZoCVGckZER4u4ZoL1zukrKo/pNNImhoSHi7umrrH6rWumkFq/zRFUcWcu0awc8e8Z1FIQQUitxmlxZWFjA0tJS7dfp1KkT/vjjDyxcuBArV65E48aNERISAh8fH6XLSv2tHvS+zYXxrwXgv5FCakGDmIn2oPpNiOJKp2N/o20tV6TOs7CwULg3UUZGhpqjIaR24TS5+vrrr7F06VLs3bsXhobq/ZR8yJAhGDJkSLXLYfR5eL3WDAU99aB/pQj5A2kQM9EeVL8JUVzpQsJa1y3w9m1g4EDg5MmSVixS54SEhMi+f/36NVatWgVvb294enoCAKKionDq1CksWbKEowgJ0Vw1nlx16NBB7tOQf//9FzY2NnBycoLue/27Y2Jiajo8heUP0QcYhuswCFELqt+EVM3SqGTc4mttS67EYuD585KvpE56dyblkSNHYuXKlQgICJDtmzlzJjZu3IizZ8
9izpw5XIRIiMaq8eTqo48+qulLqg+t/UO0GdVvQipVT1tbrojG27RpE7799lukpqbCxcUFGzZsQOfOncs99siRI/jmm2/w77//QiwWo3nz5pg7dy4mTJig0LVOnTqFNWvWlNk/YMAALFiwoFr3QYg2qvHkatmyZTV9SUIIIUTlLIxKJ7Sg5IrUnF9++QWBgYEIDQ2Fh4cHQkJC4O3tjfj4eFhbW5c53tLSEosWLUKrVq0gFApx7Ngx+Pn5wdraGt7e3lVer169evjrr78wd+5cuf1//fUX6tWrp7L7IkRbcDrmqkmTJrh+/XqZ/5yZmZno2LEjHj9+zFFkhBBCSOVKW65oQgtSk9avX48pU6bAz88PABAaGorjx49j165d5bYk9e7dW2571qxZ2Lt3Ly5evKhQcrVixQpMnjwZkZGR8PDwAABcvXoVYWFh2L59e/VviBAtw+lUYE+ePIFEIimzXyQS4RlNA0sIIYQFkUiE7OxsuZc6aO2EFs2bAxERJV9JjXi/vopE5U/vX1RUhOjoaHh5ecn28fl8eHl5ISoqqsrrMAyD8PBwxMfHo2fPngrFNmnSJFy6dAmmpqY4cuQIjhw5AlNTU1y8eBGTJk1SqAxC6hJOWq7+/vtv2fenTp2CmZmZbFsikSA8PByNGzfmIjRCCCG1XFBQEFasWKH268harvKLIJUy4PO1ZJyiiQnwXmsHUS8HBwe57WXLlmH58uVljnv16hUkEglsbGzk9tvY2ODBgwcVlp+VlQV7e3uIRCIIBAJs3rwZ/fr1Uzg+Dw8PHDhwQOHjCanLOEmuSie14PF4cjPSAICuri6cnJywbt06DiIjhBBS2y1cuBCBgYGy7ezs7DIPr6pQOuZKygCZBWJZS1at9/w5sHEjEBAA2NtzHU2dkJycDFNTU9m2np6eSss3MTFBbGwscnNzER4ejsDAQDRp0qRMl0FCSPVxklxJpVIAQOPGjXH9+nVYWVlxEQYrc2ZmYttOc/D5tLgq4Z5UKsWcmZkqK2/kh69w8kw9CAQClZVJCFsMw+DAvnylz9PT01P5w2l5dAV8mOjrIKewGBl5Iu1JrtLSgOBgYPRoSq5qiKmpqVxyVRErKysIBAKkpaXJ7U9LS4OtrW2F5/H5fDRr1gwA4Orqivv37yMoKIiSK0LUgNMMITExUaHEql27dkhOTq6BiKoWfrYI0z/L4joMQgAA0z/LQvhZ1Y33ePhQiiEDMlRWHiHVsWdXPtYE53IdRqX+m46d1oQi6icUCuHm5obw8HDZPqlUivDwcNkCv4qQSqUVjusihFQPp7MFKurJkycQa9BihnG3NCcWUrfdUkNdfPKk7CQzhHDhxg3N/11raSTEk9f5yMijB1VSMwIDA+Hr6wt3d3d07twZISEhyMvLk80eOHHiRNjb2yMoKAhAyRhEd3d3NG3aFCKRCCdOnMD+/fuxZcsWLm+DEK1VK5IrTdPeRZfrEAgBALi46CI1RbUPdU5O1CWQaAZ3d12cOFbIdRiVsjQq6X5Ia12RmjJmzBi8fPkSS5cuRWpqKlxdXREWFiab5CIpKUlu6EJeXh6mT5+OZ8+ewcDAAK1atcJPP/2EMWPGcHULhGg1Sq6U1NdLiM1bzao+kJAasHmrGf7nn6myroHNm/NxLMxSJWURUl2TPjWEqJDR6K6Bsm6BuVqUXNWrB/j7l3wlGikgIAABAQHlvhcZGSm3vWrVKqxatUqp8keMGKHwsUeOHFGqbEK0HSVXSvr+R5rMgmgOPp+P7380R3vndJWU9/tfVhAIqH4TzcDj8eAz0VCjk6vSGQO1quXK0RHYsYPrKAiH3l0ihxCiHEquCCGEEJbeXetKaxQUAI8fA02aAAYGXEdDOLB7926uQyCk1qKPqAkhhBCWLGWzBWpRcnX/PtC2bclXQgghSuG85So8PBzh4eFIT0+XrX9VateuXQCArVu3llmNnCuRBU1gIOD8x0aITEFBMQDVdAuk+k00jSrrtzpYGr
/tFqhNY64Iec9vv/2Gw4cPIykpCUVF8nU9JiaGo6gI0UyctlytWLEC/fv3R3h4OF69eoU3b97IvUqNHz8eRkZGHEZKCCGElGVpqIUtV4S848cff4Sfnx9sbGxw8+ZNdO7cGfXq1cPjx48xcOBArsMjRONw+hF1aGgo9uzZgwkTJnAZBiGEEMLKu90CGYYBj8fjOCJCVGvz5s3Ytm0bxo0bhz179mDevHlo0qQJli5diowMWnSekPdx2nJVVFSErl27chkCIYQQwlq9t90CiyRS5BVpyQLcPB4gFJZ8JXVeUlKS7FnNwMAAOTk5AIAJEybg559/5jI0QjQSp8nV5MmTcfDgQS5DIIQQQlgzFOpAX7fkT6nWrHXVoQMgEpV8JXWera2trIWqUaNGuHLlCgAgMTERDMNwGRohGonTboGFhYXYtm0bzp49i/bt20NXV1fu/fXr16vlusHBwVi4cCFmzZqFkJAQtVyDEEJI3VDPSA/PMwvwOk+ERvUMuQ6HEJX64IMP8Pfff6NDhw7w8/PDnDlz8Ntvv+HGjRtKLTZMSF3BaXIVFxcHV1dXAMCdO3fk3lNXv/Xr169j69ataN++vVrKJ4QQUrdYGOnieWaB9kxqcf8+4OMDHDgAtG7NdTSEY9u2bZPN5jxjxgzUq1cPly9fxrBhw/DZZ59xHB0hmofT5CoiIqJGr5ebmwsfHx9s374dq1atYlVG5MEUDJjSkAYtE40glUqxY268ysqj+k00iarrt7pYGukBAF5rS3JVUADcvFnyldR5fD4ffP5/o0jGjh2LsWPHchgRIZqtTi1oM2PGDAwePBheXl5VJlcikQgikUi2nZ2dDQA4su4pdPX48PK1V2ushCgi9PMHiIt8U/WB76H6TWoDtvW7ptV7O2PgG21JrkidFxcXh7Zt24LP5yMuLq7SY6knECHy6kxydejQIcTExOD69esKHR8UFIQVK1aU+96jmznw8lVldISw8+ROLqvzqH6T2oBt/a5p707HTog2cHV1RWpqKqytreHq6goej1fu5BU8Hg8SiZbMkkmIitSJ5Co5ORmzZs3CmTNnoK+vr9A5CxcuRGBgoGw7OzsbDg4OAICmHUzUEichynJqa4w3qcqvM0L1m9QGbOt3TStNrrSmWyCp8xITE1G/fn3Z94QQxXE6FXtNiY6ORnp6Ojp27AgdHR3o6Ojg3Llz+PHHH6Gjo1Pupy56enowNTWVewHAiLmO6DvRrqZvgZByTd3QCu17Wyh9HtVvUhuwrd81Tetarho3Bg4fLvlK6iRHR0fZ2NunT5/C3t4ejo6Oci97e3s8ffqU40gJ0Tx1Irnq27cvbt++jdjYWNnL3d0dPj4+iI2NhUAgULis3uMb0GB/ojH4fD4mr2upsvKofhNNour6rS5al1xZWACjR5d8JXVenz59ZOtcvSsrKwt9+vThICJCNFud6BZoYmKCtm3byu0zMjJCvXr1yuwnhBBClFFP25KrtLSSadh9fAAbG66jIRxjGKbcD91ev34NIyMjDiIiRLPVieSKEEIIUReta7l6/hyYOxfo3ZuSqzqsdIFgHo+HSZMmQU9PT/aeRCJBXFwcunbtylV4hGisOptcRUZGch0CIYQQLVDv7TpXuaJiiIol0NNRvKs5IZrKzMwMQEnLlYmJCQwMDGTvCYVCdOnSBVOmTOEqPEI0Vp1NrgghhBBVMNHXgYDPg0TKICOvCA3MDKo+iRANt3v3btn06xs2bICxsTHHERFSO9SJCS0IIYQQdeHzebAw1LKugYSgpNXqwIEDSElJ4ToUQmoNSq4IIYSQatKqSS3MzIChQ0u+Eo20adMmODk5QV9fHx4eHrh27VqFx27fvh09evSAhYUFLCws4OXlVenx7+Lz+WjevDlev36tqtAJ0XqUXBFCCCHVpFWTWjRtCvz9d8lXonF++eUXBAYGYtmyZYiJiYGLiwu8vb2Rnp5e7vGRkZEYN24cIiIiEBUVBQcHB/Tv3x/Pnz
9X6HrBwcH48ssvcefOHVXeBiFai5IrQgghpJosjUuSq9e5WpBcicXAy5clX4nGWb9+PaZMmQI/Pz84OzsjNDQUhoaG2LVrV7nHHzhwANOnT4erqytatWqFHTt2QCqVIjw8XKHrTZw4EdeuXYOLiwsMDAxgaWkp9yKEyKMJLQghhJBqstSmMVe3bwNubkB0NNCxI9fRkHcUFRUhOjoaCxculO3j8/nw8vJCVFSUQmXk5+dDLBYrnBiFhISwCZWQOouSK0IIIVpFJBJBJBLJtrOzs9V+TVm3wHwtSK5IjXu/jurp6cmtK1Xq1atXkEgksHlv/TEbGxs8ePBAoWvNnz8fdnZ28PLyUuh4X19fhY4jhJSgboFK2jE3HlKplOswCAFQMpNT5EHVzeIUeTBFNvUuIVyTSqXYMTde6fOCgoJgZmYmezk4OKghOnn13nYLzNCGboGkxjk4OMjV2aCgILVcJzg4GIcOHcIff/wBfX19pc8vLCxEdna23IsQIo9arpQUF/kGoZ8/wPRNzlyHQgjC973AkXVPVVbekXVPoavHh5evvcrKJISt0M8fIC7yjdLnLVy4EIGBgbLt7OxstSdYpS1X91OzsfXcowqPM9LTwWj3hrTQMJGTnJwMU1NT2XZ5rVYAYGVlBYFAgLS0NLn9aWlpsLW1rfQa3333HYKDg3H27Fm0b99e4djy8vIwf/58HD58uNxZAyUSicJlEVIXUHLFwpM7uVyHQAgA4N+YHJWX+ehmDryoFwjRAGx/11bUpUqdbE1LWgGevs5H0MnKu2clvsrDkiH0AR35j6mpqVxyVRGhUAg3NzeEh4fjo48+AgDZ5BQBAQEVnrd27VqsXr0ap06dgru7u1KxzZs3DxEREdiyZQsmTJiATZs24fnz59i6dSuCg4OVKouQuoCSKxac2tIq5UQzNOtoguiwVyots2kHE5WWRwhbTm2N8SY1g+swFNKxkQVmezVHUkZ+hceIxFIcv52CPZefYLR7Q7SyrfphmhMuLkBWFmBkxHUkpByBgYHw9fWFu7s7OnfujJCQEOTl5cHPzw9Ayex+9vb2sq6Fa9aswdKlS3Hw4EE4OTkhNTUVAGBsbAxj46qfZ44ePYp9+/ahd+/e8PPzQ48ePdCsWTM4OjriwIED8PHxUd/NElILUXKlpPa9LTB1QyuuwyAEANB3oh3EIqnKugaOmOuIvhPtVFIWIdU1dUMrbJp2n1XXwJrG5/Mw26tFlcdJf4rGyTupWPLnHRz+zBM8Hq8GolOSQAAo0IpCuDFmzBi8fPkSS5cuRWpqKlxdXREWFiab5CIpKQl8/n9D6rds2YKioiKMGjVKrpxly5Zh+fLlVV4vIyMDTZo0AVDSwpaRUfKBR/fu3TFt2jQV3RUh2oMmtFDS5HUt5X5pEcIlHo+H3uMbqKy83uMbaObDHqmT+Hw+Jq9ryXUYKrVkiDMMhQJcf/IGv8cotohrjXv4EPD2LvlKNFJAQACePn0KkUiEq1evwsPDQ/ZeZGQk9uzZI9t+8uQJGIYp81IksQKAJk2aIDExEQDQqlUrHD58GEBJi5a5ubmqbokQrUEtV0q6mNkcQrGQ6zAIkSnKKwJwRSVlUf0mmkaV9VsT2JkbYGbf5gg++QBBJ+6jX2sbmBnqch2WvJwc4PTpkq8VkEoZLPv7Lu6nKDdbHJ/Hw+D2DTDR05E+yKkl/Pz8cOvWLfTq1QsLFizA0KFDsXHjRojFYqxfv57r8AjROJRcKenea1sICmp2oDQhlZHki6o+SEFUv4mmUWX91hSfdmuM36Of4WF6Lr49/QCrPmrHdUhKu5mcif1X2HVHvvYkA49f5mLZ0Dbg8ynB0nRz5syRfe/l5YUHDx4gOjoazZo1U2rWQULqCkquCCGEkBok1OFj5YdtMW77FRy4mgSnekYw1Ve89Uqow4eXsw2M9bj7E37lccmU3J0bW+LTbo0VPi8hLQffn03A3qineJ1XhHUfu9C09BpKKpXi22+/xd
9//42ioiL07dsXy5Ytg6OjIxwdHbkOjxCNRcmVkjLTjME3UH7hPULURVqgui5FVL+JplFl/dYknk3rYXgHe/xx8zlWHb+v9PnjOjsgaAR3rQalydWgtrYY0Lby9ZXeNaCtLRpbGSHwcCyOxaUgM1+M0AlunCaKpHyrV6/G8uXL4eXlBQMDA/zwww9IT0/Hrl27uA6Nld69e8PV1RUhISFqvc7y5cvx559/IjY2VqXlRkZGok+fPnjz5g2NddNw9NtMScI0XQiU+ISREHWTFKpuAUeq30TTqLJ+a5olQ5zB4wFZ+WKFzxEVS3Hx31f4K/YFFg12Vk9S4uAAbNxY8rUcRcVS3HhSMoNjl6b1lC5+qIsdzA118dn+aFz89xV6ro2AiX7F9yHg8TC1d1N87K7ehaCJvH379mHz5s347LPPAABnz57F4MGDsWPHDprYiwNdu3ZFSkoKzMzMuA6FVIGSK0IIIYQDlkZCrP/YValzGIZB3/Xn8PhlHo7HvcCYTo1UH1j9+sCMGRW+fft5JgrEElgaCdHCmt26eD2a18eh/3WB3+7reJ1XhIy8okqPX/H3XfRuWR/WJtSyXlOSkpIwaNAg2baXlxd4PB5evHiBhg0bchhZ3SQUCmFrq3grMeFOnfjoISgoCJ06dYKJiQmsra3x0UcfIT4+nuuwCCGEEKXweDxZC84v15PVc5GMDOCnn0q+luPK45L9Ho0tqzUhRfuG5jg/rw9+n9a10pdLQzPkFUnw/RmaGr4mFRcXQ19fPpnV1dWFWFzS0tq7d298/vnnmD17NiwsLGBjY4Pt27fLFjQ2MTFBs2bNcPLkSQCARCKBv78/GjduDAMDA7Rs2RI//PCDrOzCwkK0adMG//vf/2T7Hj16BBMTE4W7Il66dAm9e/eGoaEhLCws4O3tjTdv/lsnTyqVYt68ebC0tIStra3cdPRPnjwBj8eT686XmZkJHo+HyMhIACVd83g8HsLDw+Hu7g5DQ0N07dq10mfKR48eoUmTJggICADDMJXG//TpUwwdOhQWFhYwMjJCmzZtcOLECblrZ2ZmAij5+fN4vDKvJ0+eyGKfPHky6tevD1NTU3zwwQe4deuWQj9HUj11Irk6d+4cZsyYgStXruDMmTMQi8Xo378/8vLyuA6NEEIIUcqIjvYQ8HmIScrEv+kVT5fO2pMnwIQJJV/LUTreqksT5bsEvs9ITwdujhaVvhYPcQYA/HI9CfGparhfUi6GYTBp0iSMGDFC9iosLMTUqVMxYsQI3L17F6GhobCyssK1a9fw+eefY9q0aRg9ejS6du2KmJgY9O/fHxMmTEB+fj6kUikaNmyIX3/9Fffu3cPSpUvx1VdfydbN0tfXx4EDB7B371789ddfkEgk+OSTT9CvXz98+umnVcYbGxuLvn37wtnZGVFRUbh48SKGDh0KieS/rsV79+6FkZERrl69irVr12LlypU4c+aM0j+bRYsWYd26dbhx4wZ0dHQqjC8uLg7du3fH+PHjsXHjxiqXH5gxYwZEIhHOnz+P27dvY82aNTA2Ni732CNHjiAlJUX2GjFiBFq2bClbTHr06NFIT0/HyZMnER0djY4dO6Jv376yRaCJ+tSJboFhYWFy23v27IG1tTWio6PRs2dPjqIihBBClGdtoo8+La1x9n4aDt94hq8Gta6xa8uNt1JBcqWITk6WGNDGFmF3U7H6xH3s+7RzjVy3rvP19S2z75NPPpF9LxAIYGlpicWLFwMAFi5ciODgYFhZWWHKlCkAgKVLl2LLli2Ii4tDly5dsGLFCtn5jRs3RlRUFA4fPoyPP/4YAODq6opVq1Zh8uTJGDt2LJ4+fYpjx44pFO/atWvh7u6OzZs3y/a1adNG7pj27dtj2bJlAIDmzZtj48aNCA8PR79+/RS6RqnVq1ejV69eAIAFCxZg8ODBKCwslGvpu3z5MoYMGYJFixZh7ty5CpWblJSEkSNHol27kuUZmjRpUuGxlpaWsu+///
57/PPPP7h69SoMDAxw8eJFXLt2Denp6dDTK1le5bvvvsOff/6J3377Ta51kKhenUiu3peVlQVAvmIqKvPqJVj2/IAWPyRaieo3IbXDmE4OOHs/DUdinuFL75bQFdRMR5R3x1s1ty7/E3V1WDCwFcIfpOF8wkucS3iJXi3q19i166rdu3dX+n7v3r3lkheBQIB69erJEgMAslaU9PR0AMCmTZuwa9cuJCUloaCgAEVFRXB1dZUrd+7cufjzzz+xceNGnDx5EvXqKZbEx8bGYvTo0ZUe8/66XA0aNJDFpox3y2nQoAGAknts1KhkDGRSUhL69euH1atXY/bs2QqXO3PmTEybNg2nT5+Gl5cXRo4cWeVaYidPnsSCBQtw9OhRtGjRAgBw69Yt5ObmlvnZFRQU4NGjRwrHQ9ipE90C3yWVSjF79mx069YNbdu2rfA4kUiE7OxsuRcAZJw5jsyoCzUVLiFqQfWbkNqtd8v6sDLWw6vcIvzzQPmHQ7ZUNd5KWU5WRpjQxQkA8M3x+5BIKx+7QmqGrq787LI8Hk9uX+kHdVKpFIcOHcIXX3wBf39/nD59GrGxsfDz80NRkfxkJunp6UhISIBAIMDDh4qPszMwMGAVr1QqBQDZDIjvjosqHV9WWTnv3mOp+vXro3Pnzvj5559lf18VMXnyZDx+/BgTJkzA7du34e7ujg0bNlR4/L179zB27FgEBwejf//+sv25ublo0KABYmNj5V7x8fH48ssvFY6HsFPnkqsZM2bgzp07OHToUKXHBQUFwczMTPZyeGdK2sKkJ2qOkhD1ovpNSO2mK+BjpJs9AOCwqie2MDICunQp+fqeqEeqG2+lrJl9m8HMQBfxaTnYfuExHqRmK/XKKlB8ynuiepcuXULXrl0xffp0dOjQAc2aNSu3FeXTTz9Fu3btsHfvXsyfPx/37yu2Dlz79u0RHh7OOr769UtaQ1NSUmT72K5VZWBggGPHjkFfXx/e3t7IyVF8rKCDgwOmTp2KI0eOYO7cudi+fXu5x7169QpDhw7FyJEjMWfOHLn3OnbsiNTUVOjo6KBZs2ZyLysrK1b3RBRXp7oFBgQE4NixYzh//nyV04guXLgQgYGBsu3s7GzZA6h+Iyd1hkmI2lH9JqT2G+3mgK3nHiMiPh1p2YWwMVXRNOUtWwJRUWV2FxVLceNpScsVF8mVuaEQn3/QDKuO30fwyQcIPvlAqfN5PKC1rSk6N7aER2NLtLA1gaCSLtACPg8NLQyom7SKNG/eHPv27cOpU6fQuHFj7N+/H9evX0fjxo1lx2zatAlRUVGIi4uDg4MDjh8/Dh8fH1y5cgVCobDS8hcuXIh27dph+vTpmDp1KoRCISIiIjB69GiFEgoDAwN06dIFwcHBaNy4MdLT02XjydgwMjLC8ePHMXDgQAwcOBBhYWEVTk5Ravbs2Rg4cCBatGiBN2/eICIiAq1blz+mcuTIkTA0NMTy5cuRmpoq21+/fn14eXnB09MTH330EdauXYsWLVrgxYsXOH78OIYPHw53d3fW90WqVieSK4Zh8Pnnn+OPP/5AZGSk3H/kiujp6ckGAb7Lst9gmHv2UEeYhNQYqt+E1H7NrI3h7miBG0/fYEvkIwxsW/EaOMb6OnBuYFqtRCHuWSYKxdKS9a1sam681bsmejrhwsNXuPtC8a5WACBlGGTkFeFeSjbupWRjz+UnCp3XtWk9bJvorp7FmuuYzz77DDdv3sSYMWPA4/Ewbtw4TJ8+XTZV+4MHD/Dll19i586dsg/7Nm/ejPbt22PJkiVYs2ZNpeW3aNECp0+fxldffYXOnTvDwMAAHh4eGDdunMIx7tq1C/7+/nBzc0PLli2xdu1aue52yjI2NsbJkyfh7e2NwYMH48SJEzAqp0W4lEQiwYwZM/Ds2TOYmppiwIAB+P7778s99vz58wAAR0dHuf2JiYlwcnLCiRMnsGjRIvj5+eHly5ewtbVFz549ZePgiPrwmKom3dcC06
dPx8GDB/HXX3+hZcuWsv1mZmYK9dEFSj7ZNzMzQ5PF30CgT4sYEs0hKSzE41VfISsrC6ampqzKoPpNNJUq63d1ytBUh28kY95vcQodO6idLdaNdoWBUFD5gTExgJsbEB0NdOwo273xn4f47nQCBrWzxWYft+qEzYn07EJce5KBa4klr2dvCio9vkAsgUTKwNXBHHs/7QwzA91Kj2dDm+smIXVVnfgoZsuWLQBKZrZ51+7duzFp0qSaD4gQQghRgaHt7XD6bioSX1W+bmNSRj5O3E7FszdR2DHRHdYsuhCWTmbBRZdAVbA21ceQ9nYY0t5OoeNvP8vChF1XEZucifHbr2C/vwcsjSrvmkYIIXUiuaoDjXOEEELqIAOhADt8O1V53NXHr/HZT9GIe5aFDzddwg5fd7SxM1P4OlyPt+JCu4Zm+HlKF3yy4yruvsjGuG1X8NNkD9Q3KduluiIMw+DZmwLEp+a8nVgjB6lZhbL3xQWVJ8XkPwMHDsSFC+XPZvvVV1/hq6++quGIlKcN90CqVieSK0IIIaQu82hSD3/N6IZP91zHo5d5GLUlCgPb2YJfzhgs+8cJmAPg+zMJeP6opAthTqFYNt6qJte34lrrBqb45TNP+Oy4gvi0HHRafVal5UtF+SotT5vt2LEDBQXld+Vks24pF7ThHkjVKLlSkmE6IKBeAUSDSIqqPkZRVL+JplFl/a7rHOsZ4cj0bphxIAYX/32FIzHPyz2uTWoa5gA4ez8Nd9+YyL3XvZlVnZs9r5m1MQ5/5gm/3dfxuIrul+XRFfDQtL4xWjcwRStbEzhYGqJ0ibC83ByMClFtvNrK3t6e6xCqTRvugVSNkislGb2QQEdXwnUYhMgUi1VXH6l+E02jyvpNADMDXez264S/Y18gPUdU7jGCosY40O8sPqzfAEOE/3WB0xXwMMxVsfFK2saxnhHOBvZCRr7y2b6ZgS50BeUvK5qdXfHMcYSQ2omSKyUZPc+DjoD+2BPNUSwprPogBVH9JppGlfWblChZgLjytR4B5xqJpTbh83mwMlZ8vJU2CQ4OxsKFCzFr1iyEhIQAAAoLCzF37lwcOnQIIpEI3t7e2Lx5s9xU30lJSZg2bRoiIiJgbGwMX19fBAUFQUen6sdPiUSC5cuX46effkJqairs7OwwadIkLF68WNZ6yjAMli1bhu3btyMzMxPdunXDli1b0Lx5c1k5GRkZ+Pzzz3H06FHw+XyMHDkSP/zwQ7lrTp0/fx7ffvstoqOjkZKSgj/++AMfffQRAEAsFmPx4sU4ceIEHj9+DDMzM3h5eSE4OBh2dnZKXS8uLg4zZszA9evXUb9+fXz++eeYN29elTGUun//PubPn49z586huLgYzs7O+P3339GoUaNq/9t8++23OHLkCB48eAADAwN07doVa9askZttW1X/9pGRkQgMDMTdu3fh4OCAxYsXa8VEc5RcKYn3JAU8PvWbIpqDJ1Vdvymq30TTqLJ+EwUlJgJLlgBffw0osC4k0W7Xr1/H1q1b0b59e7n9c+bMwfHjx/Hrr7/CzMwMAQEBGDFiBC5dugSgJDkaPHgwbG1tcfnyZaSkpGDixInQ1dXFN998U+V116xZgy1btmDv3r1o06YNbty4AT8/P5iZmWHmzJkAgLVr1+LHH3/E3r170bhxYyxZsgTe3t64d+8e9N8uK+Lj44OUlBScOXMGYrEYfn5++N///oeDBw+WuWZeXh5cXFzw6aefYsSIEXLv5efnIyYmBkuWLIGLiwvevHmDWbNmYdiwYbhx44bsuKqul52djf79+8PLywuhoaG4ffs2Pv30U5ibm+N///tfpTEAwKNHj9C9e3f4+/tjxYoVMDU1xd27d2X3W91/m5iYGMyYMQOdOnVCcXExvvrqK/Tv3x/37t2TrdGlin/7xMREDB48GFOnTsWBAwcQHh6OyZMno0GDBvD29q6yfmiyOrHOlSqUrkXR18IXOvTwSTRIsbQI4W/2qmQdIK
rfRNOosn7TWkIKqmCdK6J6ml43c3Nz0bFjR2zevBmrVq2Cq6srQkJCkJWVhfr16+PgwYMYNWoUgJJFgFu3bo2oqCh06dIFJ0+exJAhQ/DixQtZi0ZoaCjmz5+Ply9fQiis/G/NkCFDYGNjg507d8r2jRw5EgYGBvjpp5/AMAzs7Owwd+5cfPHFFwCArKws2NjYYM+ePRg7dizu378PZ2dnXL9+He7u7gCAsLAwDBo0CM+ePZNrcXofj8crt9XoXdevX0fnzp3x9OlTNGrUSKHrbdmyBYsWLUJqaqrsZ7BgwQL8+eefePDgQZUxjB07Frq6uti/f3+5Man63+bly5ewtrbGuXPn0LNnT5WVP3/+fBw/fhx37tyRu7fMzEyEhYVV+DOvDcrvBEwIIYQQQuq0GTNmYPDgwfDy8pLbHx0dDbFYLLe/VatWaNSoEaKiogAAUVFRaNeunVxXMW9vb2RnZ+Pu3btVXrtr164IDw9HQkICAODWrVu4ePEiBg4cCKCk5SM1NVUuBjMzM3h4eMjFYG5uLkt0AMDLywt8Ph9Xr15V9sdRRlZWFng8HszNzRW+XlRUFHr27CmXwHh7eyM+Ph5v3ryp9HpSqRTHjx9HixYt4O3tDWtra3h4eODPP/+UHaPqf5usrCwA/81mqKryo6KiytQrb29vWRm1GSVXhNRiUqkUt3IiVFZeUuE9WheOaAyGYZBUeI/rMAipkw4dOoSYmBgEBQWVea+01aU0qShlY2OD1NRU2THvPlyXvl/6XlUWLFiAsWPHolWrVtDV1UWHDh0we/Zs+Pj4yJVR3jXejcHa2lrufR0dHVhaWioUQ2UKCwsxf/58jBs3TtbqqMj1qvNzSU9PR25uLoKDgzFgwACcPn0aw4cPx4gRI3Du3DlZGar6t5FKpZg9eza6deuGtm3bqrT8io7Jzs6ucLr62oLGXCmo9IGzmCkCpBwHQ8hbt3Ii8Ko4CUD1FssuPfdhwXUAQCN9GsxOuJdUeE9WJ1VRv7Ozs1USl9bLzf3vK/3M1Kq0Tmrah1rJycmYNWsWzpw5IzeWpyYdPnwYBw4cwMGDB9GmTRvExsZi9uzZsLOzg6+vLycxlRKLxfj444/BMAy2bNlSY9eVSkseQD/88EPMmTMHAODq6orLly8jNDQUvXr1Uun1ZsyYgTt37uDixYsqLVfbUXKloJycHADAucyfOY6EkPLl5OTAzMyM9bmlHhZclz3QEqIpVFG/HRwcVBmS9lPxgxqpWHXqtzpER0cjPT0dHd8ZcyeRSHD+/Hls3LgRp06dQlFRETIzM+VaMNLS0mBrawsAsLW1xbVr1+TKTUtLk71XlS+//FLWegUA7dq1w9OnTxEUFARfX19ZGWlpaWjQoIHcNVxdXWXXSU9Plyu3uLgYGRkZCsVQntLE6unTp/jnn3/kxsopcj1bW1vZz+HdmEvfq4yVlRV0dHTg7Cz/AWjr1q1lCZCtra1K/m0CAgJw7NgxnD9/Hg0b/je7qKrKr+jnYGpqCgMDg0p/DpqOkisF2dnZITk5GSYmJnVuAUWi2RiGQU5OTqUDc6tC9ZtoKqrfRJupon6rQ9++fXH79m25fX5+fmjVqhXmz58PBwcH6OrqIjw8HCNHjgQAxMfHIykpCZ6engAAT09PrF69Gunp6bKucmfOnIGpqWmZ5KA8+fn54PPlR68IBAJZ603jxo1ha2uL8PBwWTKVnZ2Nq1evYtq0abIYMjMzER0dDTc3NwDAP//8A6lUCg8PD6V/LqWJ1cOHDxEREYF69erJva/I9Tw9PbFo0SKIxWLo6urKfi4tW7aEhYVFpdcXCoXo1KkT4uPj5fYnJCTA0dERAODm5latf5vWrVsjICAAf/zxByIjI9H4vRlDq1t+6b+9p6cnTpw4IVf2mTNnZGXUagwhhBBCCCGV6NWrFzNr1izZ9tSpU5lGjRox//zzD3Pjxg3G09OT8fT0lL1fXFzMtG3blunfvz8TGxvLhIWFMfXr12
cWLlyo0PV8fX0Ze3t75tixY0xiYiJz5MgRxsrKipk3b57smODgYMbc3Jz566+/mLi4OObDDz9kGjduzBQUFMiOGTBgANOhQwfm6tWrzMWLF5nmzZsz48aNK/eaOTk5zM2bN5mbN28yAJj169czN2/eZJ4+fcoUFRUxw4YNYxo2bMjExsYyKSkpspdIJFL4epmZmYyNjQ0zYcIE5s6dO8yhQ4cYQ0NDZuvWrVXGwDAMc+TIEUZXV5fZtm0b8/DhQ2bDhg2MQCBgLly4oJJ/m2nTpjFmZmZMZGSk3D3m5+er9N/+8ePHjKGhIfPll18y9+/fZzZt2sQIBAImLCxMofqhySi5IoQQQgghlXo/uSooKGCmT5/OWFhYMIaGhszw4cOZlJQUuXOePHnCDBw4kDEwMGCsrKyYuXPnMmKxWKHrZWdnM7NmzWIaNWrE6OvrM02aNGEWLVokl8hIpVJmyZIljI2NDaOnp8f07duXiY+Plyvn9evXzLhx4xhjY2PG1NSU8fPzY3Jycsq9ZkREBAOgzMvX15dJTEws9z0ATEREhFLXu3XrFtO9e3dGT0+Psbe3Z4KDgxWKodTOnTuZZs2aMfr6+oyLiwvz559/ypVfnX+biu5x9+7dKin//Z+3q6srIxQKmSZNmshdozajda4IIYQQQgghRAVoKnZCCCGEEEIIUQFKrgghhBBCCCFEBSi5IoQQQgghhBAVoOSKEEIIIYQQQlSAkitCCCGEEEIIUQFKrgghhBBCCCFEBSi5IoQQQgghhBAVoOSKEEIIIYSwJhKJsHz5cohEojobA9fXpxg0By0iTAghhBBCWMvOzoaZmRmysrJgampaJ2Pg+voUg+aglitCCCGEEEIIUQFKrgghhBBCCCFEBXS4vHhQUBCOHDmCBw8ewMDAAF27dsWaNWvQsmVL2TGFhYWYO3cuDh06BJFIBG9vb2zevBk2NjayY5KSkjBt2jRERETA2NgYvr6+CAoKgo7Of7cXGRmJwMBA3L17Fw4ODli8eDEmTZqkcKxSqRQvXryAiYkJeDyeSu6fEFVgGAY5OTmws7MDn8/u8xKq30RTUf0m2kxb6nd2drbc17oYA9fX19QY2NTxTZs24dtvv0VqaipcXFywYcMGdO7cucLjMzMzsWjRIhw5cgQZGRlwdHRESEgIBg0aVP0bYoPhkLe3N7N7927mzp07TGxsLDNo0CCmUaNGTG5uruyYqVOnMg4ODkx4eDhz48YNpkuXLkzXrl1l7xcXFzNt27ZlvLy8mJs3bzInTpxgrKysmIULF8qOefz4MWNoaMgEBgYy9+7dYzZs2MAIBAImLCxM4ViTk5MZAPSil8a+kpOTWf9fpPpNL01/Uf2mlza/qH7TS9tfitbxQ4cOMUKhkNm1axdz9+5dZsqUKYy5uTmTlpZW7vEikYhxd3dnBg0axFy8eJFJTExkIiMjmdjYWNb/p6pLoya0ePnyJaytrXHu3Dn07NkTWVlZqF+/Pg4ePIhRo0YBAB48eIDWrVsjKioKXbp0wcmTJzFkyBC8ePFC1poVGhqK+fPn4+XLlxAKhZg/fz6OHz+OO3fuyK41duxYZGZmIiwsTKHYsrKyYG5ujmQApgCS/raEuJmuqn8EhCilo3Oa7PvMzEyYmZmxKofqN9FEqq7fjs28YNekh9yn+3k2OiioD4jqs/xTaFUIW6ssAEBbi1R0MXkEADh7MA2/fPdMdtiYLxrCa7xNuUV0NUgus2//3nwEB+fKthcsMMYEX0OFQmJ7rs5DMWyHvSmzP/WoJYqbyXd0uVzgoFAsylDkZ1bezwrg5udV3XPnzM7C6VMlM6qp5Pd3cnKdnUCAaK7sS5fgMGiQwnXcw8MDnTp1wsaNGwGUtMw6ODjg888/x4IFC8ocHxoaim+//RYPHjyArq5mPLdw2i3wfVlZJX+gLC0tAQDR0dEQi8Xw8vKSHdOqVSs0atRIllxFRUWhXbt2ct0Evb29MW3aNNy9excdOnRAVF
SUXBmlx8yePVvh2Er/GJu+fdW/W4zsDnrsbpQQFTEzB7IyS76vTncQqt9EEzVrAfybUPK9Kur303/PQkdoCPsm3WXvCYQ6EOgBfH2WyZUhoGNUCAAQGuvC0EQAABj6vwYQ6vGRcDMXLToYw9vXpsJ7MDEo21Vm2gwj6OnzEB0thpubLj71N1T4Z8D2XKO7EpT3aF58txh5HYRy+wx1BArFoownd/Plt+/ly36epcr7WQHc/Lyqe66np1CWXKnk97epKSVXRPMYGQFQrI4XFRUhOjoaCxculO3j8/nw8vJCVFRUuef8/fff8PT0xIwZM/DXX3+hfv36GD9+PObPnw+BQPW/pxShMcmVVCrF7Nmz0a1bN7Rt2xYAkJqaCqFQCHNzc7ljbWxskJqaKjvm3cSq9P3S9yo7Jjs7GwUFBTAwMCgTj0gkkpujv7TvqKiJAHgsgfGJQmR/YlSNOyak+hrY6SArs1jp8yqq3w8AdAaofhON8PE4E3yzIkfp8yqq3wCQ/eYJ7NG9vNNUisfjYcAkWwyYxP58/8lG8J9cc+canihJEsXNBciaawKzdTnQfSiB4YlC5H2iWGtMdbToaIyrJzP+2+5grPC5XPy8qnvup/6GEBUyci1fiqisfhOiqd6vp3p6etDTk/8Q99WrV5BIJOU+sz948KDcch8/fox//vkHPj4+OHHiBP79919Mnz4dYrEYy5YtU+1NKEhjZgucMWMG7ty5g0OHDnEdCoCSyTbMzMxkLweHki4Qz36rh6xxBjCMKgL/jZTjKEldN2J02Q8GFFFR/e4F4EZHXarfRCP4+hvhiwUmSp9XUf0GAFMLJxVGqD34b6TQiypC7ngDpJ2wQsEQfaSdsELuOAPoXa6Z3wfevjaYsKgRPAZZYsKiRvD2Lb8bpbbg8XgKdyF8V2X1mxBN5eDgIFdvg4KCVFKuVCqFtbU1tm3bBjc3N4wZMwaLFi1CaGioSspnQyOSq4CAABw7dgwRERFo2LChbL+trS2KioqQmZkpd3xaWhpsbW1lx6SlpZV5v/S9yo4xNTUtt9UKABYuXIisrCzZKzn5bT9vfR7S15ojdYM5DK4Usb5nQlSB7cNnRfU7YIEJTP+sR/WbaAQej4fxE5V/+KyofjduORB2jbupOkytoHelCK83muPNWjMwBiXddxgDHt58a4bXG82hVwO/D0pb+2b+0AwDJtnSzI4VqPD5hBBNVL8+ACA5OVmu3r7b9a+UlZUVBAJBuc/spc/072vQoAFatGgh1wWwdevWSE1NRVERN88xnCZXDMMgICAAf/zxB/755x80btxY7n03Nzfo6uoiPDxcti8+Ph5JSUnw9PQEAHh6euL27dtIT0+XHXPmzBmYmprC2dlZdsy7ZZQeU1pGefT09GT9l8vrx5w7xAB5A2hMCuEW24fPiur3+Ikl4wWofpParKL63cDJkx7YK1AwQA8FQ/TLf2+IPgro94HGqOr5hBCNYm8PAGXq7PtdAgFAKBTCzc1N7pldKpUiPDy8wmf2bt264d9//4VU+l/rekJCAho0aAChUFjuOerGaXI1Y8YM/PTTTzh48CBMTEyQmpqK1NRUFBQUAADMzMzg7++PwMBAREREIDo6Gn5+fvD09ESXLl0AAP3794ezszMmTJiAW7du4dSpU1i8eDFmzJgh+4ebOnUqHj9+jHnz5uHBgwfYvHkzDh8+jDlz5lTvBuiPNNFmVL8JqTuq+v9Ovw8IIWzkKjemMDAwENu3b8fevXtx//59TJs2DXl5efDz8wMATJw4Ua7Va9q0acjIyMCsWbOQkJCA48eP45tvvsGMGTNUehvK4HRCiy1btgAAevfuLbd/9+7dsgV+v//+e/D5fIwcOVJuEeFSAoEAx44dw7Rp0+Dp6QkjIyP4+vpi5cqVsmMaN26M48ePY86cOfjhhx/QsGFD7NixA97e3mq/R0IIIYQQQuqkf/9V6vAxY8bg5cuXWLp0KVJTU+Hq6oqwsDDZJBdJSU
lyixE7ODjg1KlTmDNnDtq3bw97e3vMmjUL8+fPV+ltKIPT5EqRJbb09fWxadMmbNq0qcJjHB0dceLEiUrL6d27N27evKl0jIRoMoZhcHBfftUHElILqbp+GzzPhY5ALNsucFB+vCKpGMMwOLU3DQkxuWjRsfLp5wkhpCIBAQEICAgo973IyMgy+zw9PXHlyhU1R6U4jZmKnRCivH278vFdsPJTVRNSG1D9rl1O7U3D/tVJACCbUn3ApPIHoRNCiLbSiNkCCSHsxNygGf2I9qL6XbskxMiPrUi4qdxYC0II0QaUXBFSi3V052YmHELU7USuM/Tb0fo9tUmLjvKL/iqzCDAhREvp1L1OcnXvjolGO5HrzHUItYrVxwyGZD/BsfWPuQ6FVBPV/bJ6TnCAWCSh+l1LlC76m3AzFy06GGv9IsCEEAW0bct1BDWOkislnc5rBX0e/diIZuDxeOg+zqHWPXxSIkEUUVvrd11VugjwgElcR0IIIdyhLIEQIkMfHhBCCCFEZe7f5zqCGkdjrgghhBBCCCGqJxJxHUGNo+SKEEIIIYQQQlSAkisl7fvyNqRSKddhEKIWVL8JIWwxDIOwPan4cea/CNuTCoZhuA5JozEMg/17aRF4QrhWXFyMs2fPYuvWrcjJKVlb8cWLF8jNZbecBCVXSroXmYHds25zHQYhakH1mxDCVukiwldPZmD/6iSc2pvGdUgabdfOfAQH01pghHDp6dOnaNeuHT788EPMmDEDL1++BACsWbMGX3zxBasyKbliIflODtchEKI2VL8JIZW5UNCo3P20iLBybtwQcx0CIerXpAnXEVRq1qxZcHd3x5s3b2BgYCDbP3z4cISHh7Mqk5IrFhzamnAdAiFqQ/WbEMIGLSKsHHd3Xa5DIET9TE25jqBSFy5cwOLFiyEUCuX2Ozk54fnz56zKpDmXleTc2xJ+P7TjOgxC1ILqNyGELVpEWDmf+htCVMhQ10Ci3VJTuY6gUlKpFBKJpMz+Z8+ewcSE3YfN1HKlpInftgOfTz82op2ofhNC2CpdRHjmD80wYJIteDyeSsvvYZCk0vK4xuPxMMHXkOswCFEvDU+u+vfvj5CQENk2j8dDbm4uli1bhkGDBrEqk1quCCGEEEIIIXXOunXr4O3tDWdnZxQWFmL8+PF4+PAhrKys8PPPP7Mqk5IrQgghhBBCSJ3TsGFD3Lp1C7/88gtu3bqF3Nxc+Pv7w8fHR26CC2VQckUIIYQQQgipk3R0dODj4wMfHx+VlEeDKwipxRiGwcWfk7kOgxC1oPpNagLDMNi5Iw/TpmZi5448WvyYEFUyN+c6gkoFBQVh165dZfbv2rULa9asYVUmJVeE1GLn9yfj2PrHXIdBiFpQ/SY1YdfOfKxYnoPjxwqxYnkOdu3M5zokQrSHkxPXEVRq69ataNWqVZn9bdq0QWhoKKsyKbkipBZLvJnFdQiEqA3Vb1IT3l/MNzqaFvclRGUKC7mOoFKpqalo0KBBmf3169dHSkoKqzJZJ1eZmZnYsWMHFi5ciIyMDABATEwM6wW3CCHKa9zBjOsQCFEbqt+kJry/mK+bGy3uS4jKPHjAdQSVcnBwwKVLl8rsv3TpEuzs7FiVyWpCi7i4OHh5ecHMzAxPnjzBlClTYGlpiSNHjiApKQn79u1jFQwhRDk9JzhALJJQ1ymilah+k5rwqX/JWlPR0WK4uenKtgkh2m/KlCmYPXs2xGIxPvjgAwBAeHg45s2bh7lz57Iqk1XLVWBgICZNmoSHDx9CX19ftn/QoEE4f/68wuWcP38eQ4cOhZ2dHXg8Hv78888yx9y/fx/Dhg2DmZkZjIyM0KlTJyQl/beQYGFhIWbMmIF69erB2NgYI0eORFpamlwZSUlJGDx4MAwNDWFtbY0vv/wSxcXFyt84IRqGx+Oh+zgHrsMgRC2ofpOawOPx4D/ZCJu3mMN/spHKFz8mhChn06ZNcHJygr6+Pjw8PHDt2rUKj92zZw94PJ
7c693cpCpffvkl/P39MX36dDRp0gRNmjTB559/jpkzZ2LhwoWs4meVXF2/fh2fffZZmf329vZIVWIl5ry8PLi4uGDTpk3lvv/o0SN0794drVq1QmRkJOLi4rBkyRK5H9qcOXNw9OhR/Prrrzh37hxevHiBESNGyN6XSCQYPHgwioqKcPnyZezduxd79uzB0qVLlbhjQgghhBBCiDr98ssvCAwMxLJlyxATEwMXFxd4e3sjPT29wnNMTU2RkpIiez19+lTh6/F4PKxZswYvX77ElStXcOvWLWRkZFQrT2DVLVBPTw/Z2dll9ickJKB+/foKlzNw4EAMHDiwwvcXLVqEQYMGYe3atbJ9TZs2lX2flZWFnTt34uDBg7KmvN27d6N169a4cuUKunTpgtOnT+PevXs4e/YsbGxs4Orqiq+//hrz58/H8uXLIRQKFY6XEEIIIYQQoh7r16/HlClT4OfnBwAIDQ3F8ePHsWvXLixYsKDcc3g8Hmxtbat1XWNjY3Tq1KlaZZRi1XI1bNgwrFy5EmJxyYw6PB4PSUlJmD9/PkaOHKmSwKRSKY4fP44WLVrA29sb1tbW8PDwkOs6GB0dDbFYDC8vL9m+Vq1aoVGjRoiKigIAREVFoV27drCxsZEd4+3tjezsbNy9e7fC64tEImRnZ8u9CNE0bNcBovpNagOq3zWP1nwihKiUq6vChxYVFSE6OlruuZ7P58PLy0v2XF+e3NxcODo6wsHBAR9++GGlz/fvy8vLw5IlS9C1a1c0a9ZM1jWw9MUGq5ardevWYdSoUbC2tkZBQQF69eqF1NRUeHp6YvXq1awCeV96ejpyc3MRHByMVatWYc2aNQgLC8OIESMQEREhu6ZQKIT5ewuU2djYyLonpqamyiVWpe+XvleRoKAgrFixQiX3Qoi6sF0HiOo3qQ2ofte80jWfAOD4sZIplP0nG3EZEnmPSCSCSCSSbdOHB6Q2eL+e6unpQU9PT27fq1evIJFIyn1uf1DBrIMtW7bErl270L59e2RlZeG7775D165dcffuXTRs2LDKuCZPnoxz585hwoQJaNCggUrGXLJKrszMzHDmzBlcvHgRcXFxyM3NRceOHeUyzeqSSqUAgA8//BBz5swBALi6uuLy5csIDQ1Fr169VHat8ixcuBCBgYGy7ezsbDg40MBqolkex7BbB4jqN6kNqH7XvPLWfPKfzFEwtQDDMNi1Mx83bojh7l4y06C6J8SgDw9IrfLwIQCU+R28bNkyLF++vNrFe3p6wtPTU7bdtWtXtG7dGlu3bsXXX39d5fknT57E8ePH0a1bt2rHUopVcpWUlAQbGxt0794d3bt3l+1nGAbJyclo1KhRtQOzsrKCjo4OnJ2d5fa3bt0aFy9eBADY2tqiqKgImZmZcq1XaWlpsr6Xtra2ZWYZKZ1NsLL+meVl1IRoGqlEyuo8qt+kNqD6XfPc3XVlLVYArflUFS5a+ujDA1Kr5OUBAJKTk2FqairbXd7vaCsrKwgEgjKzfr/7XF8VXV1ddOjQAf/++69Cx1tYWMDS0lKhYxXFasyVk5MTOnbsiEePHsntT09PR+PGjVUSmFAoRKdOnRAfHy+3PyEhAY6OjgAANzc36OrqIjw8XPZ+fHw8kpKSZFmsp6cnbt++LTfLyJkzZ2BqalomcSOktuELaMpgor2ofte8T/0NsWy5CYYM1cey5Sa05lMVymvpUzc9PT2YmprKvQjRdO/X2fKSK6FQCDc3N7nneqlUivDwcLnWqcpIJBLcvn0bDRo0UOj4r7/+GkuXLkV+fr5iN6IAVi1XQEkLUufOnXH48GH07dtXtl+Zwa+5ublymWViYiJiY2NhaWmJRo0a4csvv8SYMWPQs2dP9OnTB2FhYTh69CgiIyMBlHRP9Pf3R2BgICwtLWFqaorPP/8cnp6e6NKlCwCgf//+cHZ2xoQJE7B27VqkpqZi8eLFmDFjBn2ySWq9xh3McOvUS67DIEQtqH7XvNI1n6
groGLc3HRw/Nh/2x07sn6sIoSgZC1dX19fuLu7o3PnzggJCUFeXp5s9sCJEyfC3t4eQUFBAICVK1eiS5cuaNasGTIzM/Htt9/i6dOnmDxZsV9i69atw6NHj2BjYwMnJyfo6sq31sfExCh9D6x+C/B4PGzevBkHDhzA4MGDsXbtWsycOVP2nqJu3LiBPn36yLZLm7l9fX2xZ88eDB8+HKGhoQgKCsLMmTPRsmVL/P7773JdEb///nvw+XyMHDkSIpEI3t7e2Lx5s+x9gUCAY8eOYdq0afD09ISRkRF8fX2xcuVKNrdOiFa7+HMy+k52okU0iWaoI/XwQoHyXel7GCSpIZK6qTrjpt4/jn53ElI9Y8aMwcuXL7F06VKkpqbC1dUVYWFhskkukpKSwOf/1/HuzZs3mDJlClJTU2FhYQE3NzdcvnxZ4d5pH330kcrvgcewmGeVz+cjNTUV1tbWOHnyJMaNG4fRo0dj6dKlcHJygkQiUXmgXMvOzoaZmRmCr/eCvjF9MkU0w7fDr+L5g1wAJeu+se0eUlq/AWD4wuboNbH64yYJqa49c24jNqykS7cq6nfftvOgI/ivx0KBgwlybXVQYA0UWrOccty6EHb1MwEA7S1foLvpQ3blKInL5IpNMqgK6rrnnTvyZOOmAGDZchOFx01Nm5opN0ZtyFB9bN5irvC1c3KkaNM6XSX1uzplEKIu2U+ewKxx4zpVP1mNuXrXwIEDcfnyZURERGDIkCGqiIkQwqHEm+xmaCNE1Rp3MOM6BFIHVGfclLu7fBcimgCEkPeoeLIIdcjMzMSOHTuwcOFCZGRkACjpDvj8+XNW5bFqgunVqxeEQqFs29nZGVevXsWIESNowUFCalCnj2zxPFixGXEURQ+0RFP0nOAAsUjCaq0rQhRVnRkSSyf8iI4Ww81NlyYAIeR9r15xHUGl4uLi4OXlBTMzMzx58gRTpkyBpaUljhw5gqSkJOzbt0/pMlklVxEREWX21atXD+fOnWNTHCGEpV4TG6G4SKqyh88hgU3QcwJN6Us0A4/HQ/dxDpRcEbWqToJEE4AQUoVnz7iOoFKBgYGYNGkS1q5dCxMTE9n+QYMGYfz48azKVDi5ys7OlvWVrGo18LrSp5IQQgjRJlwsiss1SpAIqbuuX7+OrVu3ltlvb2+P1NRUVmUqnFxZWFggJSUF1tbWMDc3L/eXLcMw4PF4WjmhRSmaTY1okvP7k1X6qf6x9Y+hoydAb5rQgpA6iYtFcesqhmGwf6/q1tYhhChPT0+v3EajhIQE1K9fn1WZCidX//zzj2wF4/K6BdYVx9Y/hq6egGZTIxpBHZNPXP8zhZIrQjTchYJGapk9r7zJHahFRz127cxHcHAu12EQUqcNGzYMK1euxOHDhwGUtGQnJSVh/vz5GDlyJKsyFU6uevXqVe73dVHizSz0msh1FISUTD5ROlW1ytCcNITUWdWZ3IEo5/1ElhCt9M44Jk20bt06jBo1CtbW1igoKECvXr2QmpoKT09PrF69mlWZrCa0CAsLg7GxsWwx302bNmH79u1wdnbGpk2bYGFhwSqY2oJmUyOaQh2zqXX6yFZlZRFCahea/a7mvJ/IEqKVmjblOoJKmZmZ4cyZM7h48SLi4uKQm5uLjh07wsvLi3WZrNa5+vLLL2X9E2/fvo3AwEAMGjQIiYmJCAwMZB1MbUCzqRFNUjqbmqoMCWxCXV4JqcNKJ3fYvMUc/pONaHyxGn3qb4gFC4y5DoMQ9aol8zB0794d06dPx7x586qVWAEsW64SExPh7OwMAPj9998xdOhQfPPNN4iJicGgQYOqFZCm6z7Ogf7YEK1F9ZsQQmoGj8fDBF9DGndFtNvt21xHUMaPP/6o8LEzZ85UunxWyZVQKER+fskMN2fPnsXEiSUDkCwtLaucpp0QQgghhBBCuPD999/Lbb98+RL5+fkwNzcHAGRmZsLQ0BDW1tY1l1x1794dgYGB6NatG65du4ZffvkFQM
m0hQ0bNmRTJCGEEEIIIYSoVWJiouz7gwcPYvPmzdi5cydatmwJAIiPj8eUKVPw2WefsSqf1ZirjRs3QkdHB7/99hu2bNkCe3t7AMDJkycxYMAAVoEQQgghhBBCSE1ZsmQJNmzYIEusAKBly5b4/vvvsXjxYlZlsmq5atSoEY4dO1Zm//vNbMHBwZg6daqsmY0QQgghhBBCNEFKSgqKi4vL7JdIJEhLS2NVJquWK0V98803yMjIUOclCCGEEEIIIZqoTRuuI6hU37598dlnnyEmJka2Lzo6GtOmTWM9a6BakyuGodVICVEnhmFw8edkrsMgRC2oftc8hmGwc0cepk3NxM4defR3nBBSPbqavRD5rl27YGtrC3d3d+jp6UFPTw+dO3eGjY0NduzYwapMVt0C67J9X97G5E0u4PPVmpcSopDz+5NVuoDwxZ+T0XeyE03HTjTCORXX7/cZJOcg11b7Fr1nGAa7dubjxg0x3N1LFgJW9P/0rp35WLE8BwBkC9z6TzZSW6yEEC33WH2/w1Whfv36OHHiBBISEvDgwQMAQKtWrdCiRQvWZVJypaR7kRnYNTMOkze6ch0KIUi8maXS8o6tfwwdIR+9fR1VWi4hbFz/I4XrEGql6iRIvx4ukN/+tYCSK0IIe7VkiaYWLVpUK6F6FyVXLDy6lsl1CIQAABp3MENsWLpKy7z+ZyolV4TUYjduiOW2o6PF8J+s2LmvX0sr3SaEEG0ikUiwZ88ehIeHIz09HVKp/O+8f/75R+kyqW8bC0IDAdchEAIA6DnBAUMCm3AdBiFq0ekjW65DqJXc3eXHOLi5KT7mwdJSvvtgPcuy3Ql7GCSxC0wL0Rg1Qmq3WbNmYdasWZBIJGjbti1cXFzkXmyovOWqoKAABgYGAIAePXrIvtcmDduacB0CIQBKPnE5t1e1A/4tG+qrtDxC2Oo6rgH+XvcvpOKqj9U2DMPg1N40JMTkokVHY3j72ig8bupTf0MAJS1Wbm66sm1FfDzGUNalEABGf6z4uep2oaCRxiV21emCWVxcjF49XqktNkJI1Q4dOoTDhw9j0KBBKiuTVXI1c+ZM/Pjjj2X25+XlYciQIYiIiAAAnDhxonrRaag3Lwq5DoEQAMDKD6KQ81q1T558AU1mQTTDkq6X62RiBQCn9qZh/+qSROLqyZIlTQZMkm/JqyjZ4PF48J9spHBXwHdVJzGri27cKJLbjr5RpHBy1aXza7yi3IpoOzs7riOolFAoRLNmzVRaJqtugcePH8eyZcvk9uXl5WHAgAHlLsRVkfPnz2Po0KGws7MDj8fDn3/+KXtPLBZj/vz5aNeuHYyMjGBnZ4eJEyfixYsXcmVkZGTAx8cHpqamMDc3h7+/P3Jzc+WOiYuLQ48ePaCvrw8HBwesXbtW+Zt+B82kRjRFzquiqg9SklRC3VqIZhDlSbgOgTMJMfJ/xxJu5lZwpGqVJmabt5jDf7IR/b2rwvuPPGLFH4Hw6hX9riV1gLW10qds2rQJTk5O0NfXh4eHB65du6bQeYcOHQKPx8NHH32k8LXmzp2LH374QaVdelklV6dPn8b27dsREhICAMjJyUG/fv3A4/EQFhamcDl5eXlwcXHBpk2byryXn5+PmJgYLFmyBDExMThy5Aji4+MxbNgwueN8fHxw9+5dnDlzBseOHcP58+fxv//9T/Z+dnY2+vfvD0dHR0RHR+Pbb7/F8uXLsW3bNja3DgDo9FED1ucSokoCXdU/+FDLFdEUekZ1d3xri47G8tsdjCs4knBJ8F4V1VGiP5CVFf2uJXXAmzdKHf7LL78gMDAQy5YtQ0xMDFxcXODt7Y309Mon73ry5Am++OIL9OjRQ6nrXbx4EQcOHEDTpk0xdOhQjBgxQu7FBqtugU2bNkVYWBj69OkDPp+Pn3/+GXp6ejh+/DiMjBSfsnXgwIEYOHBgue+ZmZnhzJkzcvs2btyIzp07IykpCY0aNcL9+/cRFhaG69evw93dHQCwYc
MGDBo0CN999x3s7Oxw4MABFBUVYdeuXRAKhWjTpg1iY2Oxfv16uSRMUUMCm6DXBAelzyNEHQbMdMSx756otMwmHc1VWh4hbH19uSsWuF+ok10DvX1tAJS0WLXoYCzbJpqlUychThwXybbd3YUKnxt11RLt275GXp46IiNEQzx9qtTh69evx5QpU+Dn5wcACA0NxfHjx7Fr1y4sWLCg3HMkEgl8fHywYsUKXLhwAZmZmQpfz9zcHMOHD1cqxqqwntCiffv2OHbsGPr16wcPDw8cO3ZM7ZNXZGVlgcfjwdzcHAAQFRUFc3NzWWIFAF5eXuDz+bh69SqGDx+OqKgo9OzZE0Lhf7/wvL29sWbNGrx58wYWFuUvICkSiSAS/fcLM/vtPP3dxzlQNwmiMfjvf2yqoIrq95DAJuhJHx4QDXHp4AtWiVVF9bs8xqnFKLDWvFVJeDweBkyyxYBJXEdCKlOdMWp7dheySqyUqd+EaIr366menh709PTk9hUVFSE6OhoLFy6U7ePz+fDy8kJUVFSFZa9cuRLW1tbw9/fHhQsXlIpr9+7dSh2vCIX/onTo0KHcpEJPTw8vXrxAt27dZPtiYmJUE907CgsLMX/+fIwbNw6mpqYAgNTUVFi/15dTR0cHlpaWSE1NlR3TuHFjuWNsbGxk71WUXAUFBWHFihWqvg1CVCpyN7uZAiuq3/ThAdEkESqu34SoWnUmD9m2NZ/VNal+k9rIwUH+g9tly5Zh+fLlcvtevXoFiUQie04vZWNjgwcPHpRb7sWLF7Fz507Exsayjq24uBiRkZF49OgRxo8fDxMTE7x48QKmpqYwNla+S7bCyZUyg8NUTSwW4+OPPwbDMNiyZUuNXHPhwoUIDAyUbWdnZ5epGIRwrSBLidHT76D6TWoDqt9Em2VmshtAT/Wb1EbJycmyxhEAZVqt2MjJycGECROwfft2WFlZsSrj6dOnGDBgAJKSkiASidCvXz+YmJhgzZo1EIlECA0NVbpMhZOr92cHrCmlidXTp0/xzz//yP3D2NralhngVlxcjIyMDNja2sqOSUtLkzumdLv0mPKU11xJiKYxMNOBOF35GQOpfpPagOo30Wbm5nykpUmVPo/qN6lV3g4ZMjU1lXuGL4+VlRUEAkG5z+3lPbM/evQIT548wdChQ2X7pNKS/1M6OjqIj49H06ZNK73mrFmz4O7ujlu3bqFevXqy/cOHD8eUKVMqv7cKsJotsFRRURGePXuGpKQkuZeqlCZWDx8+xNmzZ+VuGgA8PT2RmZmJ6Oho2b5//vkHUqkUHh4esmPOnz8Psfi/jvtnzpxBy5YtK+wSSEht0efTRlyHQIjaUP0m2uyzqbSGGKkDWrZU+FChUAg3NzeEh4fL9kmlUoSHh8PT07PM8a1atcLt27cRGxsrew0bNgx9+vRBbGysQi26Fy5cwOLFi+XmZgAAJycnPH/+XOHY38VqFG9CQgL8/f1x+fJluf0Mw4DH40EiUWxtktzcXPz777+y7cTERMTGxsLS0hINGjTAqFGjEBMTg2PHjkEikcjGUVlaWkIoFKJ169YYMGAApkyZgtDQUIjFYgQEBGDs2LGwe7to2fjx47FixQr4+/tj/vz5uHPnDn744Qd8//33bG6dEI3Se2IjSIqkOLb+MdehEKJyVL+JNvOfbIQiERAcXDNrmBFSGwQGBsLX1xfu7u7o3LkzQkJCkJeXJ5s9cOLEibC3t0dQUBD09fXRtm1bufNLJ717f39FpFJpuXnLs2fPYGJiwuoeWCVXfn5+0NHRwbFjx9CgQQPWA+Bv3LiBPn36yLZL+xD7+vpi+fLl+PvvvwEArq6ucudFRESgd+/eAIADBw4gICAAffv2BZ/Px8iRI/Hjjz/KjjUzM8Pp06cxY8YMuLm5wcrKCkuXLmU1DTshhBBC1INhGJzam4aEmFy06Fgy/bw2T7DD4/EwwdeQkiui3W7dUurwMWPG4OXLl1i6dClSU1Ph6uqKsLAw2S
QXSUlJ4POr1fFOTv/+/RESEiJb/5bH4yE3NxfLli3DoEGDWJXJKrmKjY1FdHQ0WrVqxeqipXr37l3pisiKrJZsaWmJgwcPVnpM+/btlZ6akZDa4Pz+ZPpUn2gtqt91y6m9adi/umRowdWTGQCAAZMqHhtNCKkFFHiWf19AQAACAgLKfS8yMrLSc/fs2aPUtdatWwdvb284OzujsLAQ48ePx8OHD2FlZYWff/5ZqbJKsUqunJ2d8erVK1YXJISoTuLNLK5DIERtqH7XLQkx8i04CTdzaZ0vQohaNWzYELdu3cKhQ4cQFxeH3Nxc+Pv7w8fHh/X6vaySqzVr1mDevHn45ptv0K5dO+jq6sq9X9VsIIQQ1XDqYIbYsPSqD1TQxZ+T0Xeyk1Z3xSG1R2MV129SNYZhsGtnPm7cEMPdvWRR3Jr6fdCio7GsxQoAWnRQfn0ZQghRlo6ODj755BPVlcfmJC8vLwBA37595fYrO6FFbUQPn0QZ/7yqXtfZqjzKLQTwUGXlHVv/GLp6AvSaSLO0Ee71+KQh4qNe415kRtUHE5XYuSMPK1eUtCAdP1YIhmEweUrNJDneviVjKhJu5qJFB2PZNiGEqFN8fDw2bNiA+/fvAwBat26NgIAA1sOfWCVXERERrC6mDejhs3ZRd3LDtYzbKSovM/FmFnpNVHmxRENp8v+RR4djKbGqYb/9Wii//VthjSVXPB4PAybZsuoKyGWLGyGkEkpMxc6F33//HWPHjoW7u7tsuvcrV66gXbt2OHToEEaOHKl0maySq169erE5TWvQw6fiNPnBTRswUuUHilalcQczlZdZm1Ed5k7yyXiuQ6jzakt6smtnPlYszwFQ0uIGlEx1TgjhGMtxSzVl3rx5WLhwIVauXCm3f9myZZg3b17NJVfnz5+v9P2ePXuyKbbW0PSHT3oYrDt4AtVNRwoAQwKboOeEsovuUZ0i3FD9hwekcqM/NpAlKQAwarRmPxiVunFDLLcdHS2G/2SOgiGE/CcpiesIKpWSkoKJE8u2mHzyySf49ttvWZXJKrkqXWPqXe82v2vzmKv3Hz7poZNwydzZGikRj1RWXuQfb8AM6w+BQKCyMglhy2FgK9zdcInrMOqUT/0NAZQkJ25uurJtTefuritrsQIANzfdSo7WDBKJBB8Ne811GISoV4Zmd+3u3bs3Lly4gGbNmsntv3jxInr06MGqTFbJ1Zs3b+S2xWIxbt68iSVLlmD16tWsAqktJP37IuK1kOswCAEAPFNxt6ncxDc4738YffaMU2m5hLDRZLQLpEUS3N96hetQ6gwejwf/yUa1rtWnNiaFA70z8PChlOswCKnThg0bhvnz5yM6OhpdunQBUDLm6tdff8WKFSvw999/yx2rCFbJlZlZ2W5x/fr1g1AoRGBgIKKjo9kUWys8+eM2mvl0pIGyRCPkPstUfZnJqi+TEDZ4PB6chrej5EpJdXFyh9qYFCYmam8vH0Jqi+nTpwMANm/ejM2bN5f7HgClZkNnlVxVxMbGBvHx2j0A+f7WKxDo6aDJaBeuQyEEPAEfDFT8yScNcyEagmEYPPnjNtdh1DpcTe7AMAxO7U1DQkwuWnQsmUpd25O66tDVBUQirqMgpG6TSlXfeswquYqLi5PbZhgGKSkpCA4Ohqurqyri0mgZd1IpuSIagcdXw4MLPQwRDZH4Wxy1WrHA1eQOp/amYf/qksHrpYsBD5hkq/4L11LGxjzk5tKnWUTLWVtzHYHCCgsLoa+vX+1yWE015urqig4dOsDV1VX2/aBBg1BUVIQdO3ZUOyhNZ9mW/lgQzaBjoPpB28YOmj0bJqk7XqthHTdtc6Gg7JqL7u7yvxdqanKHhJhc+e2buRUcqT0YhsHOHXmYNjUTO3fkgWEUT5ZcXGn8NqkD7Oy4jqBSEokEX3/9Nezt7WFsbIzHjx8DAJYsWYKdO3eyKpNVy1ViYqLcNp/PR/369VWS7Wm61p91QeNR7bkOgxAAgHlrG6
RdSKz6QAUZN7ZAz50fq6w8QqqDkdBgfza4mtyhRUdjWYsVALToUDOLD3OpOl0wt24zg/+nmQg/W6S2+AjhXE5O1cdwaPXq1di7dy/Wrl2LKVOmyPa3bdsWISEh8Pf3V7pMVsmVo6Mjm9O0gtPwdtSHnGgMvkC1dbHHllE0DTvRGGrp9loHcDW5g7evDYCSFqsWHYxl29qsOl0w+Xw+ftxgjjat09UQGSEa4pHqlotRh3379mHbtm3o27cvpk6dKtvv4uKCBw8esCqT9YQW4eHhCA8PR3p6epnBYLt27WJbLCFECfXa2yEl8jHXYRCiFiqv389SAd47XbEaNVBd2QQ8Hg8DJtliwCSuI6k5tXF9LULIf54/f15mjSugZKILsVhczhlVY5VcrVixAitXroS7uzsaNGhQp1pyaCp2okkaj2oPiaiYBv0TreQ0sh3SbyQj/fJTrkMhWqw6U9fXxvW1CCH/cXZ2xoULF8r0yvvtt9/QoUMHVmWySq5CQ0OxZ88eTJgwgdVFazOaip1oEloHiGizJ7/fpsSKqF11xk3VxvW1CCH/Wbp0KXx9ffH8+XNIpVIcOXIE8fHx2LdvH44dO8aqTFazBRYVFaFr166sLqgNMu6kch0CIYRovQyaLZDUgPLGTRFCVERXs7vKfvjhhzh69CjOnj0LIyMjLF26FPfv38fRo0fRr18/VmWySq4mT56MgwcPsrqgNqCp2AkhRP0s29GYKKJ+XE1dT0id0KYN1xFUqUePHjhz5gzS09ORn5+Pixcvon///qzLU7hbYGBgoOx7qVSKbdu24ezZs2jfvj1038tK169fzzogTUdTsRNCSM2gMYWkJtC4KUKIKimcXN28eVNu29XVFQBw584duf3aPtEDTcVOCCE1g8YUkppA46YIUaO7d7mOoAwLCwuFn+UzMjKqPug9CidXERERSheuChKJBMuXL8dPP/2E1NRU2NnZYdKkSVi8eLHsB8MwDJYtW4bt27cjMzMT3bp1w5YtW9C8eXNZORkZGfj8889x9OhR8Pl8jBw5Ej/88AOMjbV/kUNCCCFEG1woaIQeBklch0EIURTL6czVKSQkRPb969evsWrVKnh7e8PT0xMAEBUVhVOnTmHJkiWsymc15iorK6vcTC4jIwPZ2dmsAqnImjVrsGXLFmzcuBH379/HmjVrsHbtWmzYsEF2zNq1a/Hjjz8iNDQUV69ehZGREby9vVFY+N/aEz4+Prh79y7OnDmDY8eO4fz58/jf//6n0lgJIYQQQggh7G3atAlOTk7Q19eHh4cHrl27VuGxR44cgbu7O8zNzWFkZARXV1fs37+/0vJ9fX1lr0uXLmHlypX4+eefMXPmTMycORM///wzVq5ciXPnzrGKn1VyNXbsWBw6dKjM/sOHD2Ps2LGsAqnI5cuX8eGHH2Lw4MFwcnLCqFGj0L9/f9kPmmEYhISEYPHixfjwww/Rvn177Nu3Dy9evMCff/4JALh//z7CwsKwY8cOeHh4oHv37tiwYQMOHTqEFy9eKBXPkz9ug2EYld4jIZqC6jchhNQMhmGwf28+12EQolF++eUXBAYGYtmyZYiJiYGLiwu8vb2Rnp5e7vGWlpZYtGgRoqKiEBcXBz8/P/j5+eHUqVMKXe/UqVMYMGBAmf0DBgzA2bNnWd0Dq+Tq6tWr6NOnT5n9vXv3xtWrV1kFUpGuXbsiPDwcCQkJAIBbt27h4sWLGDhwIAAgMTERqamp8PLykp1jZmYGDw8PREVFAShp3jM3N4e7u7vsGC8vL/D5fKXjvb/1ChJ/i6vubRGikah+E0JIzdi1Mx/Bwblch0GIRlm/fj2mTJkCPz8/ODs7IzQ0FIaGhti1a1e5x/fu3RvDhw9H69at0bRpU8yaNQvt27fHxYsXFbpevXr18Ndff5XZ/9dff6FevXqs7oHVIsIikQjFxcVl9ovFYhQUFLAKpCILFixAdnY2WrVqBYFAAIlEgtWrV8PHxwcAkJpasuaUjY
2N3Hk2Njay91JTU2FtbS33vo6ODiwtLWXHvE8kEkEkEsm23+3umHEnlRYRJrUa1W+izSqr34RoivfX11IU1W9SqzRtCqBsPdXT04Oenp7cvqKiIkRHR2PhwoWyfXw+H15eXrIGk8owDIN//vkH8fHxWLNmjULhrVixApMnT0ZkZCQ8PDwAlDQihYWFYfv27QqV8T5WLVedO3fGtm3byuwPDQ2Fm5sbq0AqcvjwYRw4cAAHDx5ETEwM9u7di++++w579+5V6XXeFxQUBDMzM9nLwcFB9h6tc0VqO6rfRJtVVr8J0RTvr6+lKKrfpFYxMQEAODg4yNXboKCgMoe+evUKEomk0gaT8mRlZcHY2BhCoRCDBw/Ghg0bFF4AeNKkSbh06RJMTU1x5MgRHDlyBKamprh48SImTZqk+H2+g1XL1apVq+Dl5YVbt26hb9++AIDw8HBcv34dp0+fZhVIRb788kssWLBANparXbt2ePr0KYKCguDr6wtb25IHwbS0NDRo8N+Ck2lpabLp4m1tbcv01SwuLkZGRobs/PctXLhQbm2v7OxsODg4wLqrI5xGtlPlLRJS4yqq37SOG9EkDMPgyR+3lT6vovpNqsYwDHbtzMeNG2K4u5es+UTLj6jHp/6GEBUySncNpPpNapW3cxskJyfD1NRUtvv9VqvqMDExQWxsLHJzcxEeHo7AwEA0adIEvXv3Vuh8Dw8PHDhwQGXxsEquunXrhqioKHz77bc4fPgwDAwM0L59e+zcuVNu+nNVyM/PB58v38AmEAgglUoBAI0bN4atrS3Cw8NlyVR2djauXr2KadOmAQA8PT2RmZmJ6OhoWcvaP//8A6lUKmsCfF95zZUAkH75KZ78fpu6TZFaraL6Teu4EU2S+FscqzWuKqrfpGq7duZjxfIcAMDxYyUz7vpPNuIyJK3F4/EwwddQ6eSK6jepVd42bpiamsolV+WxsrKCQCBAWlqa3P60tLQKG0OAkq6DzZo1A1CyDu/9+/cRFBSkcHKlaqy6BQIlwR84cAB3797FjRs3sGvXrjKJVXBwMDIzM6sV4NChQ7F69WocP34cT548wR9//IH169dj+PDhAEp+Oc2ePRurVq3C33//jdu3b2PixImws7PDRx99BABo3bo1BgwYgClTpuDatWu4dOkSAgICMHbsWNjZ2SkdU8adipsmCSGEqEbG7RSuQ6hz3h8HFB2teWvUEEK0k1AohJubG8LDw2X7pFIpwsPDZWtQKUIqlcqNS6xprFquFPXNN9/g448/hrm5OesyNmzYgCVLlmD69OlIT0+HnZ0dPvvsMyxdulR2zLx585CXl4f//e9/yMzMRPfu3REWFgZ9fX3ZMQcOHEBAQAD69u0rW0T4xx9/ZBUTjUkhhBD1s2zXAC8iHnEdRp3i7q4ra7ECADc3duOCCCGEjcDAQPj6+sLd3R2dO3dGSEgI8vLy4OfnBwCYOHEi7O3tZWO2goKC4O7ujqZNm0IkEuHEiRPYv38/tmzZwtk9qDW5UsV6OSYmJggJCZFbTfl9PB4PK1euxMqVKys8xtLSEgcPHqx2PDQmhRBCakbjUe0hERWz6hpI2PnU3xBASYuVm5uubJsQQmrCmDFj8PLlSyxduhSpqalwdXVFWFiYbJKLpKQkueFCeXl5mD59Op49ewYDAwO0atUKP/30E8aMGcPVLag3udJGNCaFEEJqBo/Hg9PwdpRc1SAejwf/yUbwn8x1JIQQrWBpqfQpAQEBCAgIKPe9yMhIue1Vq1Zh1apVbCJTG0qulPTkj9to5tOREiyilah+E0KI4mh2RUKq0KgR1xGUMWLECIWPPXLkiNLlU3KlpPtbr0Cgp0OzBRK1up9qU/VBb0nyC6s+SNHrUv0mhBCF0eyKhFShoIDrCMowMzNTa/mUXLGQcSeVHj6hXAJAag9Nqt9Ux4gqPzwgRNXKm12RulQS8o74eK4jKGP37t1qLV+tyVWPHj1gYGCgzktwoqhRc3roI1
qL6jchhCiGZlckhLyPVXK1Z88eTJo0qcz+4uJiLFmyRDY94okTJ6oVnCayHO8F80FduA6DELWg+k0IIYqj2RUJqf1+++03HD58GElJSSgqKpJ7LyYmRunyWC0iPHPmTIwePRpv3ryR7YuPj4eHhwd+/vlnNkXWGubenWmwKtFaVL8JIURxpbMrbt5iDv/JRvT7k5Ba5scff4Sfnx9sbGxw8+ZNdO7cGfXq1cPjx48xcOBAVmWySq5u3ryJZ8+eoV27djhz5gw2bdqEjh07olWrVrh16xarQAghhBBCCCFaRMM/cNi8eTO2bduGDRs2QCgUYt68eThz5gxmzpyJrKwsVmWy6hbYtGlTXLp0CbNnz8aAAQMgEAiwd+9ejBs3jlUQhBBCCCGEEC3johkTZFUkKSkJXbt2BQAYGBggJ6dk9s8JEyagS5cu2Lhxo9Jlsmq5AoDjx4/j0KFD8PT0hLm5OXbu3IkXL16wLY4QQgghhBBCaoytrS0yMjIAAI0aNcKVKyWL1icmJoJhGFZlskquPvvsM4wePRrz58/HhQsXEBcXB6FQiHbt2uHw4cOsAiGEEEIIIYRoEQ2civ1dH3zwAf7++28AgJ+fH+bMmYN+/fphzJgxGD58OKsyWXULvHTpEq5evQqXt019tra2OHHiBDZt2oRPP/0UH3/8MatgCCGEEKLZLhQ04joEQkhtoYGLCL9r27ZtkEqlAIAZM2agXr16uHz5MoYNG4bPPvuMVZmskqvo6Gjo6emV2T9jxgx4eXmxCoQQQgghhBBCagqfzwef/19HvrFjx2Ls2LHVKpNVclVeYlWqZcuWrIMhhBBCCCGEEHWJi4tD27ZtwefzERcXV+mx7du3V7p8VskVoPoFt2qLzFPXYPlRD1rLgmglqt+EkNqIYRjs2pmPGzfEcHcvWcyXfo8RQsrj6uqK1NRUWFtbw9XVFTwer9zJK3g8HiQSidLls0qufvzxRyxatAiTJk3CX3/9BT8/Pzx69AjXr1/HjBkz2BRZa2QcPAu+UBcWgz25DoUQMAyDzFPXVFZeSf3WgcXgriork5C6iGEYnNqbhoSYXLToaAxvXxt62FejXTvzsWJ5yRTKx48VAgD8JxtxGRIhBAAcHbmOoIzExETUr19f9r2qsZotUB0LbtUmhfHJXIdACAAg88QVZBw8q9IysyNjVVoeIXXRqb1p2L86CVdPZmD/6iSc2pvGdUha7cYNsdx2dLS4giMJITXKwoLrCMpwdHSUfdj19OlT2Nvbw9HRUe5lb2+Pp0+fsiqfVcuVOhbcqk30WzpwHQLhSPFzzfoktCCW1pYjRBMlxOTKb9/MxYBJ3MRSF7i768parADAzU2Xw2gIITLp6VxHUKk+ffogJSUF1tbWcvuzsrLQp0+fmusWWLrglqOjo2zBLRcXl2otuFVbWI73gvmgLgA070Gb1D16jZ2QF3tLpWUauXamuk00hrRQwHUIrLToaIyrJzP+2+5gzGE02u9Tf0MAJS1Wbm66sm1CCMdeaPaHwAzDlNtl+/Xr1zAyYvcsxCq5Kl1wq0OHDrIFt3777TfcuHEDI0aMYBVIbWHSvg8kL/S5DoMQAIBpzx5gxGK8OXZCJeVZDBkEs149VVIWIbWOdWHVxyjI29cGQEmLVYsOxrJtoh48Hg/+k43gP5nrSAghtUFpvsLj8TBp0iS5mdAlEgni4uJkvfSUxSq5en/BLSsrK1y6dAnDhg3D1KlTWQVCCFEej8eDafduKkuuTLt3o0H3hKgAj8fDgEm21BVQDS4UNEIPgySuwyCE1GJmZmYASlquTExMYGBgIHtPKBSiS5cumDJlCquyWSVXfD4fRUVFiImJQXp6OgwMDGSLB4eFhWHo0KGsgiGEEEIIIcobueUyXJo0QBs7U7SxM0PrBiYw0aexZ4SUZ/fu3bKhTBs2bICxseq6brNKrsLCwjBhwgS8fv26zHts54RXRHBwMBYuXIhZs2YhJCQEAFBYWIi5c+fi0K
FDEIlE8Pb2xubNm2Fj818XjKSkJEybNg0REREwNjaGr68vgoKCoKPDepkvQgghhBCNEZ+ag4dvJPgt+r99jSwN4dzAFM52pmjdwBStbE3Q0MKAeiiQmmNqynUEFWIYBgcOHMBXX32F5s2bq6xcVlOxf/755/j444+RkpICqVQq91JXYnX9+nVs3bq1zErJc+bMwdGjR/Hrr7/i3LlzePHihdy4L4lEgsGDB6OoqAiXL1/G3r17sWfPHixdulQtcRJCCCGE1LSQsa6Y2bc5vFpbo4FZydjwpIx8hN1NxfozCZiy7wZ6rI1Au+WnMXLLZXz1x23svfwEUY9eIyOviOPoidZq0kTpUzZt2gQnJyfo6+vDw8MD165VvJ7n9u3b0aNHD1hYWMDCwgJeXl6VHv8uPp+P5s2bl9tYVB2smm7S0tIQGBgo1zqkTrm5ufDx8cH27duxatUq2f6srCzs3LkTBw8exAcffACgpJmvdevWuHLlCrp06YLTp0/j3r17OHv2LGxsbODq6oqvv/4a8+fPx/LlyyEUCmvkHgghhBCiOS4UNKrymNo0tsurtQ1M32kleJNXhPsp2biXko27L7JxPyUbj17mIldUjOinbxD99I3c+VbGemhmbYRm1sZoVt8YzaxN0MzaGDametTSRdgTK7fm3C+//ILAwECEhobCw8MDISEh8Pb2Rnx8fJnp0gEgMjIS48aNQ9euXaGvr481a9agf//+uHv3Luzt7au8XnBwML788kts2bIFbdu2VSrWirBKrkaNGoXIyEg0bdpUJUFUZcaMGRg8eDC8vLzkkqvo6GiIxWLZeC8AaNWqFRo1aoSoqCh06dIFUVFRaNeunVwi6O3tjWnTpuHu3bvo0KFDjdwDIYQQQjSDIolVbWdhJETXZlbo2sxKtk8skSLxVR7up2QjIS0H8am5SEjLQVJGPl7livAqV4QrjzPkyjESCtCkvjGa1DdCY6uSl2M9IzjVM4S5IX1ATapw965Sh69fvx5TpkyBn58fACA0NBTHjx/Hrl27sGDBgjLHHzhwQG57x44d+P333xEeHo6JEydWeb2JEyciPz8fLi4uEAqFchNbAEBGRkYFZ1aMVXK1ceNGjB49GhcuXEC7du2gqys/YHLmzJlsii3XoUOHEBMTg+vXr5d5LzU1FUKhEObm5nL7bWxskJqaKjvm/Ra20u3SY8ojEokgEolk29nZ2WxvgRCNQ/WbaDOq30RVNHFmwurUb10BHy1sTNDCxkRuf56oGP+m55a8XubKvk/KyEdekQS3n2fh9vOsMuWZGeiikaUhGlkawsHSEA6WBmhoYQh7c33YmRvAUEhj20mJ9+upnp6e3PTnAFBUVITo6GgsXLhQto/P58PLywtRUVEKXSc/Px9isRiWlpYKHV86h4Mqsar1P//8M06fPg19fX1ERkbKNRfzeDyVJVfJycmYNWsWzpw5A339ml1bKigoCCtWrKjRaxJSU6h+E21G9Vt96kKLj6ZTR/020tOBi4M5XBzM5fYXFUuRlJGHRy/z8OhlLp6+yseT13l4+jofqdmFqGlFbAAAIk5JREFUyCoQV5h4AYCFoS7szA3QwEwftmb6aGBmAFtTfVib6sHaRB/WJnowN9Slbod1gIODg9z2smXLsHz5crl9r169gkQiKbdR5MGDBwpdZ/78+bCzs5Pr1VYZX19fhY5TBqvkatGiRVixYgUWLFgAPp/VnBgKiY6ORnp6Ojp27CjbJ5FIcP78eWzcuBGnTp1CUVERMjMz5Vqv0tLSYGtrCwCwtbUtM7AtLS1N9l5FFi5ciMDAQNl2dnZ2mYpB1M/wOf3CrYpEpPzPiOq35qO6X4LqN1G12p4g1mT9Furw3469MinzXkGRBE8z8pCcUYCkjHwkZ+QjKSMfLzIL8PxNAXJExXiTL8abfDHuvqi4dU0o4KOesbDkZaQHK2M9WBrpwtxQCAtDISwMS743M9CFmaEuzAx0YSQUUEJWyyQnJ8uNCX
y/1UoVgoODcejQIURGRrJqlCksLERRkfzkLqYsZjtklVwVFRVhzJgxak2sAKBv3764ffu23D4/Pz+0atUK8+fPh4ODA3R1dREeHo6RI0cCAOLj45GUlARPT08AgKenJ1avXo309HTZQLgzZ87A1NQUzs7OFV67vOZKADB4wYNAj/5Dk9qN6jfRZhXVb0K0gabUbwOhAK1sTdHKtvyHz6wCMZ6/KUBqdgFSsgqRklmIlKxCpGUXIj2nEOk5ImTmi1EkkZa8n1Wo8LUFfB6M9XRgrKcDE/2Sl6GwZNtQKIDR268GugIYCAXQ1y198aGv89/3Qp23L4H897qyF4+SOBUxNTWtMlGxsrKCQCCQNYKUerfRpCLfffcdgoODcfbs2TIzi1cmLy8P8+fPx+HDh8udNZDNLOiskitfX1/88ssv+Oqrr9icrjATE5MyM3cYGRmhXr16sv3+/v4IDAyEpaUlTE1N8fnnn8PT0xNdunQBAPTv3x/Ozs6YMGEC1q5di9TUVCxevBgzZszQiF9OhBBCCCHaxsygpJXJ2a7iB2pRsQSvcovwOleE17lFePl2Uo3MfDHe5BXhTX4R3uSLkZlfhKyCYmQVFEEsYSCRMsgqECOrQLmZ6NjQFfCgw+dDR8CDroAPHf7brwKe7HsBnwedt+/p8HnQEfAg4JdsC97u++/r2/0CHnRLtwXlH6f7zn6dt9fRfVu27ttjS2MR8HkQCviyOEqTw3cTRR1BafJYcnyNJI7t2il8qFAohJubG8LDw/HRRx8BAKRSKcLDwxEQEFDheWvXrsXq1atx6tQpuLu7KxXevHnzEBERgS1btmDChAnYtGkTnj9/jq1btyI4OFipskqxSq4kEgnWrl2LU6dOoX379mUmtFi/fj2rYNj4/vvvwefzMXLkSLlFhEsJBAIcO3YM06ZNg6enJ4yMjODr64uVK1fWWIyEqAvDMMiIvcR1GISoBdXv2oVhGJzam4aEmFy06GgMb18bjfvUn02XQE2c1EJb6OkIYG9uAHtzg6oPRkkdKxBLkF1QjFyRGDmFxcgVFSOnsBh5omLkF0mQKyr5vkAsQaFYgoIiCQrEEhSIpSgUSyAqlkL09r2iYimKJFKIxFKIJFIUFUvLXFMsYSCWSAD153E1iscrmeBE1mon4ENXhydrudPT+a81T09HAKGADz3d//br6QigV/r17X59XUGZr+KCXKXiCgwMhK+vL9zd3dG5c2eEhIQgLy9PNnvgxIkTYW9vj6CgIADAmjVrsHTpUhw8eBBOTk6yyeqMjY1hbGxc5fWOHj2Kffv2oXfv3vDz80OPHj3QrFkzODo64sCBA/Dx8VHyJ8syubp9+7ZsCvM7d+7IvafuX6SRkZFy2/r6+ti0aRM2bdpU4TmOjo44ceKEWuMihAsZMReQfv4412EQohZUv2uXU3vTsH91SRJy9WTJ9MUDJlXelYcQZfB4PBgKdd7OQqj6ic4YpqRVrEgihbi45GuxtOR7sVQKsUSKYgmDYikDiVQqa0UTS6Rvv5ZsF0ulb78yKJYwkDAMJBJpybb07TGSt2VIGUjf3S999xqM7JoShkFxaRmSkuPEEvnjiyXy+8Rvt8Vvz5O/15IJS4qKpYCogh+ICjRM/Vep48eMGYOXL19i6dKlSE1NhaurK8LCwmSTXCQlJckNS9qyZQuKioowatQouXLKmzCjPBkZGWjydqFjU1NT2dTr3bt3x7Rp05SKvRSr5CoiIoLVxbRBRuwlWHX+QOM+jSN1U/7zRK5DIERt8p/V3vpdG1pxVC0hRv4T6oSbuRgwiZtYCGGDxyvp0qcj4ANatoQXwzCyREv8tpWu6J3kq3S7NOF6d19py55ILPlv++1xhcWSt9sSFL79+u62TSYDZfsfBAQEVNgN8P1GlidPnrD6eZRq0qQJEhMT0ahRI7Rq1QqHDx9G586dcfTo0TJLPSmKFiBQUvr54+Dr6KKeW0+uQyEEhvaNkR1/i+swCFELhinbRa
e2qIutOC06GsvuFQBadKi6S05d8363xPwCCYB0boIhdQqPx4NQhwehjnono3tf9nlg/9YavaRS/Pz8cOvWLfTq1QsLFizA0KFDsXHjRojFYtbDnCi5YiH/+RNKrohGsOzYA9JiMXWdItqpFrf0aGMrTlXjlbx9S7rtJNzMRYsOxrJtbUDjrgjRTnPmzJF97+XlhQcPHiA6OhrNmjVTatbBd1FyxYKhvRPXIRACoOSTKEvXbpRcEa1k1LAJchLiuA6DlbrYisPj8TBgkm2tTyLVpbavr0WINpFKpfj222/x999/o6ioCH379sWyZcvg6OgIR0fHapVNyZWSrHsOhmXHHlyHUasYP6+9XXtqg2Jx3f35Ut3SbkY23aDrUoTkW7VvQiJNb8WhB/2aRT9vUmc1bMh1BOVavXo1li9fDi8vLxgYGOCHH35Aeno6du3aVe2yKblSkoONJ3ReMACYKo8lpLYxSpFCR5cSFqIZeDwerFt0rZXJFbXiaJ7qJjjUNZAQFqysuI6gXPv27cPmzZvx2WefAQDOnj2LwYMHY8eOHXKzEbJRs6PaCCGEEMIpakWpWfTzJnVaRkbVx3AgKSkJgwYNkm17eXmBx+PhxYsX1S6bkitCCCGkFqOHd0KIxkrSzNbe4uJi6OvLr5Wmq6sLsbj6q0VTt0BCCCGEEAVQIkuIdmAYBpMmTYKenp5sX2FhIaZOnQojIyPZviNHjihdNiVXhBBCCCGEkDrD19e3zL5PPvlEJWVTckUIIYTUEdTyQgghwO7du9VWNiVXhBBCCEcuZjdndV5304cqjoQQQtTgnS52dQUlV6QMk8QCrkMgSiguLlRZWSZPCqCjQ8sMEM2hyvqtadgmVoQQUms0r3u/5yi5UhI9fBJCCCGEEELKQ1OxE0IIITWMWq0IIXVCbCzXEdQ4Sq4IIYQQQgghRAUouSKEEELqgLo8U2BdvndCSM2i5IoQQgipQdQlkBBCtBdNaEEIIYSQWoVaogghmoqSK6LVdB495zoE9ZMWcR0BISpR7v9Xqt+EEFJ7tWrFdQQ1jpIrJekkpkCHL+Q6DELUguo30WoNbQGBntyuXFv6M0gIIWqjr891BDWOxlwRQgghNYTGWxFC6pQnT7iOoMZRckUIIYQQQghRvcxMriOocZRcEUIIIYQQQogKUGdzJUVl/IGuFqMgEAi4DoUQiEQiRGTtVll5Z1/txgfmfhAKadwV4V5xcTHOZfyssvKSXt1AY+uu4PF4Ch3PMAyyw6IgSngKvRaOMB3gqfC51cUwDG7/HI/UWy9h61If7ca1VCruXTvzceOGGO7uuvjU31Cpc0/tTUNCTC5adDSGt69Njd4z22vX1nPPHkxT6FhCSO1ByZWCGIYBAORJ3+Dym9/gaTmc44gIgVxiVVpH2fjvXAb/ZO6Gl5VfNSMjpPrOZfwMMQoAqKZ+P0z9BwDQyMpd9p6kqBASESAtZID8Qrnzss9cxZtfTgMA8q7eAVMkhmk/jzLlF+eJAABFQjH+yXWqIhqxQjHf+TUBVzfGAgAenU1CsUiCtqNbyN7P50nkjs8plsq+3783H8HBuQCA48cKISpkMMHXEPkF8ueU5+zBNPzy3TMAwNWTGSgSSeE13kahmKurOteu7eeqon5nZ2ezLoMQdcnOywNQvTpe2/CYunS31fDs2TM4ODhwHQYhFUpOTkbDhg1ZnUv1m2g6qt9Em1H9JtquOnW8tqHkSkFSqRQvXryAiYlJjXWRIEQRDMMgJycHdnZ24PPZDaOk+k00Fdf1Ozs7Gw4ODkhOToapqSmr67PF1bXpnmvuulzXb0LUTRV1vLah5IoQQgipQHZ2NszMzJCVlcVJosHFtemea/aeCSHapW6kkIQQQgghhBCiZpRcEUIIIYQQQogKUHJFCCGEVEBPTw/Lli2Dnp5enbk23TMhhLBHY65IlRiGwWeffYbffvsNb968wc2bN+Hq6qrWa/J4PPzxxx/46KOP1HodQrji5OSE2b
NnY/bs2RpZHiGEEEKURy1XpEphYWHYs2cPjh07hpSUFGRnZ2Po0KGws7MDj8fDn3/+We559+/fx7Bhw2BmZgYjIyN06tQJSUlJNRs8IYQQQgghNYSSK1KlR48eoUGDBujatStsbW2Rl5cHFxcXbNq0qdJzunfvjlatWiEyMhJxcXFYsmQJ9PX1azByQgghhBBCag4lV6RSkyZNwueff46kpCTweDw4OTlh4MCBWLVqFYYPH17heYsWLcKgQYOwdu1adOjQAU2bNsWwYcNgbW3NKo5ly5ahQYMGiIuLw1dffQUPD48yx7i4uGDlypWsyie1n5OTE0JCQuT2ubq6Yvny5WAYBsuXL0ejRo2gp6cHOzs7zJw5U3acSCTCF198AXt7exgZGcHDwwORkZEKXXfPnj0wNzfHsWPH0LJlSxgaGmLUqFHIz8/H3r174eTkBAsLC8ycORMSiaTCcpKSkvDhhx/C2NgYpqam+Pjjj5GWliZ3zNGjR9GpUyfo6+vDysqq0v+DO3bsgLm5OcLDwxW6D0IIIYRUHyVXpFI//PADVq5ciYYNGyIlJQXXr1+v8hypVIrjx4+jRYsW8Pb2hrW1NTw8PCrsPlgZhmHw+eefY9++fbhw4QLat28PHx8fXLt2DY8ePZIdd/fuXcTFxWH8+PFKX4Nov99//x3ff/89tm7diocPH+LPP/9Eu3btZO8HBAQgKioKhw4dQlxcHEaPHo0BAwbg4cOHCpWfn5+PH3/8EYcOHUJYWBgiIyMxfPhwnDhxAidOnMD+/fuxdetW/Pbbb+WeL5VK8eGHHyIjIwPnzp3DmTNn8PjxY4wZM0Z2zPHjxzF8+HAMGjQIN2/eRHh4ODp37lxueWvXrsWCBQtw+vRp9O3bV4mfFCGEEEKqQ4frAIhmMzMzg4mJCQQCAWxtbRU6Jz09Hbm5uQgODsaqVauwZs0ahIWFYcSIEYiIiECvXr0UKqe4uBiffPIJbt68iYsXL8Le3h4A0KZNG7i4uODgwYNYsmQJAODAgQPw8PBAs2bN2N0o0WpJSUmwtbWFl5cXdHV10ahRI1likpSUhN27dyMpKQl2dnYAgC+++AJhYWHYvXs3vvnmmyrLF4vF2LJlC5o2bQoAGDVqFPbv34+0tDQYGxvD2dkZffr0QUREhFzCVCo8PBy3b99GYmIiHBwcAAD79u1DmzZtcP36dXTq1AmrV6/G2LFjsWLFCtl5Li4uZcqaP38+9u/fj3PnzqFNmzbK/7AIIYQQwhq1XBGVk0qlAIAPP/wQc+bMgaurKxYsWIAhQ4YgNDRU4XLmzJmDq1ev4vz587LEqpSPjw8OHjwIoKR16+eff4aPj4/qboJoldGjR6OgoABNmjTBlClT8Mcff6C4uBgAcPv2bUgkErRo0QLGxsay17lz5+RaRytjaGgoS6wAwMbGBk5OTjA2Npbbl56eXu759+/fh4ODgyyxAgBnZ2eYm5vj/v37AIDY2NgqW6HWrVuH7du34+LFi5RYqUhNT6ibkpKCe/fu1eg1S+Xn56OoqIiTaxNCiLag5IqonJWVFXR0dODs7Cy3v3Xr1krNFtivXz88f/4cp06dKvPeuHHjEB8fj5iYGFy+fBnJycnltgiQuoPP55d5EBaLxQAABwcHxMfHY/PmzTAwMMD06dPRs2dPiMVi5ObmQiAQIDo6GrGxsbLX/fv38cMPPyh0bV1dXbltHo9X7r7SDx7YMDAwqPKYHj16QCKR4PDhw6yvQ4C8vDzk5OQgOzsbPB6vxq77/PlztGvXDosXL8aNGzdq7LoAcOfOHXz88ce4cuUKRCJRjV332bNnOHz4MI4cOYLbt2/X2HUVQSvVEELYoG6BROWEQiE6deqE+Ph4uf0JCQlwdHRUuJxhw4Zh6NChGD9+PAQCAcaOHSt7r2HDhujVqxcOHDiAgoIC9OvXj/VkGUQ71K9fHykpKbLt7OxsJCYmyrYNDAwwdOhQDB06FDNmzECrVq1w+/ZtdOjQAR
KJBOnp6ejRowcXoaN169ZITk5GcnKyrPXq3r17yMzMlH1I0b59e4SHh8PPz6/Ccjp37oyAgAAMGDAAOjo6+OKLL2okfm1y7949zJkzBy9fvkRaWhrWrl0LHx8fMAyj9kTr4cOHyMrKQlZWFjZs2IBZs2ahY8eOAKDW69+9exc9evTAmDFj0Lhx4xpbSPf27dsYOnQo6tevj+TkZHTu3Bnff/+9XCtwTUhISMDOnTuRnp4OV1dXDBo0CM2bNwePx6uRf3dCiHah5IooLTc3F//++69sOzExEbGxsbC0tESjRo0AAF9++SXGjBmDnj17ok+fPggLC8PRo0cVnoGt1PDhw7F//35MmDABOjo6GDVqlOw9Hx8fLFu2DEVFRfj+++9Vcm+k9vrggw+wZ88eDB06FObm5li6dCkEAgGAkhn9JBIJPDw8YGhoiJ9++gkGBgZwdHREvXr14OPjg4kTJ2LdunXo0KEDXr58ifDwcLRv3x6DBw9We+xeXl5o164dfHx8EBISguLiYkyfPh29evWCu7s7gJIZM/v27YumTZti7NixKC4uxokTJzB//ny5srp27YoTJ05g4P/bu/ugqMo2DODXWeQjlkVSkA8lFkYsAxa2IMUxdaQgzI/UEjCJYKQZ1BhA0HESYRrNJMqwpoLQlYRGM8WU0EJELRsTUDNZ84MZYRwTJyBTEoHd5/3Dlx03s6TOQsD1m9k/zlnPcz9n14G9OM/eJzISQ4YM4U2Fe0Cv12PSpEl46aWXEBwcjNraWsTHx8PPz8/iN04HbgfoadOm4dlnn0V+fj7eeecdrFixAn5+fhb7kN/W1oa0tDTExMTggw8+AAD89NNPaG9vN/uZLreGhgZERkYiNjYWK1euxOHDh5GQkIDm5uZeDVd6vR4TJkxAaGgolEolsrKyUFZWhqioKCxcuJABi4h6ThD9jfXr1wsvLy/TdlVVlQBw1yMuLs7suI0bN4rRo0cLOzs7ERgYKHbt2nXfNQGI0tJS0/a2bduEnZ2d2LFjh2lfa2ursLW1Ffb29uL69ev/9PRogLh27ZqIiooSjo6OwtPTU2zevFkEBgaKrKwsUVpaKsaNGyccHR2FUqkU48ePF/v37zcd29HRIVatWiXUarWwtrYW7u7uYvbs2eLUqVN/W1en04mhQ4ea7cvKyhKBgYFm++Li4sSsWbNM215eXmL9+vWm7YaGBjFz5kyhVCqFSqUSL7zwgrhy5YrZGDt27BBBQUHCxsZGODs7izlz5txzvEOHDgmlUik2bNjwt+dAQjQ3N4vw8HCRnJxstn/KlCni1VdfFUIIYTQaLVa/q6tLXL16VYwZM0ZcunRJ7Ny5U4SEhIjExEQxYcIEMXfuXIvUbW9vFxMnThTHjx8XXV1dIiIiQoSEhAiVSiXGjx8vCgsLLVI3Pz9fTJkyxew1nTZtmsjPzxdFRUXiwIEDFql7p1u3bokFCxaIxMRE077z58+LqKgoMX78eJGXl2fxORDRwCMJwUXFREQ0uDU1NWHmzJnIzc3Fk08+CaPRCIVCgYSEBHR0dKC4uNii9cX/r44sWLAAsbGxiIiIQHl5OeLi4nDr1i1s2LABL7/8sux1m5qaEBAQgJKSEuzduxd6vR45OTm4fPkyDhw4gC1btuC9994zWzUgh/z8fOTk5ODzzz+HVqvFmjVrkJmZibCwMFy7dg0NDQ1Yt26dRc75TuHh4fD29kZ+fr7pPWhsbERWVhbq6+uRkZGBGTNmWHQORDSwsKEFERENeq6uriguLjZ97677hs8jR46EQmH+q/LGjRuy1+9edmZlZWVaPr1z504YDAZ4enrim2++wbFjx2SvO2LECISFhWH37t04f/48UlNTodFo8MwzzyA5ORlPPfUUKisrYTAYZG3wEB4eDjc3N8ybNw/PP/88MjMzUVpaiq+//hplZWWIjo5GUVERmpubLdJYwmAwoLOzE6NGjUJLS4upiYfRaMRDDz2EzMxMdHV1oaSkRPbaRD
SwMVxRryspKTFreX3ng+2j6b8mMjLynv9f7+ceWNR/+Pr6Arj9Abu726MQwqyF/tq1a1FQUGBq5S+X7gAxdepU2NraYtGiRSgvL0dtbS1Wr16NQ4cOQafTob29Xda6kiRh6dKl0Ol0+PLLL81asY8aNQqurq7Q6/VQKBSyfu/I29sbxcXFWLNmDfz9/TF37lzMmjULkiRhxIgR8PDwQGtrK5RKpax1u0OzlZUVrK2tERcXh9LSUuTn50OSJCgUChgMBvj4+GDt2rXYvn076urqZKtPRAMfG1pQr5s5cybGjRv3p8/9sX01UV8rLCzEzZs3//S5YcOG9fJsqDd0t/Xv/lDffeVq1apVWL16NU6cOIEhQ+T99dldy9vbG/Hx8XB1dUVZWRm8vb3h7e0NSZIQGBgIOzs7WesCQHBwMPbu3YvJkyejoKAAPj4+pj90dXZ2YsyYMejq6pL953P3uRUWFqKmpgYdHR2wsbEBcHu5olqtNoUhOZw7dw579uzB/Pnz4e7uDgCYPHky1q1bh9TUVNjb22PhwoWmRjgqlQoPP/wwlEqlbHMgooGP4Yp6nUqlgkql6utpEN2XP97AmgaH7nA1ZMgQeHp6Ijc3Fzk5OaipqUFgYKDF6oaGhqKwsBDBwcHQaDSmeTz33HMWqwncvkfawYMHERMTg4SEBAQEBKCjowO7d+/Gt99+a9E/fE2YMAHp6enIy8uDm5sbTp8+DZ1Oh8OHD8sWbC5cuIDQ0FC0traiubkZaWlpcHZ2BgAkJSWhra0Nr7zyChoaGjBnzhx4eXlh+/bt6OzsZLgioh5hQwsiIqJ76G604OjoiP3795ta41tSdzONvnD27FkUFxfj6NGj8PX1xaJFi+Dv72/xulVVVUhMTIRCocDIkSORl5cHjUYjy9htbW1ITk6G0WhESEgIlixZgvT0dGRkZMDFxQXA7de8uLgYy5cvh5WVFVQqFX777Tfs2bPHdK8xIqL7wXBFRER0DzU1NXjiiSdw+vRp0w2dBwOj0QgAvRryWlpa0NnZCVtbWzg5Ock27s2bN6HT6TB8+HBERUXhs88+Q3R09F0BCwAuXryIxsZG/P777wgICOCVayLqMYYrIiKiv9DW1salYf3cH9/Dbdu2ISYmBkuXLsXy5cvh7OyMrq4uXL582WI3TiaiwYHfuSIiIvoLDFb9X/d7aDAYoFAoEBUVBSEE5s+fD0mSkJKSgtzcXDQ0NOCTTz6Bvb29rF0KiWjw4JUrIiIiGjSEEBBCQKFQYNu2bYiNjYWPjw/q6+tRXV2NoKCgvp4iEfVjDFdEREQ0qHR/9JEkCWFhYTh58iQOHjyIgICAPp4ZEfV3XBZIREREg4okSTAYDMjIyEBVVRVOnjzJYEVEsuibXq9EREREfczPzw/Hjx+Xre07ERGXBRIREdGg1H2TZiIiufDKFREREQ1KDFZEJDeGKyIiIiIiIhkwXBEREREREcmA4YqIiIiIiEgGDFdEREREREQyYLgiIiIiIiKSAcMVERERERGRDBiuiIiIBrHs7GwEBQX19TRMLl68CEmScPLkyb6eChFRjzFcERERDRKSJGHXrl19PQ0iogGL4YqIiIh6pKOjo6+nQET0n8RwRURE1I9MmTIFycnJWLZsGYYNGwY3NzdkZ2f/7XFqtRoAMHv2bEiSZNrutmXLFqjVagwdOhTR0dG4fv26Wc0lS5YgJSUFzs7OiIiIAACcPn0akZGRcHBwgKurK2JjY/HLL7+Yjtu3bx8mTpwIJycnDB8+HNOnT0d9fb1Z3WPHjkGr1cLOzg7BwcE4ceKE2fOtra148cUX4eLiggceeAC+vr7Q6XQ9eMWIiHoPwxUREVE/U1RUBKVSie+//x45OTl4/fXXUVFR8ZfHVFdXAwB0Oh1+/vln0zYA1NfXY9euXSgrK0NZWRkOHTqEN998866aNjY2OHLkCD766CP8+uuvmDp1KrRaLWpqarBv3z40NTVh3rx5pmPa2tqQlpaGmp
oaVFZWQqFQYPbs2TAajQCAGzduYPr06Xj00UdRW1uL7OxspKenm9XNzMyEXq/H3r17cebMGXz44Ydwdnb+V68fEZGlDOnrCRAREVHPaDQaZGVlAQB8fX3x/vvvo7KyEk8//fQ9j3FxcQEAODk5wc3Nzew5o9GIzZs3Q6VSAQBiY2NRWVmJNWvWmP6Nr68vcnJyTNurV6+GVqvFG2+8Ydq3adMmeHp64ty5cxgzZgzmzp1rVmfTpk1wcXGBXq+Hv78/Pv30UxiNRmzcuBF2dnbw8/PDpUuXkJSUZDqmsbERWq0WwcHBAHDXFTciov8SXrkiIiLqZzQajdm2u7s7rl69+o/HU6vVpmB1r/Eef/xxs+0ffvgBVVVVcHBwMD0eeeQRADAt/Tt//jxiYmLg4+MDR0dHUzBqbGwEAJw5cwYajQZ2dnamcUNDQ83qJCUlYevWrQgKCsKyZcvw3Xff/ePzJCKyNF65IiIi6mesra3NtiVJMi21s9R4SqXSbPvGjRuYMWMG1q1bd9d47u7uAIAZM2bAy8sLH3/8MTw8PGA0GuHv79+jhhiRkZFoaGhAeXk5KioqEBYWhsWLFyM3N/e+xyAi6i28ckVERDRIWFtbw2AwyDLWY489hrq6OqjVaowePdrsoVQq0dzcjLNnz2LlypUICwvD2LFj0draajbG2LFjcerUKbS3t5v2HT169K5aLi4uiIuLQ3FxMd59910UFBTIcg5ERHJjuCIiIhok1Go1KisrceXKlbuCTk8tXrwYLS0tiImJQXV1Nerr6/HVV18hPj4eBoMBDz74IIYPH46CggJcuHABBw4cQFpamtkY8+fPhyRJSExMhF6vR3l5+V1XpFatWoUvvvgCFy5cQF1dHcrKyjB27Nh/NXciIkthuCIiIhok3n77bVRUVMDT0xNarfZfjeXh4YEjR47AYDAgPDwcAQEBSElJgZOTExQKBRQKBbZu3Yra2lr4+/sjNTUVb731ltkYDg4O2LNnD3788UdotVq89tprdy0ztLGxwYoVK6DRaDBp0iRYWVlh69at/2ruRESWIgkhRF9PgoiIiIiIqL/jlSsiIiIiIiIZMFwRERENACUlJWZt0e98+Pn59fX0iIgGBS4LJCIiGgCuX7+OpqamP33O2toaXl5evTwjIqLBh+GKiIiIiIhIBlwWSEREREREJAOGKyIiIiIiIhkwXBEREREREcmA4YqIiIiIiEgGDFdEREREREQyYLgiIiIiIiKSAcMVERERERGRDBiuiIiIiIiIZPA/gcFmmEnCPq4AAAAASUVORK5CYII=\",\n      \"text/plain\": [\n       \"<Figure size 800x800 with 16 Axes>\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"import matplotlib.pyplot as plt\\n\",\n    \"from skopt.plots import plot_objective\\n\",\n    \"\\n\",\n    \"plot_objective(res)\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"          fun: 0.10675194263458251\\n\",\n       \"            x: [True, True, 6, 2048]\\n\",\n       \"    func_vals: [ 1.373e-01  1.390e-01 ...  
1.127e-01  1.138e-01]\\n\",\n       \"      x_iters: [[True, True, 5, 1300], [False, True, 5, 990], [True, True, 7, 1800], [False, False, 10, 1692], [False, True, 6, 1075], [True, False, 3, 291], [False, True, 3, 514], [False, False, 11, 1569], [False, False, 7, 1915], [False, True, 10, 1514], [False, False, 11, 1527], [False, False, 12, 2033], [False, True, 9, 3], [False, True, 1, 2004], [True, True, 12, 1], [False, False, 6, 2048], [False, False, 4, 2048], [False, False, 10, 1], [False, True, 11, 2048], [False, True, 9, 2048], [False, False, 8, 2017], [False, False, 6, 1], [False, True, 4, 1], [False, False, 6, 1587], [False, False, 9, 1056], [True, True, 12, 1450], [False, True, 6, 2048], [False, False, 6, 2048], [False, False, 6, 2048], [False, True, 6, 2048], [False, True, 6, 2048], [False, True, 5, 2048], [False, True, 6, 1464], [False, True, 8, 1], [True, True, 12, 1798], [True, False, 3, 2048], [True, True, 11, 683], [False, True, 11, 1], [True, True, 2, 1], [False, True, 11, 1238], [True, True, 11, 1260], [True, False, 6, 1295], [True, True, 6, 1292], [False, False, 12, 1250], [False, False, 12, 1200], [True, False, 4, 1250], [False, False, 12, 1191], [False, False, 12, 1180], [True, False, 10, 906], [False, False, 12, 1192], [True, True, 10, 2044], [False, False, 6, 1310], [False, False, 8, 1122], [True, False, 5, 4], [False, False, 7, 322], [False, False, 12, 1246], [False, False, 12, 1247], [False, False, 12, 1252], [True, True, 12, 811], [True, False, 6, 2048], [True, True, 12, 998], [False, True, 12, 1021], [False, True, 12, 1021], [False, True, 12, 1019], [True, False, 6, 759], [True, False, 6, 1064], [False, True, 12, 991], [True, True, 9, 533], [False, False, 11, 956], [False, False, 1, 3], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [False, False, 7, 986], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, 
True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048], [True, True, 6, 2048]]\\n\",\n       \"       models: [GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, 
random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n  
     \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        
normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + 
WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"    
                                    normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 
1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, 
noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), 
GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                   
                     n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, 
random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n  
     \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        
normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + 
WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"    
                                    normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5) + WhiteKernel(noise_level=1),\\n\",\n       \"                                        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                        normalize_y=True, random_state=1248744097)]\\n\",\n       \"        space: Space([Categorical(categories=(True, False), prior=None),\\n\",\n       \"                      Categorical(categories=(True, False), prior=None),\\n\",\n       \"                      Integer(low=1, high=12, prior='uniform', transform='normalize'),\\n\",\n       \"                      Integer(low=1, high=2048, prior='uniform', transform='normalize')])\\n\",\n       \" random_state: RandomState(MT19937)\\n\",\n       \"        specs:     args:                    func: <function objective at 0x7f46cd4f8e50>\\n\",\n       \"                                      dimensions: Space([Categorical(categories=(True, False), prior=None),\\n\",\n       \"                                                         Categorical(categories=(True, False), prior=None),\\n\",\n       \"                                                         Integer(low=1, high=12, prior='uniform', transform='normalize'),\\n\",\n       \"                                                         Integer(low=1, high=2048, prior='uniform', transform='normalize')])\\n\",\n       \"                                  base_estimator: GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=[1, 1, 1, 1], nu=2.5),\\n\",\n       \"                                                                   
        n_restarts_optimizer=2, noise='gaussian',\\n\",\n       \"                                                                           normalize_y=True, random_state=1248744097)\\n\",\n       \"                                         n_calls: 100\\n\",\n       \"                                 n_random_starts: None\\n\",\n       \"                                n_initial_points: 10\\n\",\n       \"                         initial_point_generator: random\\n\",\n       \"                                        acq_func: gp_hedge\\n\",\n       \"                                   acq_optimizer: auto\\n\",\n       \"                                              x0: None\\n\",\n       \"                                              y0: None\\n\",\n       \"                                    random_state: RandomState(MT19937)\\n\",\n       \"                                         verbose: False\\n\",\n       \"                                        callback: None\\n\",\n       \"                                        n_points: 10000\\n\",\n       \"                            n_restarts_optimizer: 5\\n\",\n       \"                                              xi: 0.01\\n\",\n       \"                                           kappa: 1.96\\n\",\n       \"                                          n_jobs: 1\\n\",\n       \"                                model_queue_size: None\\n\",\n       \"               function: base_minimize\"\n      ]\n     },\n     \"execution_count\": 4,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"res\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \".venv\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    
\"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.10\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/ray/README.md",
    "content": "This example shows how to run LLM inference with [Ray](https://docs.ray.io/en/latest/index.html) and [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).\n\nFirst, install the requirements:\n\n```bash\n$ pip install -r requirements.txt\n```\n\nDeploy a GGUF model to Ray Serve with the following command:\n\n```bash\n$ serve run llm:llm_builder model_path='../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf'\n```\n\nThis starts an API endpoint at `http://localhost:8000/`. Query the model by sending a POST request whose JSON body contains a `prompt` and, optionally, `max_tokens` (defaults to 64):\n\n```bash\n$ curl -k -d '{\"prompt\": \"tell me a joke\", \"max_tokens\": 128}' -X POST http://localhost:8000\n```\n"
  },
  {
    "path": "examples/ray/llm.py",
    "content": "from starlette.requests import Request\nfrom typing import Dict\nfrom ray import serve\nfrom ray.serve import Application\nfrom llama_cpp import Llama\n\n\n@serve.deployment\nclass LlamaDeployment:\n    def __init__(self, model_path: str):\n        self._llm = Llama(model_path=model_path)\n\n    async def __call__(self, http_request: Request) -> Dict:\n        input_json = await http_request.json()\n        prompt = input_json[\"prompt\"]\n        max_tokens = input_json.get(\"max_tokens\", 64)\n        return self._llm(prompt, max_tokens=max_tokens)\n\n\ndef llm_builder(args: Dict[str, str]) -> Application:\n    return LlamaDeployment.bind(args[\"model_path\"])\n"
  },
  {
    "path": "examples/ray/requirements.txt",
    "content": "ray[serve]\n--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu\nllama-cpp-python\n"
  },
  {
    "path": "llama_cpp/__init__.py",
    "content": "from .llama_cpp import *\nfrom .llama import *\n\n__version__ = \"0.3.16\"\n"
  },
  {
    "path": "llama_cpp/_ctypes_extensions.py",
    "content": "from __future__ import annotations\n\nimport sys\nimport os\nimport ctypes\nimport functools\nimport pathlib\n\nfrom typing import (\n    Any,\n    Callable,\n    List,\n    Union,\n    Optional,\n    TYPE_CHECKING,\n    TypeVar,\n    Generic,\n)\nfrom typing_extensions import TypeAlias\n\n\n# Load the library\ndef load_shared_library(lib_base_name: str, base_path: pathlib.Path):\n    \"\"\"Platform independent shared library loader\"\"\"\n    # Searching for the library in the current directory under the name \"libllama\" (default name\n    # for llamacpp) and \"llama\" (default name for this repo)\n    lib_paths: List[pathlib.Path] = []\n    # Determine the file extension based on the platform\n    if sys.platform.startswith(\"linux\") or sys.platform.startswith(\"freebsd\"):\n        lib_paths += [\n            base_path / f\"lib{lib_base_name}.so\",\n        ]\n    elif sys.platform == \"darwin\":\n        lib_paths += [\n            base_path / f\"lib{lib_base_name}.so\",\n            base_path / f\"lib{lib_base_name}.dylib\",\n        ]\n    elif sys.platform == \"win32\":\n        lib_paths += [\n            base_path / f\"{lib_base_name}.dll\",\n            base_path / f\"lib{lib_base_name}.dll\",\n        ]\n    else:\n        raise RuntimeError(\"Unsupported platform\")\n\n    cdll_args = dict()  # type: ignore\n\n    # Add the library directory to the DLL search path on Windows (if needed)\n    if sys.platform == \"win32\":\n        os.add_dll_directory(str(base_path))\n        os.environ[\"PATH\"] = str(base_path) + os.pathsep + os.environ[\"PATH\"]\n\n    if sys.platform == \"win32\" and sys.version_info >= (3, 8):\n        os.add_dll_directory(str(base_path))\n        if \"CUDA_PATH\" in os.environ:\n            os.add_dll_directory(os.path.join(os.environ[\"CUDA_PATH\"], \"bin\"))\n            os.add_dll_directory(os.path.join(os.environ[\"CUDA_PATH\"], \"lib\"))\n        if \"HIP_PATH\" in os.environ:\n            
os.add_dll_directory(os.path.join(os.environ[\"HIP_PATH\"], \"bin\"))\n            os.add_dll_directory(os.path.join(os.environ[\"HIP_PATH\"], \"lib\"))\n        cdll_args[\"winmode\"] = ctypes.RTLD_GLOBAL\n\n    # Try to load the shared library, handling potential errors\n    for lib_path in lib_paths:\n        if lib_path.exists():\n            try:\n                return ctypes.CDLL(str(lib_path), **cdll_args)  # type: ignore\n            except Exception as e:\n                raise RuntimeError(f\"Failed to load shared library '{lib_path}': {e}\")\n\n    raise FileNotFoundError(\n        f\"Shared library with base name '{lib_base_name}' not found\"\n    )\n\n\n# ctypes sane type hint helpers\n#\n# - Generic Pointer and Array types\n# - PointerOrRef type with a type hinted byref function\n#\n# NOTE: Only use these for static type checking not for runtime checks\n# no good will come of that\n\nif TYPE_CHECKING:\n    CtypesCData = TypeVar(\"CtypesCData\", bound=ctypes._CData)  # type: ignore\n\n    CtypesArray: TypeAlias = ctypes.Array[CtypesCData]  # type: ignore\n\n    CtypesPointer: TypeAlias = ctypes._Pointer[CtypesCData]  # type: ignore\n\n    CtypesVoidPointer: TypeAlias = ctypes.c_void_p\n\n    class CtypesRef(Generic[CtypesCData]):\n        pass\n\n    CtypesPointerOrRef: TypeAlias = Union[\n        CtypesPointer[CtypesCData], CtypesRef[CtypesCData]\n    ]\n\n    CtypesFuncPointer: TypeAlias = ctypes._FuncPointer  # type: ignore\n\nF = TypeVar(\"F\", bound=Callable[..., Any])\n\n\ndef ctypes_function_for_shared_library(lib: ctypes.CDLL):\n    \"\"\"Decorator for defining ctypes functions with type hints\"\"\"\n\n    def ctypes_function(\n        name: str, argtypes: List[Any], restype: Any, enabled: bool = True\n    ):\n        def decorator(f: F) -> F:\n            if enabled:\n                func = getattr(lib, name)\n                func.argtypes = argtypes\n                func.restype = restype\n                functools.wraps(f)(func)\n           
     return func\n            else:\n                return f\n\n        return decorator\n\n    return ctypes_function\n\n\ndef _byref(obj: CtypesCData, offset: Optional[int] = None) -> CtypesRef[CtypesCData]:\n    \"\"\"Type-annotated version of ctypes.byref\"\"\"\n    ...\n\n\nbyref = _byref if TYPE_CHECKING else ctypes.byref\n"
  },
  {
    "path": "llama_cpp/_ggml.py",
    "content": "\"\"\"Internal module; use at your own risk.\n\nThis module provides a minimal interface for working with ggml tensors from llama-cpp-python.\n\"\"\"\nimport os\nimport pathlib\n\nimport llama_cpp._ctypes_extensions as ctypes_ext\n\nlibggml_base_path = pathlib.Path(os.path.abspath(os.path.dirname(__file__))) / \"lib\"\nlibggml = ctypes_ext.load_shared_library(\"ggml\", libggml_base_path)\n\n"
  },
  {
    "path": "llama_cpp/_internals.py",
    "content": "from __future__ import annotations\n\nimport os\nimport ctypes\n\nfrom typing import (\n    Dict,\n    List,\n    Tuple,\n    Optional,\n    Sequence,\n    Callable,\n    Union,\n)\nfrom dataclasses import dataclass, field\nfrom contextlib import ExitStack\n\nimport numpy as np\nimport numpy.typing as npt\n\nfrom .llama_types import *\nfrom .llama_grammar import LlamaGrammar\nfrom ._utils import suppress_stdout_stderr\n\nimport llama_cpp.llama_cpp as llama_cpp\n\n\n# Python wrappers over llama.h structs\n\n\nclass LlamaModel:\n    \"\"\"Intermediate Python wrapper for a llama.cpp llama_model.\n    NOTE: For stability it's recommended you use the Llama class instead.\"\"\"\n\n    def __init__(\n        self,\n        *,\n        path_model: str,\n        params: llama_cpp.llama_model_params,\n        verbose: bool = True,\n    ):\n        self.path_model = path_model\n        self.params = params\n        self.verbose = verbose\n        self._exit_stack = ExitStack()\n\n        model = None\n\n        if not os.path.exists(path_model):\n            raise ValueError(f\"Model path does not exist: {path_model}\")\n\n        with suppress_stdout_stderr(disable=verbose):\n            model = llama_cpp.llama_model_load_from_file(\n                self.path_model.encode(\"utf-8\"), self.params\n            )\n\n        if model is None:\n            raise ValueError(f\"Failed to load model from file: {path_model}\")\n\n        vocab = llama_cpp.llama_model_get_vocab(model)\n\n        if vocab is None:\n            raise ValueError(f\"Failed to get vocab from model: {path_model}\")\n\n        self.model = model\n        self.vocab = vocab\n        self.sampler = None  # LlamaModel doesn't use samplers, but some cleanup code expects this attribute\n\n        def free_model():\n            if self.model is None:\n                return\n            llama_cpp.llama_model_free(self.model)\n            self.model = None\n\n        
self._exit_stack.callback(free_model)\n\n    def close(self):\n        if self.sampler is not None:\n            # NOTE: Must remove custom samplers before free or llama.cpp will try to free them\n            for i, _ in reversed(self.custom_samplers):\n                llama_cpp.llama_sampler_chain_remove(self.sampler, i)\n            self.custom_samplers.clear()\n        self._exit_stack.close()\n\n    def __del__(self):\n        self.close()\n\n    def vocab_type(self) -> int:\n        return llama_cpp.llama_vocab_type(self.vocab)\n\n    def n_vocab(self) -> int:\n        return llama_cpp.llama_vocab_n_tokens(self.vocab)\n\n    def n_ctx_train(self) -> int:\n        return llama_cpp.llama_model_n_ctx_train(self.model)\n\n    def n_embd(self) -> int:\n        return llama_cpp.llama_model_n_embd(self.model)\n\n    def rope_freq_scale_train(self) -> float:\n        return llama_cpp.llama_model_rope_freq_scale_train(self.model)\n\n    def desc(self) -> str:\n        buf = ctypes.create_string_buffer(1024)\n        llama_cpp.llama_model_desc(self.model, buf, 1024)\n        return buf.value.decode(\"utf-8\")\n\n    def size(self) -> int:\n        return llama_cpp.llama_model_size(self.model)\n\n    def n_params(self) -> int:\n        return llama_cpp.llama_model_n_params(self.model)\n\n    def get_tensor(self, name: str) -> ctypes.c_void_p:\n        raise NotImplementedError(\"get_tensor is not implemented in llama.cpp\")\n\n    # Vocab\n\n    def token_get_text(self, token: int) -> str:\n        return llama_cpp.llama_vocab_get_text(self.vocab, token).decode(\"utf-8\")\n\n    def token_get_score(self, token: int) -> float:\n        return llama_cpp.llama_vocab_get_score(self.vocab, token)\n\n    def token_get_attr(self, token: int) -> int:\n        return llama_cpp.llama_vocab_get_attr(self.vocab, token)\n\n    # Special tokens\n\n    def token_bos(self) -> int:\n        return llama_cpp.llama_vocab_bos(self.vocab)\n\n    def token_eos(self) -> int:\n        return 
llama_cpp.llama_vocab_eos(self.vocab)\n\n    def token_cls(self) -> int:\n        return llama_cpp.llama_vocab_cls(self.vocab)\n\n    def token_sep(self) -> int:\n        return llama_cpp.llama_vocab_sep(self.vocab)\n\n    def token_nl(self) -> int:\n        return llama_cpp.llama_vocab_nl(self.vocab)\n\n    def token_prefix(self) -> int:\n        return llama_cpp.llama_vocab_fim_pre(self.vocab)\n\n    def token_middle(self) -> int:\n        return llama_cpp.llama_vocab_fim_mid(self.vocab)\n\n    def token_suffix(self) -> int:\n        return llama_cpp.llama_vocab_fim_suf(self.vocab)\n\n    def token_eot(self) -> int:\n        return llama_cpp.llama_vocab_eot(self.vocab)\n\n    def add_bos_token(self) -> bool:\n        return llama_cpp.llama_vocab_get_add_bos(self.vocab)\n\n    def add_eos_token(self) -> bool:\n        return llama_cpp.llama_vocab_get_add_eos(self.vocab)\n\n    # Tokenization\n\n    def tokenize(self, text: bytes, add_bos: bool, special: bool):\n        n_ctx = self.n_ctx_train()\n        tokens = (llama_cpp.llama_token * n_ctx)()\n        n_tokens = llama_cpp.llama_tokenize(\n            self.vocab, text, len(text), tokens, n_ctx, add_bos, special\n        )\n        if n_tokens < 0:\n            n_tokens = abs(n_tokens)\n            tokens = (llama_cpp.llama_token * n_tokens)()\n            n_tokens = llama_cpp.llama_tokenize(\n                self.vocab, text, len(text), tokens, n_tokens, add_bos, special\n            )\n            if n_tokens < 0:\n                raise RuntimeError(\n                    f'Failed to tokenize: text=\"{text}\" n_tokens={n_tokens}'\n                )\n        return list(tokens[:n_tokens])\n\n    def token_to_piece(self, token: int, special: bool = False) -> bytes:\n        buf = ctypes.create_string_buffer(32)\n        llama_cpp.llama_token_to_piece(self.vocab, token, buf, 32, 0, special)\n        return bytes(buf)\n\n    def detokenize(self, tokens: List[int], special: bool = False) -> bytes:\n        output = 
b\"\"\n        size = 32\n        buffer = (ctypes.c_char * size)()\n        for token in tokens:\n            n = llama_cpp.llama_token_to_piece(\n                self.vocab, llama_cpp.llama_token(token), buffer, size, 0, special\n            )\n            assert n <= size\n            output += bytes(buffer[:n])\n        # NOTE: Llama1 models automatically added a space at the start of the prompt\n        # this line removes a leading space if the first token is a beginning of sentence token\n        return (\n            output[1:]\n            if len(tokens) > 0 and tokens[0] == self.token_bos() and output[0:1] == b\" \"\n            else output\n        )\n\n    # Extra\n    def metadata(self) -> Dict[str, str]:\n        metadata: Dict[str, str] = {}\n        buffer_size = 1024\n        buffer = ctypes.create_string_buffer(buffer_size)\n        # zero the buffer\n        buffer.value = b\"\\0\" * buffer_size\n        # iterate over model keys\n        for i in range(llama_cpp.llama_model_meta_count(self.model)):\n            nbytes = llama_cpp.llama_model_meta_key_by_index(\n                self.model, i, buffer, buffer_size\n            )\n            if nbytes > buffer_size:\n                buffer_size = nbytes + 1\n                buffer = ctypes.create_string_buffer(buffer_size)\n                nbytes = llama_cpp.llama_model_meta_key_by_index(\n                    self.model, i, buffer, buffer_size\n                )\n            key = buffer.value.decode(\"utf-8\")\n            nbytes = llama_cpp.llama_model_meta_val_str_by_index(\n                self.model, i, buffer, buffer_size\n            )\n            if nbytes > buffer_size:\n                buffer_size = nbytes + 1\n                buffer = ctypes.create_string_buffer(buffer_size)\n                nbytes = llama_cpp.llama_model_meta_val_str_by_index(\n                    self.model, i, buffer, buffer_size\n                )\n            value = buffer.value.decode(\"utf-8\")\n            
metadata[key] = value\n        return metadata\n\n    @staticmethod\n    def default_params():\n        \"\"\"Get the default llama_model_params.\"\"\"\n        return llama_cpp.llama_model_default_params()\n\n\nclass LlamaContext:\n    \"\"\"Intermediate Python wrapper for a llama.cpp llama_context.\n    NOTE: For stability it's recommended you use the Llama class instead.\"\"\"\n\n    def __init__(\n        self,\n        *,\n        model: LlamaModel,\n        params: llama_cpp.llama_context_params,\n        verbose: bool = True,\n    ):\n        self.model = model\n        self.params = params\n        self.verbose = verbose\n        self._exit_stack = ExitStack()\n\n        ctx = llama_cpp.llama_init_from_model(self.model.model, self.params)\n\n        if ctx is None:\n            raise ValueError(\"Failed to create llama_context\")\n\n        self.ctx = ctx\n        self.memory = llama_cpp.llama_get_memory(self.ctx)\n        self.sampler = None  # LlamaContext doesn't manage samplers directly, but some cleanup code expects this attribute\n\n        def free_ctx():\n            if self.ctx is None:\n                return\n            llama_cpp.llama_free(self.ctx)\n            self.ctx = None\n\n        self._exit_stack.callback(free_ctx)\n\n    def close(self):\n        self._exit_stack.close()\n\n    def __del__(self):\n        self.close()\n\n    def n_ctx(self) -> int:\n        return llama_cpp.llama_n_ctx(self.ctx)\n\n    def pooling_type(self) -> int:\n        return llama_cpp.llama_pooling_type(self.ctx)\n\n    def kv_cache_clear(self):\n        assert self.memory is not None, \"Memory is not initialized\"\n        llama_cpp.llama_memory_clear(self.memory, True)\n\n    def kv_cache_seq_rm(self, seq_id: int, p0: int, p1: int):\n        assert self.memory is not None, \"Memory is not initialized\"\n        seq_id = seq_id if seq_id >= 0 else 0\n        llama_cpp.llama_memory_seq_rm(self.memory, seq_id, p0, p1)\n\n    def kv_cache_seq_cp(self, seq_id_src: 
int, seq_id_dst: int, p0: int, p1: int):\n        assert self.memory is not None, \"Memory is not initialized\"\n        llama_cpp.llama_memory_seq_cp(self.memory, seq_id_src, seq_id_dst, p0, p1)\n\n    def kv_cache_seq_keep(self, seq_id: int):\n        assert self.memory is not None, \"Memory is not initialized\"\n        llama_cpp.llama_memory_seq_keep(self.memory, seq_id)\n\n    def kv_cache_seq_shift(self, seq_id: int, p0: int, p1: int, shift: int):\n        assert self.memory is not None, \"Memory is not initialized\"\n        llama_cpp.llama_memory_seq_add(self.memory, seq_id, p0, p1, shift)\n\n    def get_state_size(self) -> int:\n        return llama_cpp.llama_state_get_size(self.ctx)\n\n    # TODO: copy_state_data\n\n    # TODO: set_state_data\n\n    # TODO: llama_load_session_file\n\n    # TODO: llama_save_session_file\n\n    def decode(self, batch: LlamaBatch):\n        return_code = llama_cpp.llama_decode(\n            self.ctx,\n            batch.batch,\n        )\n        if return_code != 0:\n            raise RuntimeError(f\"llama_decode returned {return_code}\")\n\n    def encode(self, batch: LlamaBatch):\n        return_code = llama_cpp.llama_encode(\n            self.ctx,\n            batch.batch,\n        )\n        if return_code != 0:\n            raise RuntimeError(f\"llama_encode returned {return_code}\")\n\n    def set_n_threads(self, n_threads: int, n_threads_batch: int):\n        llama_cpp.llama_set_n_threads(self.ctx, n_threads, n_threads_batch)\n\n    def get_logits(self):\n        return llama_cpp.llama_get_logits(self.ctx)\n\n    def get_logits_ith(self, i: int):\n        return llama_cpp.llama_get_logits_ith(self.ctx, i)\n\n    def get_embeddings(self):\n        return llama_cpp.llama_get_embeddings(self.ctx)\n\n    def get_embeddings_ith(self, i: int):\n        return llama_cpp.llama_get_embeddings_ith(self.ctx, i)\n\n    def get_embeddings_seq(self, seq_id: int):\n        return llama_cpp.llama_get_embeddings_seq(self.ctx, 
seq_id)\n\n    # Sampling functions - deprecated, use LlamaSampler instead\n\n    def set_rng_seed(self, seed: int):\n        raise NotImplementedError(\"set_rng_seed is deprecated, use LlamaSampler instead\")\n\n    def sample_repetition_penalties(\n        self,\n        candidates: \"_LlamaTokenDataArray\",\n        last_tokens_data: \"llama_cpp.Array[llama_cpp.llama_token]\",\n        penalty_last_n: int,\n        penalty_repeat: float,\n        penalty_freq: float,\n        penalty_present: float,\n    ):\n        raise NotImplementedError(\"sample_repetition_penalties is deprecated, use LlamaSampler instead\")\n\n    def sample_softmax(self, candidates: \"_LlamaTokenDataArray\"):\n        raise NotImplementedError(\"sample_softmax is deprecated, use LlamaSampler instead\")\n\n    def sample_top_k(self, candidates: \"_LlamaTokenDataArray\", k: int, min_keep: int):\n        raise NotImplementedError(\"sample_top_k is deprecated, use LlamaSampler instead\")\n\n    def sample_top_p(self, candidates: \"_LlamaTokenDataArray\", p: float, min_keep: int):\n        raise NotImplementedError(\"sample_top_p is deprecated, use LlamaSampler instead\")\n\n    def sample_min_p(self, candidates: \"_LlamaTokenDataArray\", p: float, min_keep: int):\n        raise NotImplementedError(\"sample_min_p is deprecated, use LlamaSampler instead\")\n\n    def sample_typical(\n        self, candidates: \"_LlamaTokenDataArray\", p: float, min_keep: int\n    ):\n        raise NotImplementedError(\"sample_typical is deprecated, use LlamaSampler instead\")\n\n    def sample_temp(self, candidates: \"_LlamaTokenDataArray\", temp: float):\n        raise NotImplementedError(\"sample_temp is deprecated, use LlamaSampler instead\")\n\n    def sample_grammar(self, candidates: \"_LlamaTokenDataArray\", grammar: LlamaGrammar):\n        raise NotImplementedError(\"sample_grammar is deprecated, use LlamaSampler instead\")\n\n    def sample_token_mirostat(\n        self,\n        candidates: 
\"_LlamaTokenDataArray\",\n        tau: float,\n        eta: float,\n        m: int,\n        mu: llama_cpp.CtypesPointerOrRef[ctypes.c_float],\n    ) -> int:\n        raise NotImplementedError(\"sample_token_mirostat is deprecated, use LlamaSampler instead\")\n\n    def sample_token_mirostat_v2(\n        self,\n        candidates: \"_LlamaTokenDataArray\",\n        tau: float,\n        eta: float,\n        mu: llama_cpp.CtypesPointerOrRef[ctypes.c_float],\n    ) -> int:\n        raise NotImplementedError(\"sample_token_mirostat_v2 is deprecated, use LlamaSampler instead\")\n\n    def sample_token_greedy(self, candidates: \"_LlamaTokenDataArray\") -> int:\n        raise NotImplementedError(\"sample_token_greedy is deprecated, use LlamaSampler instead\")\n\n    def sample_token(self, candidates: \"_LlamaTokenDataArray\") -> int:\n        raise NotImplementedError(\"sample_token is deprecated, use LlamaSampler instead\")\n\n    # Grammar\n    def grammar_accept_token(self, grammar: LlamaGrammar, token: int):\n        raise NotImplementedError(\"grammar_accept_token is deprecated, use LlamaSampler instead\")\n\n    def reset_timings(self):\n        llama_cpp.llama_perf_context_reset(self.ctx)\n\n    def print_timings(self):\n        llama_cpp.llama_perf_context_print(self.ctx)\n\n    # Utility functions\n    @staticmethod\n    def default_params():\n        \"\"\"Get the default llama_context_params.\"\"\"\n        return llama_cpp.llama_context_default_params()\n\n\nclass LlamaBatch:\n    def __init__(\n        self, *, n_tokens: int, embd: int, n_seq_max: int, verbose: bool = True\n    ):\n        self._n_tokens = n_tokens\n        self.embd = embd\n        self.n_seq_max = n_seq_max\n        self.verbose = verbose\n        self._exit_stack = ExitStack()\n\n        batch = llama_cpp.llama_batch_init(self._n_tokens, self.embd, self.n_seq_max)\n\n        if batch is None:\n            raise ValueError(\"Failed to create llama_batch\")\n\n        self.batch = batch\n   
     self.sampler = None  # LlamaBatch doesn't use samplers, but some cleanup code expects this attribute\n\n        def free_batch():\n            if self.batch is None:\n                return\n            llama_cpp.llama_batch_free(self.batch)\n            self.batch = None\n\n        self._exit_stack.callback(free_batch)\n\n    def close(self):\n        self._exit_stack.close()\n\n    def __del__(self):\n        self.close()\n\n    def n_tokens(self) -> int:\n        return self.batch.n_tokens\n\n    def reset(self):\n        self.batch.n_tokens = 0\n\n    def set_batch(self, batch: Sequence[int], n_past: int, logits_all: bool):\n        n_tokens = len(batch)\n        self.batch.n_tokens = n_tokens\n        for i in range(n_tokens):\n            self.batch.token[i] = batch[i]\n            self.batch.pos[i] = n_past + i\n            self.batch.seq_id[i][0] = 0\n            self.batch.n_seq_id[i] = 1\n            self.batch.logits[i] = logits_all\n        self.batch.logits[n_tokens - 1] = True\n\n    def add_sequence(self, batch: Sequence[int], seq_id: int, logits_all: bool):\n        n_tokens = len(batch)\n        n_tokens0 = self.batch.n_tokens\n        self.batch.n_tokens += n_tokens\n        for i in range(n_tokens):\n            j = n_tokens0 + i\n            self.batch.token[j] = batch[i]\n            self.batch.pos[j] = i\n            self.batch.seq_id[j][0] = seq_id\n            self.batch.n_seq_id[j] = 1\n            self.batch.logits[j] = logits_all\n        # Always request logits for the last token appended by this call,\n        # offset by the tokens that were already in the batch\n        self.batch.logits[n_tokens0 + n_tokens - 1] = True\n\n\nclass LlamaTokenDataArray:\n    def __init__(self, *, n_vocab: int):\n        self.n_vocab = n_vocab\n        self.candidates_data = np.recarray(\n            (self.n_vocab,),\n            dtype=np.dtype(\n                [(\"id\", np.intc), (\"logit\", np.single), (\"p\", np.single)], align=True\n            ),\n        )\n        self.candidates = llama_cpp.llama_token_data_array(\n            
data=self.candidates_data.ctypes.data_as(llama_cpp.llama_token_data_p),\n            size=self.n_vocab,\n            sorted=False,\n        )\n        self.default_candidates_data_id = np.arange(self.n_vocab, dtype=np.intc)  # type: ignore\n        self.default_candidates_data_p = np.zeros(self.n_vocab, dtype=np.single)\n        self.sampler = None  # LlamaTokenDataArray doesn't use samplers, but some cleanup code expects this attribute\n\n    def copy_logits(self, logits: npt.NDArray[np.single]):\n        self.candidates_data.id[:] = self.default_candidates_data_id\n        self.candidates_data.logit[:] = logits\n        self.candidates_data.p[:] = self.default_candidates_data_p\n        self.candidates.sorted = False\n        self.candidates.size = self.n_vocab\n\n\n# Embedding functions\n\n\ndef normalize_embedding(embedding):\n    norm = float(np.linalg.norm(embedding))\n    if norm == 0.0:\n        return embedding\n    return [v / norm for v in embedding]\n\n\n# Python wrappers over common/sampling structs\n\n\n@dataclass\nclass LlamaSamplingParams:\n    n_prev: int = 64\n    n_probs: int = 0\n    top_k: int = 40\n    top_p: float = 0.95\n    min_p: float = 0.05\n    tfs_z: float = 1.00\n    typical_p: float = 1.00\n    temp: float = 0.80\n    penalty_last_n: int = 64\n    penalty_repeat: float = 1.0\n    penalty_freq: float = 0.00\n    penalty_present: float = 0.00\n    mirostat: int = 0\n    mirostat_tau: float = 5.00\n    mirostat_eta: float = 0.10\n    penalize_nl: bool = True\n\n    grammar: str = \"\"\n\n    cfg_negative_prompt: str = \"\"\n    cfg_scale: float = 1.00\n\n    logit_bias: dict[int, float] = field(default_factory=dict)\n\n\n@dataclass\nclass LlamaSamplingContext:\n    params: LlamaSamplingParams = field(default_factory=LlamaSamplingParams)\n    mirostat_mu: ctypes.c_float = field(default_factory=ctypes.c_float)\n    grammar: Optional[LlamaGrammar] = None\n    # NOTE: Missing parsed_grammar\n    prev: list[int] = 
field(default_factory=list)\n    cur: list[llama_cpp.llama_token_data] = field(default_factory=list)\n\n    def reset(self):\n        self.prev = []\n        self.cur = []\n        if self.grammar is not None:\n            self.grammar.reset()\n\n    def cp(self):\n        return LlamaSamplingContext(\n            params=self.params,\n            mirostat_mu=self.mirostat_mu,\n            grammar=self.grammar,\n            prev=self.prev.copy(),\n            cur=self.cur.copy(),\n        )\n\n    def last(self) -> Optional[int]:\n        if len(self.prev) > 0:\n            return self.prev[-1]\n        else:\n            return None\n\n    def prev_str(self, ctx_main: LlamaContext, n: int) -> str:\n        return ctx_main.model.detokenize(self.prev[-n:]).decode(\"utf-8\")\n\n    def sample(\n        self,\n        ctx_main: LlamaContext,\n        idx: int = 0,\n        logits_array: Optional[npt.NDArray[np.single]] = None,\n    ):\n        # This method is deprecated in favor of using LlamaSampler directly\n        raise NotImplementedError(\"LlamaSamplingContext.sample is deprecated, use LlamaSampler instead\")\n\n    def accept(self, ctx_main: LlamaContext, id: int, apply_grammar: bool):\n        self.prev.append(id)\n\n\nclass CustomSampler:\n    def __init__(\n        self, apply_func: Callable[[llama_cpp.llama_token_data_array], None]\n    ):\n        self.apply_func = apply_func\n\n        def apply_wrapper(\n            sampler: llama_cpp.llama_sampler_p,\n            cur_p: llama_cpp.llama_token_data_array_p,\n        ):\n            self.apply_func(cur_p)\n\n        def free_wrapper(sampler: llama_cpp.llama_sampler_p):\n            pass\n\n        sampler_i = llama_cpp.llama_sampler_i()\n        sampler_i.apply = llama_cpp.llama_sampler_i_apply(apply_wrapper)\n        self._apply_wrapper_ref = apply_wrapper\n\n        sampler_i.name = llama_cpp.llama_sampler_i_name(0)\n        sampler_i.accept = llama_cpp.llama_sampler_i_accept(0)\n        sampler_i.reset 
= llama_cpp.llama_sampler_i_reset(0)\n        sampler_i.clone = llama_cpp.llama_sampler_i_clone(0)\n        sampler_i.free = llama_cpp.llama_sampler_i_free(0)\n\n        self.sampler = llama_cpp.llama_sampler()\n        self.sampler.iface = ctypes.pointer(sampler_i)\n        self.sampler.ctx = None\n\n    def get_sampler(self) -> llama_cpp.llama_sampler_p:\n        return ctypes.pointer(self.sampler)\n\n\nclass LlamaSampler:\n    def __init__(self):\n        params = llama_cpp.llama_sampler_chain_default_params()\n        self.sampler = llama_cpp.llama_sampler_chain_init(params)\n        self.custom_samplers: List[Tuple[int, CustomSampler]] = []\n        self._exit_stack = ExitStack()\n\n        def free_sampler():\n            if self.sampler is not None:\n                # NOTE: Must remove custom samplers before free or llama.cpp will try to free them\n                for i, _ in reversed(self.custom_samplers):\n                    llama_cpp.llama_sampler_chain_remove(self.sampler, i)\n                llama_cpp.llama_sampler_free(self.sampler)\n                self.sampler = None\n\n        self._exit_stack.callback(free_sampler)\n\n    def close(self):\n        self._exit_stack.close()\n\n    def __del__(self):\n        self.close()\n\n    def add_greedy(self):\n        sampler = llama_cpp.llama_sampler_init_greedy()\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_dist(self, seed: int):\n        sampler = llama_cpp.llama_sampler_init_dist(seed)\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_softmax(self):\n        sampler = llama_cpp.llama_sampler_init_softmax()\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_top_k(self, k: int):\n        sampler = llama_cpp.llama_sampler_init_top_k(k)\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_top_p(self, p: float, min_keep: int = 1):\n        sampler = llama_cpp.llama_sampler_init_top_p(p, 
min_keep)\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_min_p(self, p: float, min_keep: int = 1):\n        sampler = llama_cpp.llama_sampler_init_min_p(p, min_keep)\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_typical(self, p: float, min_keep: int = 1):\n        sampler = llama_cpp.llama_sampler_init_typical(p, min_keep)\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_temp(self, temp: float):\n        sampler = llama_cpp.llama_sampler_init_temp(temp)\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_temp_ext(self, t: float, delta: float, exponent: float):\n        sampler = llama_cpp.llama_sampler_init_temp_ext(t, delta, exponent)\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_xtc(self, p: float, t: float, min_keep: int, seed: int):\n        sampler = llama_cpp.llama_sampler_init_xtc(p, t, min_keep, seed)\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_top_n_sigma(self, n: float):\n        sampler = llama_cpp.llama_sampler_init_top_n_sigma(n)\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_mirostat(self, n_vocab: int, seed: int, tau: float, eta: float, m: int):\n        sampler = llama_cpp.llama_sampler_init_mirostat(n_vocab, seed, tau, eta, m)\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_mirostat_v2(self, seed: int, tau: float, eta: float):\n        sampler = llama_cpp.llama_sampler_init_mirostat_v2(seed, tau, eta)\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_grammar(self, model: LlamaModel, grammar: LlamaGrammar):\n        sampler = llama_cpp.llama_sampler_init_grammar(\n            model.vocab, grammar._grammar.encode(\"utf-8\"), grammar._root.encode(\"utf-8\")\n        )\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def 
add_grammar_lazy_patterns(\n        self, \n        model: LlamaModel, \n        grammar: LlamaGrammar,\n        trigger_patterns: List[str],\n        trigger_tokens: List[int]\n    ):\n        # Convert patterns to C array\n        pattern_ptrs = (ctypes.c_char_p * len(trigger_patterns))()\n        for i, pattern in enumerate(trigger_patterns):\n            pattern_ptrs[i] = pattern.encode(\"utf-8\")\n        \n        # Convert tokens to C array\n        token_array = (llama_cpp.llama_token * len(trigger_tokens))(*trigger_tokens)\n        \n        sampler = llama_cpp.llama_sampler_init_grammar_lazy_patterns(\n            model.vocab,\n            grammar._grammar.encode(\"utf-8\"),\n            grammar._root.encode(\"utf-8\"),\n            pattern_ptrs,\n            len(trigger_patterns),\n            token_array,\n            len(trigger_tokens)\n        )\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_penalties(\n        self,\n        penalty_last_n: int,\n        penalty_repeat: float,\n        penalty_freq: float,\n        penalty_present: float,\n    ):\n        sampler = llama_cpp.llama_sampler_init_penalties(\n            penalty_last_n,\n            penalty_repeat,\n            penalty_freq,\n            penalty_present,\n        )\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_dry(\n        self,\n        model: LlamaModel,\n        n_ctx_train: int,\n        dry_multiplier: float,\n        dry_base: float,\n        dry_allowed_length: int,\n        dry_penalty_last_n: int,\n        seq_breakers: List[str]\n    ):\n        # Convert seq_breakers to C array\n        breaker_ptrs = (ctypes.c_char_p * len(seq_breakers))()\n        for i, breaker in enumerate(seq_breakers):\n            breaker_ptrs[i] = breaker.encode(\"utf-8\")\n        \n        sampler = llama_cpp.llama_sampler_init_dry(\n            model.vocab,\n            n_ctx_train,\n            dry_multiplier,\n            
dry_base,\n            dry_allowed_length,\n            dry_penalty_last_n,\n            breaker_ptrs,\n            len(seq_breakers)\n        )\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_logit_bias(\n        self, \n        n_vocab: int, \n        logit_bias: Dict[int, float]\n    ):\n        # Convert logit_bias dict to C array\n        bias_array = (llama_cpp.llama_logit_bias * len(logit_bias))()\n        for i, (token, bias) in enumerate(logit_bias.items()):\n            bias_array[i].token = token\n            bias_array[i].bias = bias\n        \n        sampler = llama_cpp.llama_sampler_init_logit_bias(\n            n_vocab,\n            len(logit_bias),\n            bias_array\n        )\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_infill(self, model: LlamaModel):\n        sampler = llama_cpp.llama_sampler_init_infill(model.vocab)\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n\n    def add_custom(\n        self, apply_func: Callable[[llama_cpp.llama_token_data_array], None]\n    ):\n        custom_sampler = CustomSampler(apply_func)\n        sampler = custom_sampler.get_sampler()\n        llama_cpp.llama_sampler_chain_add(self.sampler, sampler)\n        # NOTE: Must remove custom samplers before free or llama.cpp will try to free them\n        self.custom_samplers.append(\n            (llama_cpp.llama_sampler_chain_n(self.sampler) - 1, custom_sampler)\n        )\n\n    def get_seed(self) -> int:\n        return llama_cpp.llama_sampler_get_seed(self.sampler)\n\n    def sample(self, ctx: LlamaContext, idx: int = -1) -> int:\n        return llama_cpp.llama_sampler_sample(self.sampler, ctx.ctx, idx)\n\n    def accept(self, token: int):\n        llama_cpp.llama_sampler_accept(self.sampler, token)\n\n    def reset(self):\n        llama_cpp.llama_sampler_reset(self.sampler)\n\n    def clone(self):\n        # NOTE: Custom samplers cannot be cloned due to Python callback 
limitations\n        if self.custom_samplers:\n            raise NotImplementedError(\"Cannot clone LlamaSampler that contains custom samplers\")\n        \n        cloned_sampler = llama_cpp.llama_sampler_clone(self.sampler)\n        # Create a new wrapper around the cloned sampler\n        new_sampler = LlamaSampler.__new__(LlamaSampler)\n        new_sampler.sampler = cloned_sampler\n        new_sampler.custom_samplers = []\n        new_sampler._exit_stack = ExitStack()\n        \n        def free_sampler():\n            if new_sampler.sampler is not None:\n                llama_cpp.llama_sampler_free(new_sampler.sampler)\n                new_sampler.sampler = None\n\n        new_sampler._exit_stack.callback(free_sampler)\n        return new_sampler\n"
  },
  {
    "path": "llama_cpp/_logger.py",
    "content": "import sys\nimport ctypes\nimport logging\n\nimport llama_cpp\n\n# enum ggml_log_level {\n#     GGML_LOG_LEVEL_NONE  = 0,\n#     GGML_LOG_LEVEL_INFO  = 1,\n#     GGML_LOG_LEVEL_WARN  = 2,\n#     GGML_LOG_LEVEL_ERROR = 3,\n#     GGML_LOG_LEVEL_DEBUG = 4,\n#     GGML_LOG_LEVEL_CONT  = 5, // continue previous log\n# };\nGGML_LOG_LEVEL_TO_LOGGING_LEVEL = {\n    0: logging.CRITICAL,\n    1: logging.INFO,\n    2: logging.WARNING,\n    3: logging.ERROR,\n    4: logging.DEBUG,\n    5: logging.DEBUG,\n}\n\nlogger = logging.getLogger(\"llama-cpp-python\")\n\n_last_log_level = GGML_LOG_LEVEL_TO_LOGGING_LEVEL[0]\n\n# typedef void (*ggml_log_callback)(enum ggml_log_level level, const char * text, void * user_data);\n@llama_cpp.llama_log_callback\ndef llama_log_callback(\n    level: int,\n    text: bytes,\n    user_data: ctypes.c_void_p,\n):\n    # GGML_LOG_LEVEL_CONT (5) continues the previous message, so it inherits that message's level\n    global _last_log_level\n    log_level = GGML_LOG_LEVEL_TO_LOGGING_LEVEL[level] if level != 5 else _last_log_level\n    if logger.level <= log_level:\n        print(text.decode(\"utf-8\"), end=\"\", flush=True, file=sys.stderr)\n    _last_log_level = log_level\n\n\nllama_cpp.llama_log_set(llama_log_callback, ctypes.c_void_p(0))\n\n\ndef set_verbose(verbose: bool):\n    logger.setLevel(logging.DEBUG if verbose else logging.ERROR)\n"
  },
  {
    "path": "llama_cpp/_utils.py",
    "content": "import os\nimport sys\n\nfrom typing import Any, Dict\n\n# Avoid \"LookupError: unknown encoding: ascii\" when open() called in a destructor\noutnull_file = open(os.devnull, \"w\")\nerrnull_file = open(os.devnull, \"w\")\n\nSTDOUT_FILENO = 1\nSTDERR_FILENO = 2\n\n\nclass suppress_stdout_stderr(object):\n    # NOTE: these must be \"saved\" here to avoid exceptions when using\n    #       this context manager inside of a __del__ method\n    sys = sys\n    os = os\n\n    def __init__(self, disable: bool = True):\n        self.disable = disable\n\n    # Oddly enough this works better than the contextlib version\n    def __enter__(self):\n        if self.disable:\n            return self\n\n        self.old_stdout_fileno_undup = STDOUT_FILENO\n        self.old_stderr_fileno_undup = STDERR_FILENO\n\n        self.old_stdout_fileno = self.os.dup(self.old_stdout_fileno_undup)\n        self.old_stderr_fileno = self.os.dup(self.old_stderr_fileno_undup)\n\n        self.old_stdout = self.sys.stdout\n        self.old_stderr = self.sys.stderr\n\n        self.os.dup2(outnull_file.fileno(), self.old_stdout_fileno_undup)\n        self.os.dup2(errnull_file.fileno(), self.old_stderr_fileno_undup)\n\n        self.sys.stdout = outnull_file\n        self.sys.stderr = errnull_file\n        return self\n\n    def __exit__(self, *_):\n        if self.disable:\n            return\n\n        # Check if sys.stdout and sys.stderr have fileno method\n        self.sys.stdout = self.old_stdout\n        self.sys.stderr = self.old_stderr\n\n        self.os.dup2(self.old_stdout_fileno, self.old_stdout_fileno_undup)\n        self.os.dup2(self.old_stderr_fileno, self.old_stderr_fileno_undup)\n\n        self.os.close(self.old_stdout_fileno)\n        self.os.close(self.old_stderr_fileno)\n\n\nclass MetaSingleton(type):\n    \"\"\"\n    Metaclass for implementing the Singleton pattern.\n    \"\"\"\n\n    _instances: Dict[type, Any] = {}\n\n    def __call__(cls, *args: Any, **kwargs: Any) 
-> Any:\n        if cls not in cls._instances:\n            cls._instances[cls] = super(MetaSingleton, cls).__call__(*args, **kwargs)\n        return cls._instances[cls]\n\n\nclass Singleton(object, metaclass=MetaSingleton):\n    \"\"\"\n    Base class for implementing the Singleton pattern.\n    \"\"\"\n\n    def __init__(self):\n        super(Singleton, self).__init__()\n"
  },
  {
    "path": "llama_cpp/llama.py",
    "content": "from __future__ import annotations\n\nimport os\nimport sys\nimport uuid\nimport time\nimport json\nimport ctypes\nimport typing\nimport random\nimport fnmatch\nimport warnings\nimport contextlib\nimport multiprocessing\n\nfrom typing import (\n    Any,\n    List,\n    Literal,\n    Optional,\n    Union,\n    Generator,\n    Sequence,\n    Iterator,\n    Deque,\n    Callable,\n    Dict,\n)\nfrom collections import deque\nfrom pathlib import Path\n\n\nfrom .llama_types import *\nfrom .llama_grammar import LlamaGrammar\nfrom .llama_cache import (\n    BaseLlamaCache,\n    LlamaCache,  # type: ignore\n    LlamaDiskCache,  # type: ignore\n    LlamaRAMCache,  # type: ignore\n)\nfrom .llama_tokenizer import BaseLlamaTokenizer, LlamaTokenizer\nimport llama_cpp.llama_cpp as llama_cpp\nimport llama_cpp.llama_chat_format as llama_chat_format\n\nfrom llama_cpp.llama_speculative import LlamaDraftModel\n\nimport numpy as np\nimport numpy.typing as npt\n\nimport llama_cpp._internals as internals\nfrom ._logger import set_verbose\nfrom ._utils import suppress_stdout_stderr\n\n\nclass Llama:\n    \"\"\"High-level Python wrapper for a llama.cpp model.\"\"\"\n\n    __backend_initialized = False\n\n    def __init__(\n        self,\n        model_path: str,\n        *,\n        # Model Params\n        n_gpu_layers: int = 0,\n        split_mode: int = llama_cpp.LLAMA_SPLIT_MODE_LAYER,\n        main_gpu: int = 0,\n        tensor_split: Optional[List[float]] = None,\n        vocab_only: bool = False,\n        use_mmap: bool = True,\n        use_mlock: bool = False,\n        kv_overrides: Optional[Dict[str, Union[bool, int, float, str]]] = None,\n        # Context Params\n        seed: int = llama_cpp.LLAMA_DEFAULT_SEED,\n        n_ctx: int = 512,\n        n_batch: int = 512,\n        n_ubatch: int = 512,\n        n_threads: Optional[int] = None,\n        n_threads_batch: Optional[int] = None,\n        rope_scaling_type: Optional[\n            int\n        ] = 
llama_cpp.LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED,\n        pooling_type: int = llama_cpp.LLAMA_POOLING_TYPE_UNSPECIFIED,\n        rope_freq_base: float = 0.0,\n        rope_freq_scale: float = 0.0,\n        yarn_ext_factor: float = -1.0,\n        yarn_attn_factor: float = 1.0,\n        yarn_beta_fast: float = 32.0,\n        yarn_beta_slow: float = 1.0,\n        yarn_orig_ctx: int = 0,\n        logits_all: bool = False,\n        embedding: bool = False,\n        offload_kqv: bool = True,\n        flash_attn: bool = False,\n        op_offload: Optional[bool] = None,\n        swa_full: Optional[bool] = None,\n        # Sampling Params\n        no_perf: bool = False,\n        last_n_tokens_size: int = 64,\n        # LoRA Params\n        lora_base: Optional[str] = None,\n        lora_scale: float = 1.0,\n        lora_path: Optional[str] = None,\n        # Backend Params\n        numa: Union[bool, int] = False,\n        # Chat Format Params\n        chat_format: Optional[str] = None,\n        chat_handler: Optional[llama_chat_format.LlamaChatCompletionHandler] = None,\n        # Speculative Decoding\n        draft_model: Optional[LlamaDraftModel] = None,\n        # Tokenizer Override\n        tokenizer: Optional[BaseLlamaTokenizer] = None,\n        # KV cache quantization\n        type_k: Optional[int] = None,\n        type_v: Optional[int] = None,\n        # Misc\n        spm_infill: bool = False,\n        verbose: bool = True,\n        # Extra Params\n        **kwargs,  # type: ignore\n    ):\n        \"\"\"Load a llama.cpp model from `model_path`.\n\n        Examples:\n            Basic usage\n\n            >>> import llama_cpp\n            >>> model = llama_cpp.Llama(\n            ...     model_path=\"path/to/model\",\n            ... 
)\n            >>> print(model(\"The quick brown fox jumps \", stop=[\".\"])[\"choices\"][0][\"text\"])\n            the lazy dog\n\n            Loading a chat model\n\n            >>> import llama_cpp\n            >>> model = llama_cpp.Llama(\n            ...     model_path=\"path/to/model\",\n            ...     chat_format=\"llama-2\",\n            ... )\n            >>> print(model.create_chat_completion(\n            ...     messages=[{\n            ...         \"role\": \"user\",\n            ...         \"content\": \"what is the meaning of life?\"\n            ...     }]\n            ... ))\n\n        Args:\n            model_path: Path to the model.\n            n_gpu_layers: Number of layers to offload to GPU (-ngl). If -1, all layers are offloaded.\n            split_mode: How to split the model across GPUs. See llama_cpp.LLAMA_SPLIT_* for options.\n            main_gpu: main_gpu interpretation depends on split_mode: LLAMA_SPLIT_MODE_NONE: the GPU that is used for the entire model. LLAMA_SPLIT_MODE_ROW: the GPU that is used for small tensors and intermediate results. LLAMA_SPLIT_MODE_LAYER: ignored\n            tensor_split: How split tensors should be distributed across GPUs. If None, the model is not split.\n            vocab_only: Only load the vocabulary no weights.\n            use_mmap: Use mmap if possible.\n            use_mlock: Force the system to keep the model in RAM.\n            kv_overrides: Key-value overrides for the model.\n            seed: RNG seed, -1 for random\n            n_ctx: Text context, 0 = from model\n            n_batch: Prompt processing maximum batch size\n            n_ubatch: Physical batch size\n            n_threads: Number of threads to use for generation\n            n_threads_batch: Number of threads to use for batch processing\n            rope_scaling_type: RoPE scaling type, from `enum llama_rope_scaling_type`. 
ref: https://github.com/ggerganov/llama.cpp/pull/2054\n            pooling_type: Pooling type, from `enum llama_pooling_type`.\n            rope_freq_base: RoPE base frequency, 0 = from model\n            rope_freq_scale: RoPE frequency scaling factor, 0 = from model\n            yarn_ext_factor: YaRN extrapolation mix factor, negative = from model\n            yarn_attn_factor: YaRN magnitude scaling factor\n            yarn_beta_fast: YaRN low correction dim\n            yarn_beta_slow: YaRN high correction dim\n            yarn_orig_ctx: YaRN original context size\n            logits_all: Return logits for all tokens, not just the last token. Must be True for completion to return logprobs.\n            embedding: Embedding mode only.\n            offload_kqv: Offload K, Q, V to GPU.\n            flash_attn: Use flash attention.\n            op_offload: Offload host tensor operations to device.\n            swa_full: Use full-size SWA cache (https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)\n            no_perf: Disable performance timing measurements.\n            last_n_tokens_size: Maximum number of tokens to keep in the last_n_tokens deque.\n            lora_base: Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.\n            lora_scale: Scale applied to the LoRA adapter when it is set on the context.\n            lora_path: Path to a LoRA file to apply to the model.\n            numa: NUMA policy.\n            chat_format: String specifying the chat format to use when calling create_chat_completion.\n            chat_handler: Optional chat handler to use when calling create_chat_completion.\n            draft_model: Optional draft model to use for speculative decoding.\n            tokenizer: Optional tokenizer to override the default tokenizer from llama.cpp.\n            verbose: Print verbose output to stderr.\n            type_k: KV cache data type for K (default: f16)\n            type_v: KV cache data type for V (default: f16)\n            spm_infill: Use 
Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle) as some models prefer this.\n\n        Raises:\n            ValueError: If the model path does not exist.\n\n        Returns:\n            A Llama instance.\n        \"\"\"\n        self.verbose = verbose\n        self._stack = contextlib.ExitStack()\n\n        set_verbose(verbose)\n\n        if not Llama.__backend_initialized:\n            with suppress_stdout_stderr(disable=verbose):\n                llama_cpp.llama_backend_init()\n            Llama.__backend_initialized = True\n\n        if isinstance(numa, bool):\n            self.numa = (\n                llama_cpp.GGML_NUMA_STRATEGY_DISTRIBUTE\n                if numa\n                else llama_cpp.GGML_NUMA_STRATEGY_DISABLED\n            )\n        else:\n            self.numa = numa\n\n        if self.numa != llama_cpp.GGML_NUMA_STRATEGY_DISABLED:\n            with suppress_stdout_stderr(disable=verbose):\n                llama_cpp.llama_numa_init(self.numa)\n\n        self.model_path = model_path\n\n        # Model Params\n        self.model_params = llama_cpp.llama_model_default_params()\n        self.model_params.n_gpu_layers = (\n            0x7FFFFFFF if n_gpu_layers == -1 else n_gpu_layers\n        )  # 0x7FFFFFFF is INT32 max, will be auto set to all layers\n        self.model_params.split_mode = split_mode\n        self.model_params.main_gpu = main_gpu\n        self.tensor_split = tensor_split\n        self._c_tensor_split = None\n        if self.tensor_split is not None:\n            if len(self.tensor_split) > llama_cpp.LLAMA_MAX_DEVICES:\n                raise ValueError(\n                    f\"Attempt to split tensors that exceed maximum supported devices. 
Current LLAMA_MAX_DEVICES={llama_cpp.LLAMA_MAX_DEVICES}\"\n                )\n            # Type conversion and expand the list to the length of LLAMA_MAX_DEVICES\n            FloatArray = ctypes.c_float * llama_cpp.LLAMA_MAX_DEVICES\n            self._c_tensor_split = FloatArray(\n                *tensor_split  # type: ignore\n            )  # keep a reference to the array so it is not gc'd\n            self.model_params.tensor_split = self._c_tensor_split\n        self.model_params.vocab_only = vocab_only\n        self.model_params.use_mmap = use_mmap if lora_path is None else False\n        self.model_params.use_mlock = use_mlock\n\n        # kv_overrides is the original python dict\n        self.kv_overrides = kv_overrides\n        if kv_overrides is not None:\n            # _kv_overrides_array is a ctypes.Array of llama_model_kv_override Structs\n            kvo_array_len = len(kv_overrides) + 1  # for sentinel element\n            self._kv_overrides_array = (\n                llama_cpp.llama_model_kv_override * kvo_array_len\n            )()\n\n            for i, (k, v) in enumerate(kv_overrides.items()):\n                self._kv_overrides_array[i].key = k.encode(\"utf-8\")\n                if isinstance(v, bool):\n                    self._kv_overrides_array[\n                        i\n                    ].tag = llama_cpp.LLAMA_KV_OVERRIDE_TYPE_BOOL\n                    self._kv_overrides_array[i].value.val_bool = v\n                elif isinstance(v, int):\n                    self._kv_overrides_array[\n                        i\n                    ].tag = llama_cpp.LLAMA_KV_OVERRIDE_TYPE_INT\n                    self._kv_overrides_array[i].value.val_i64 = v\n                elif isinstance(v, float):\n                    self._kv_overrides_array[\n                        i\n                    ].tag = llama_cpp.LLAMA_KV_OVERRIDE_TYPE_FLOAT\n                    self._kv_overrides_array[i].value.val_f64 = v\n                elif isinstance(v, str):  # 
type: ignore\n                    v_bytes = v.encode(\"utf-8\")\n                    if len(v_bytes) > 128:  # TODO: Make this a constant\n                        raise ValueError(f\"Value for {k} is too long: {v}\")\n                    v_bytes = v_bytes.ljust(128, b\"\\0\")\n                    self._kv_overrides_array[\n                        i\n                    ].tag = llama_cpp.LLAMA_KV_OVERRIDE_TYPE_STR\n                    # copy min(v_bytes, 128) to str_value\n                    address = typing.cast(\n                        int,\n                        ctypes.addressof(self._kv_overrides_array[i].value)\n                        + llama_cpp.llama_model_kv_override_value.val_str.offset,\n                    )\n                    buffer_start = ctypes.cast(address, ctypes.POINTER(ctypes.c_char))\n                    ctypes.memmove(\n                        buffer_start,\n                        v_bytes,\n                        128,\n                    )\n                else:\n                    raise ValueError(f\"Unknown value type for {k}: {v}\")\n\n            self._kv_overrides_array[\n                -1\n            ].key = b\"\\0\"  # ensure sentinel element is zeroed\n            self.model_params.kv_overrides = self._kv_overrides_array\n\n        self.n_batch = min(n_ctx, n_batch)  # ???\n        self.n_threads = n_threads or max(multiprocessing.cpu_count() // 2, 1)\n        self.n_threads_batch = n_threads_batch or multiprocessing.cpu_count()\n\n        # Used by the sampler\n        self._seed = seed or llama_cpp.LLAMA_DEFAULT_SEED\n\n        # Context Params\n        self.context_params = llama_cpp.llama_context_default_params()\n        self.context_params.n_ctx = n_ctx\n        self.context_params.n_batch = self.n_batch\n        self.context_params.n_ubatch = min(self.n_batch, n_ubatch)\n        self.context_params.n_threads = self.n_threads\n        self.context_params.n_threads_batch = self.n_threads_batch\n        
self.context_params.rope_scaling_type = (\n            rope_scaling_type\n            if rope_scaling_type is not None\n            else llama_cpp.LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED\n        )\n        self.context_params.pooling_type = pooling_type\n        self.context_params.rope_freq_base = (\n            rope_freq_base if rope_freq_base != 0.0 else 0\n        )\n        self.context_params.rope_freq_scale = (\n            rope_freq_scale if rope_freq_scale != 0.0 else 0\n        )\n        self.context_params.yarn_ext_factor = (\n            yarn_ext_factor if yarn_ext_factor != 0.0 else 0\n        )\n        self.context_params.yarn_attn_factor = (\n            yarn_attn_factor if yarn_attn_factor != 0.0 else 0\n        )\n        self.context_params.yarn_beta_fast = (\n            yarn_beta_fast if yarn_beta_fast != 0.0 else 0\n        )\n        self.context_params.yarn_beta_slow = (\n            yarn_beta_slow if yarn_beta_slow != 0.0 else 0\n        )\n        self.context_params.yarn_orig_ctx = yarn_orig_ctx if yarn_orig_ctx != 0 else 0\n        self._logits_all = logits_all if draft_model is None else True\n        self.context_params.embeddings = embedding  # TODO: Rename to embeddings\n        self.context_params.offload_kqv = offload_kqv\n        self.context_params.flash_attn = flash_attn\n\n        if op_offload is not None:\n            self.context_params.op_offload = op_offload\n\n        if swa_full is not None:\n            self.context_params.swa_full = swa_full\n\n        #  KV cache quantization\n        if type_k is not None:\n            self.context_params.type_k = type_k\n        if type_v is not None:\n            self.context_params.type_v = type_v\n        # Sampling Params\n        self.context_params.no_perf = no_perf\n        self.last_n_tokens_size = last_n_tokens_size\n\n        self.cache: Optional[BaseLlamaCache] = None\n\n        self.lora_base = lora_base\n        self.lora_scale = lora_scale\n        self.lora_path = 
lora_path\n\n        self.spm_infill = spm_infill\n\n        if not os.path.exists(model_path):\n            raise ValueError(f\"Model path does not exist: {model_path}\")\n\n        self._model = self._stack.enter_context(\n            contextlib.closing(\n                internals.LlamaModel(\n                    path_model=self.model_path,\n                    params=self.model_params,\n                    verbose=self.verbose,\n                )\n            )\n        )\n\n        # Override tokenizer\n        self.tokenizer_ = tokenizer or LlamaTokenizer(self)\n\n        # Set the default value for the context and correct the batch\n        if n_ctx == 0:\n            n_ctx = self._model.n_ctx_train()\n            self.n_batch = min(n_ctx, n_batch)\n            self.context_params.n_ctx = self._model.n_ctx_train()\n            self.context_params.n_batch = self.n_batch\n            self.context_params.n_ubatch = min(self.n_batch, n_ubatch)\n\n        self._ctx = self._stack.enter_context(\n            contextlib.closing(\n                internals.LlamaContext(\n                    model=self._model,\n                    params=self.context_params,\n                    verbose=self.verbose,\n                )\n            )\n        )\n\n        self._batch = self._stack.enter_context(\n            contextlib.closing(\n                internals.LlamaBatch(\n                    n_tokens=self.n_batch,\n                    embd=0,\n                    n_seq_max=self.context_params.n_ctx,\n                    verbose=self.verbose,\n                )\n            )\n        )\n\n        self._lora_adapter: Optional[llama_cpp.llama_adapter_lora_p] = None\n\n        if self.lora_path:\n            self._lora_adapter = llama_cpp.llama_adapter_lora_init(\n                self._model.model,\n                self.lora_path.encode(\"utf-8\"),\n            )\n            if self._lora_adapter is None:\n                raise RuntimeError(\n                    f\"Failed to 
initialize LoRA adapter from lora path: {self.lora_path}\"\n                )\n\n            def free_lora_adapter():\n                if self._lora_adapter is None:\n                    return\n                llama_cpp.llama_adapter_lora_free(self._lora_adapter)\n                self._lora_adapter = None\n\n            self._stack.callback(free_lora_adapter)\n\n            if llama_cpp.llama_set_adapter_lora(\n                self._ctx.ctx, self._lora_adapter, self.lora_scale\n            ):\n                raise RuntimeError(\n                    f\"Failed to set LoRA adapter from lora path: {self.lora_path}\"\n                )\n\n        if self.verbose:\n            print(llama_cpp.llama_print_system_info().decode(\"utf-8\"), file=sys.stderr)\n\n        self.chat_format = chat_format\n        self.chat_handler = chat_handler\n        self._chat_handlers: Dict[\n            str, llama_chat_format.LlamaChatCompletionHandler\n        ] = {}\n\n        self.draft_model = draft_model\n\n        self._n_vocab = self.n_vocab()\n        self._n_ctx = self.n_ctx()\n\n        self._token_nl = self.token_nl()\n        self._token_eos = self.token_eos()\n\n        self._candidates = internals.LlamaTokenDataArray(n_vocab=self._n_vocab)\n\n        self.n_tokens = 0\n        self.input_ids: npt.NDArray[np.intc] = np.ndarray((n_ctx,), dtype=np.intc)\n        self.scores: npt.NDArray[np.single] = np.ndarray(\n            (n_ctx if logits_all == True else n_batch, self._n_vocab), dtype=np.single\n        )\n\n        self._mirostat_mu = ctypes.c_float(\n            2.0 * 5.0\n        )  # TODO: Move this to sampling context\n\n        try:\n            self.metadata = self._model.metadata()\n        except Exception as e:\n            self.metadata = {}\n            if self.verbose:\n                print(f\"Failed to load metadata: {e}\", file=sys.stderr)\n\n        if self.verbose:\n            print(f\"Model metadata: {self.metadata}\", file=sys.stderr)\n\n        
eos_token_id = self.token_eos()\n        bos_token_id = self.token_bos()\n\n        eos_token = (\n            self._model.token_get_text(eos_token_id) if eos_token_id != -1 else \"\"\n        )\n        bos_token = (\n            self._model.token_get_text(bos_token_id) if bos_token_id != -1 else \"\"\n        )\n\n        # Unfortunately the llama.cpp API does not return metadata arrays, so we can't get template names from tokenizer.chat_templates\n        template_choices = dict(\n            (name[10:], template)\n            for name, template in self.metadata.items()\n            if name.startswith(\"tokenizer.chat_template.\")\n        )\n\n        if \"tokenizer.chat_template\" in self.metadata:\n            template_choices[\"chat_template.default\"] = self.metadata[\n                \"tokenizer.chat_template\"\n            ]\n\n        if self.verbose and template_choices:\n            print(\n                f\"Available chat formats from metadata: {', '.join(template_choices.keys())}\",\n                file=sys.stderr,\n            )\n\n        for name, template in template_choices.items():\n            self._chat_handlers[name] = llama_chat_format.Jinja2ChatFormatter(\n                template=template,\n                eos_token=eos_token,\n                bos_token=bos_token,\n                stop_token_ids=[eos_token_id],\n            ).to_chat_handler()\n\n        if (\n            self.chat_format is None\n            and self.chat_handler is None\n            and \"chat_template.default\" in template_choices\n        ):\n            chat_format = llama_chat_format.guess_chat_format_from_gguf_metadata(\n                self.metadata\n            )\n\n            if chat_format is not None:\n                self.chat_format = chat_format\n                if self.verbose:\n                    print(f\"Guessed chat format: {chat_format}\", file=sys.stderr)\n            else:\n                if self.verbose:\n                    print(\n            
            f\"Using gguf chat template: {template_choices['chat_template.default']}\",\n                        file=sys.stderr,\n                    )\n                    print(f\"Using chat eos_token: {eos_token}\", file=sys.stderr)\n                    print(f\"Using chat bos_token: {bos_token}\", file=sys.stderr)\n\n                self.chat_format = \"chat_template.default\"\n\n        if self.chat_format is None and self.chat_handler is None:\n            self.chat_format = \"llama-2\"\n            if self.verbose:\n                print(\n                    f\"Using fallback chat format: {self.chat_format}\", file=sys.stderr\n                )\n\n        self._sampler = None\n\n    @property\n    def ctx(self) -> llama_cpp.llama_context_p:\n        return self._ctx.ctx\n\n    @property\n    def model(self) -> llama_cpp.llama_model_p:\n        return self._model.model\n\n    @property\n    def _input_ids(self) -> npt.NDArray[np.intc]:\n        return self.input_ids[: self.n_tokens]\n\n    @property\n    def _scores(self) -> npt.NDArray[np.single]:\n        return self.scores[: self.n_tokens, :]\n\n    @property\n    def eval_tokens(self) -> Deque[int]:\n        return deque(self.input_ids[: self.n_tokens].tolist(), maxlen=self._n_ctx)\n\n    @property\n    def eval_logits(self) -> Deque[List[float]]:\n        return deque(\n            self.scores[: self.n_tokens, :].tolist(),\n            maxlen=self._n_ctx if self._logits_all else 1,\n        )\n\n    def tokenize(\n        self, text: bytes, add_bos: bool = True, special: bool = False\n    ) -> List[int]:\n        \"\"\"Tokenize a string.\n\n        Args:\n            text: The utf-8 encoded string to tokenize.\n            add_bos: Whether to add a beginning of sequence token.\n            special: Whether to tokenize special tokens.\n\n        Raises:\n            RuntimeError: If the tokenization failed.\n\n        Returns:\n            A list of tokens.\n        \"\"\"\n        return 
self.tokenizer_.tokenize(text, add_bos, special)\n\n    def detokenize(\n        self,\n        tokens: List[int],\n        prev_tokens: Optional[List[int]] = None,\n        special: bool = False,\n    ) -> bytes:\n        \"\"\"Detokenize a list of tokens.\n\n        Args:\n            tokens: The list of tokens to detokenize.\n            prev_tokens: The list of previous tokens. Offset mapping will be performed if provided.\n            special: Whether to detokenize special tokens.\n\n        Returns:\n            The detokenized string.\n        \"\"\"\n        return self.tokenizer_.detokenize(\n            tokens, prev_tokens=prev_tokens, special=special\n        )\n\n    def set_cache(self, cache: Optional[BaseLlamaCache]):\n        \"\"\"Set the cache.\n\n        Args:\n            cache: The cache to set.\n        \"\"\"\n        self.cache = cache\n\n    def set_seed(self, seed: int):\n        \"\"\"Set the random seed.\n\n        Args:\n            seed: The random seed.\n        \"\"\"\n        self._seed = seed\n\n    def reset(self):\n        \"\"\"Reset the model state.\"\"\"\n        self.n_tokens = 0\n\n    def eval(self, tokens: Sequence[int]):\n        \"\"\"Evaluate a list of tokens.\n\n        Args:\n            tokens: The list of tokens to evaluate.\n        \"\"\"\n        self._ctx.kv_cache_seq_rm(-1, self.n_tokens, -1)\n        for i in range(0, len(tokens), self.n_batch):\n            batch = tokens[i : min(len(tokens), i + self.n_batch)]\n            n_past = self.n_tokens\n            n_tokens = len(batch)\n            self._batch.set_batch(\n                batch=batch, n_past=n_past, logits_all=self._logits_all\n            )\n            self._ctx.decode(self._batch)\n            # Save tokens\n            self.input_ids[n_past : n_past + n_tokens] = batch\n            # Save logits\n            if self._logits_all:\n                rows = n_tokens\n                cols = self._n_vocab\n                logits = 
np.ctypeslib.as_array(\n                    self._ctx.get_logits(), shape=(rows * cols,)\n                )\n                self.scores[n_past : n_past + n_tokens, :].reshape(-1)[::] = logits\n            else:\n                # rows = 1\n                # cols = self._n_vocab\n                # logits = np.ctypeslib.as_array(\n                #     self._ctx.get_logits(), shape=(rows * cols,)\n                # )\n                # self.scores[n_past + n_tokens - 1, :].reshape(-1)[::] = logits\n                # NOTE: Now that sampling is done inside the sampler, logits are only needed for logprobs which requires logits_all\n                pass\n            # Update n_tokens\n            self.n_tokens += n_tokens\n\n    def _init_sampler(\n        self,\n        top_k: int = 40,\n        top_p: float = 0.95,\n        min_p: float = 0.05,\n        typical_p: float = 1.0,\n        temp: float = 0.80,\n        repeat_penalty: float = 1.0,\n        frequency_penalty: float = 0.0,\n        presence_penalty: float = 0.0,\n        tfs_z: float = 1.0,\n        mirostat_mode: int = 0,\n        mirostat_eta: float = 0.1,\n        mirostat_tau: float = 5.0,\n        penalize_nl: bool = True,\n        logits_processor: Optional[LogitsProcessorList] = None,\n        grammar: Optional[LlamaGrammar] = None,\n    ):\n        sampler = internals.LlamaSampler()\n\n        if logits_processor is not None:\n            # Create and add a custom sampler\n            def apply_func(token_data_array: llama_cpp.llama_token_data_array_p):\n                size = token_data_array.contents.size\n                data_soa = token_data_array.contents.data\n                data_soa_address = ctypes.addressof(data_soa.contents)\n                # NOTE: This is probably broken\n                recarray = np.recarray(\n                    shape=(size,),\n                    dtype=np.dtype(\n                        [(\"id\", np.intc), (\"logit\", np.single), (\"p\", np.single)],\n                
        align=True,\n                    ),\n                    buf=(llama_cpp.llama_token_data * size).from_address(\n                        data_soa_address\n                    ),\n                )\n                for logit_processor in logits_processor:\n                    recarray.logit[:] = logit_processor(self._input_ids, recarray.logit)\n\n            sampler.add_custom(apply_func)\n\n        sampler.add_penalties(\n            # n_vocab=self._n_vocab,\n            # special_eos_id=self._token_eos,\n            # linefeed_id=self._token_nl,\n            penalty_last_n=self.last_n_tokens_size,\n            penalty_repeat=repeat_penalty,\n            penalty_freq=frequency_penalty,\n            penalty_present=presence_penalty,\n            # penalize_nl=penalize_nl,\n            # ignore_eos=False,\n        )\n\n        if grammar is not None:\n            sampler.add_grammar(self._model, grammar)\n\n        if temp < 0.0:\n            sampler.add_softmax()\n            sampler.add_dist(self._seed)\n        elif temp == 0.0:\n            sampler.add_greedy()\n        else:\n            if mirostat_mode == 1:\n                mirostat_m = 100\n                sampler.add_mirostat(\n                    self._n_vocab,\n                    self._seed,\n                    mirostat_tau,\n                    mirostat_eta,\n                    mirostat_m,\n                )\n            elif mirostat_mode == 2:\n                sampler.add_mirostat_v2(\n                    self._seed,\n                    mirostat_tau,\n                    mirostat_eta,\n                )\n            else:\n                n_probs = 0\n                min_keep = max(1, n_probs)\n                sampler.add_top_k(top_k)\n                sampler.add_typical(typical_p, min_keep)\n                sampler.add_top_p(top_p, min_keep)\n                sampler.add_min_p(min_p, min_keep)\n                sampler.add_temp(temp)\n                sampler.add_dist(self._seed)\n        
return sampler\n\n    def sample(\n        self,\n        top_k: int = 40,\n        top_p: float = 0.95,\n        min_p: float = 0.05,\n        typical_p: float = 1.0,\n        temp: float = 0.80,\n        repeat_penalty: float = 1.0,\n        frequency_penalty: float = 0.0,\n        presence_penalty: float = 0.0,\n        tfs_z: float = 1.0,\n        mirostat_mode: int = 0,\n        mirostat_eta: float = 0.1,\n        mirostat_tau: float = 5.0,\n        penalize_nl: bool = True,\n        logits_processor: Optional[LogitsProcessorList] = None,\n        grammar: Optional[LlamaGrammar] = None,\n        idx: Optional[int] = None,\n    ):\n        \"\"\"Sample a token from the model.\n\n        Args:\n            top_k: The top-k sampling parameter.\n            top_p: The top-p sampling parameter.\n            temp: The temperature parameter.\n            repeat_penalty: The repeat penalty parameter.\n\n        Returns:\n            The sampled token.\n        \"\"\"\n        assert self.n_tokens > 0\n\n        tmp_sampler = False\n\n        if self._sampler is None:\n            tmp_sampler = True\n            self._sampler = self._init_sampler(\n                top_k=top_k,\n                top_p=top_p,\n                min_p=min_p,\n                typical_p=typical_p,\n                temp=temp,\n                repeat_penalty=repeat_penalty,\n                frequency_penalty=frequency_penalty,\n                presence_penalty=presence_penalty,\n                tfs_z=tfs_z,\n                mirostat_mode=mirostat_mode,\n                mirostat_tau=mirostat_tau,\n                mirostat_eta=mirostat_eta,\n                penalize_nl=penalize_nl,\n                logits_processor=logits_processor,\n                grammar=grammar,\n            )\n\n        ridx = idx - self.n_tokens if idx is not None else -1\n\n        assert self.ctx is not None\n        token = self._sampler.sample(self._ctx, ridx)\n        if tmp_sampler:\n            self._sampler = None\n  
      return token\n\n    def generate(\n        self,\n        tokens: Sequence[int],\n        top_k: int = 40,\n        top_p: float = 0.95,\n        min_p: float = 0.05,\n        typical_p: float = 1.0,\n        temp: float = 0.80,\n        repeat_penalty: float = 1.0,\n        reset: bool = True,\n        frequency_penalty: float = 0.0,\n        presence_penalty: float = 0.0,\n        tfs_z: float = 1.0,\n        mirostat_mode: int = 0,\n        mirostat_tau: float = 5.0,\n        mirostat_eta: float = 0.1,\n        penalize_nl: bool = True,\n        logits_processor: Optional[LogitsProcessorList] = None,\n        stopping_criteria: Optional[StoppingCriteriaList] = None,\n        grammar: Optional[LlamaGrammar] = None,\n    ) -> Generator[int, Optional[Sequence[int]], None]:\n        \"\"\"Create a generator of tokens from a prompt.\n\n        Examples:\n            >>> llama = Llama(\"models/ggml-7b.bin\")\n            >>> tokens = llama.tokenize(b\"Hello, world!\")\n            >>> for token in llama.generate(tokens, top_k=40, top_p=0.95, temp=1.0, repeat_penalty=1.0):\n            ...     
print(llama.detokenize([token]))\n\n        Args:\n            tokens: The prompt tokens.\n            top_k: The top-k sampling parameter.\n            top_p: The top-p sampling parameter.\n            temp: The temperature parameter.\n            repeat_penalty: The repeat penalty parameter.\n            reset: Whether to reset the model state.\n\n        Yields:\n            The generated tokens.\n        \"\"\"\n        # Reset mirostat sampling\n        self._mirostat_mu = ctypes.c_float(2.0 * mirostat_tau)\n        self._sampler = self._init_sampler(\n            top_k=top_k,\n            top_p=top_p,\n            min_p=min_p,\n            typical_p=typical_p,\n            temp=temp,\n            repeat_penalty=repeat_penalty,\n            frequency_penalty=frequency_penalty,\n            presence_penalty=presence_penalty,\n            tfs_z=tfs_z,\n            mirostat_mode=mirostat_mode,\n            mirostat_tau=mirostat_tau,\n            mirostat_eta=mirostat_eta,\n            penalize_nl=penalize_nl,\n            logits_processor=logits_processor,\n            grammar=grammar,\n        )\n\n        # Check for kv cache prefix match\n        if reset and self.n_tokens > 0:\n            longest_prefix = 0\n            for a, b in zip(self._input_ids, tokens[:-1]):\n                if a == b:\n                    longest_prefix += 1\n                else:\n                    break\n            if longest_prefix > 0:\n                reset = False\n                tokens = tokens[longest_prefix:]\n                self.n_tokens = longest_prefix\n                if self.verbose:\n                    print(\n                        f\"Llama.generate: {longest_prefix} prefix-match hit, \"\n                        f\"remaining {len(tokens)} prompt tokens to eval\",\n                        file=sys.stderr,\n                    )\n\n        # Reset the model state\n        if reset:\n            self.reset()\n\n        # # Reset the grammar\n        # if grammar 
is not None:\n        #     grammar.reset()\n\n        sample_idx = self.n_tokens + len(tokens) - 1\n        tokens = list(tokens)\n\n        # Eval and sample\n        while True:\n            self.eval(tokens)\n            while sample_idx < self.n_tokens:\n                token = self.sample(\n                    top_k=top_k,\n                    top_p=top_p,\n                    min_p=min_p,\n                    typical_p=typical_p,\n                    temp=temp,\n                    repeat_penalty=repeat_penalty,\n                    frequency_penalty=frequency_penalty,\n                    presence_penalty=presence_penalty,\n                    tfs_z=tfs_z,\n                    mirostat_mode=mirostat_mode,\n                    mirostat_tau=mirostat_tau,\n                    mirostat_eta=mirostat_eta,\n                    logits_processor=logits_processor,\n                    grammar=grammar,\n                    penalize_nl=penalize_nl,\n                    idx=sample_idx,\n                )\n\n                sample_idx += 1\n                if stopping_criteria is not None and stopping_criteria(\n                    self._input_ids[: sample_idx], self._scores[sample_idx - self.n_tokens, :]\n                ):\n                    return\n                tokens_or_none = yield token\n                tokens.clear()\n                tokens.append(token)\n                if tokens_or_none is not None:\n                    tokens.extend(tokens_or_none)\n\n                if sample_idx < self.n_tokens and token != self._input_ids[sample_idx]:\n                    self.n_tokens = sample_idx\n                    self._ctx.kv_cache_seq_rm(-1, self.n_tokens, -1)\n                    break\n\n            if self.draft_model is not None:\n                self.input_ids[self.n_tokens : self.n_tokens + len(tokens)] = tokens\n                draft_tokens = self.draft_model(\n                    self.input_ids[: self.n_tokens + len(tokens)]\n                )\n           
     tokens.extend(\n                    draft_tokens.astype(int)[\n                        : self._n_ctx - self.n_tokens - len(tokens)\n                    ]\n                )\n\n    def create_embedding(\n        self, input: Union[str, List[str]], model: Optional[str] = None\n    ) -> CreateEmbeddingResponse:\n        \"\"\"Embed a string.\n\n        Args:\n            input: The utf-8 encoded string to embed.\n\n        Returns:\n            An embedding object.\n        \"\"\"\n        model_name: str = model if model is not None else self.model_path\n\n        input = input if isinstance(input, list) else [input]\n\n        # get numeric embeddings\n        embeds: Union[List[List[float]], List[List[List[float]]]]\n        total_tokens: int\n        embeds, total_tokens = self.embed(input, return_count=True)  # type: ignore\n\n        # convert to CreateEmbeddingResponse\n        data: List[Embedding] = [\n            {\n                \"object\": \"embedding\",\n                \"embedding\": emb,\n                \"index\": idx,\n            }\n            for idx, emb in enumerate(embeds)\n        ]\n\n        return {\n            \"object\": \"list\",\n            \"data\": data,\n            \"model\": model_name,\n            \"usage\": {\n                \"prompt_tokens\": total_tokens,\n                \"total_tokens\": total_tokens,\n            },\n        }\n\n    def embed(\n        self,\n        input: Union[str, List[str]],\n        normalize: bool = False,\n        truncate: bool = True,\n        return_count: bool = False,\n    ):\n        \"\"\"Embed a string.\n\n        Args:\n            input: The utf-8 encoded string to embed.\n\n        Returns:\n            A list of embeddings\n        \"\"\"\n        n_embd = self.n_embd()\n        n_batch = self.n_batch\n\n        # get pooling information\n        pooling_type = self.pooling_type()\n        logits_all = pooling_type == llama_cpp.LLAMA_POOLING_TYPE_NONE\n\n        if 
self.context_params.embeddings is False:\n            raise RuntimeError(\n                \"Llama model must be created with embedding=True to call this method\"\n            )\n\n        if self.verbose:\n            llama_cpp.llama_perf_context_reset(self._ctx.ctx)\n\n        if isinstance(input, str):\n            inputs = [input]\n        else:\n            inputs = input\n\n        # reset batch\n        self._batch.reset()\n\n        # decode and fetch embeddings\n        data: Union[List[List[float]], List[List[List[float]]]] = []\n\n        def decode_batch(seq_sizes: List[int]):\n            llama_cpp.llama_kv_self_clear(self._ctx.ctx)\n            self._ctx.decode(self._batch)\n            self._batch.reset()\n\n            # store embeddings\n            if pooling_type == llama_cpp.LLAMA_POOLING_TYPE_NONE:\n                pos: int = 0\n                for i, size in enumerate(seq_sizes):\n                    ptr = llama_cpp.llama_get_embeddings(self._ctx.ctx)\n                    embedding: List[List[float]] = [\n                        ptr[pos + j * n_embd : pos + (j + 1) * n_embd]\n                        for j in range(size)\n                    ]\n                    if normalize:\n                        embedding = [\n                            internals.normalize_embedding(e) for e in embedding\n                        ]\n                    data.append(embedding)\n                    pos += size\n            else:\n                for i in range(len(seq_sizes)):\n                    ptr = llama_cpp.llama_get_embeddings_seq(self._ctx.ctx, i)\n                    embedding: List[float] = ptr[:n_embd]\n                    if normalize:\n                        embedding = internals.normalize_embedding(embedding)\n                    data.append(embedding)\n\n        # init state\n        total_tokens = 0\n        s_batch = []\n        t_batch = 0\n        p_batch = 0\n\n        # accumulate batches and encode\n        for text in inputs:\n       
     tokens = self.tokenize(text.encode(\"utf-8\"))\n            if truncate:\n                tokens = tokens[:n_batch]\n\n            n_tokens = len(tokens)\n            total_tokens += n_tokens\n\n            # check for overrun\n            if n_tokens > n_batch:\n                raise ValueError(\n                    f\"Requested tokens ({n_tokens}) exceed batch size of {n_batch}\"\n                )\n\n            # time to eval batch\n            if t_batch + n_tokens > n_batch:\n                decode_batch(s_batch)\n                s_batch = []\n                t_batch = 0\n                p_batch = 0\n\n            # add to batch\n            self._batch.add_sequence(tokens, p_batch, logits_all)\n\n            # update batch stats\n            s_batch.append(n_tokens)\n            t_batch += n_tokens\n            p_batch += 1\n\n        # handle last batch\n        decode_batch(s_batch)\n\n        if self.verbose:\n            llama_cpp.llama_perf_context_print(self._ctx.ctx)\n\n        output = data[0] if isinstance(input, str) else data\n\n        llama_cpp.llama_kv_self_clear(self._ctx.ctx)\n        self.reset()\n\n        if return_count:\n            return output, total_tokens\n        else:\n            return output\n\n    def _create_completion(\n        self,\n        prompt: Union[str, List[int]],\n        suffix: Optional[str] = None,\n        max_tokens: Optional[int] = 16,\n        temperature: float = 0.8,\n        top_p: float = 0.95,\n        min_p: float = 0.05,\n        typical_p: float = 1.0,\n        logprobs: Optional[int] = None,\n        echo: bool = False,\n        stop: Optional[Union[str, List[str]]] = [],\n        frequency_penalty: float = 0.0,\n        presence_penalty: float = 0.0,\n        repeat_penalty: float = 1.0,\n        top_k: int = 40,\n        stream: bool = False,\n        seed: Optional[int] = None,\n        tfs_z: float = 1.0,\n        mirostat_mode: int = 0,\n        mirostat_tau: float = 5.0,\n        
mirostat_eta: float = 0.1,\n        model: Optional[str] = None,\n        stopping_criteria: Optional[StoppingCriteriaList] = None,\n        logits_processor: Optional[LogitsProcessorList] = None,\n        grammar: Optional[LlamaGrammar] = None,\n        logit_bias: Optional[Dict[int, float]] = None,\n    ) -> Union[\n        Iterator[CreateCompletionResponse], Iterator[CreateCompletionStreamResponse]\n    ]:\n        assert suffix is None or suffix.__class__ is str\n\n        completion_id: str = f\"cmpl-{str(uuid.uuid4())}\"\n        created: int = int(time.time())\n        bos_token_id: int = self.token_bos()\n        cls_token_id: int = self._model.token_cls()\n        sep_token_id: int = self._model.token_sep()\n        prefix_token_id: int = 0 # self._model.token_prefix() # TODO: Fix\n        middle_token_id: int = 0 # self._model.token_middle() # TODO: Fix\n        suffix_token_id: int = 0 # self._model.token_suffix() # TODO: Fix\n        add_space_prefix: bool = (\n            self.metadata.get(\"tokenizer.ggml.add_space_prefix\", \"true\") == \"true\"\n        )\n        bos_tokens: List[int] = [cls_token_id if cls_token_id != -1 else bos_token_id]\n        eos_tokens: List[int] = [\n            sep_token_id if sep_token_id != -1 else self.token_eos()\n        ]\n\n        if (\n            (isinstance(prompt, list) and suffix is None)\n            or not self._model.add_bos_token()\n            or bos_tokens[:1] == [-1]\n        ):\n            bos_tokens = []\n\n        if (isinstance(prompt, list) and suffix is None) or (\n            not self._model.add_eos_token() and sep_token_id == -1\n        ):\n            eos_tokens = []\n\n        suffix_space_prefix: int = 0\n        # Tokenizer hack to remove leading space\n        if add_space_prefix and suffix_token_id >= 0 and suffix:\n            suffix = \"☺\" + suffix\n            suffix_space_prefix = 2\n\n        # If prompt is empty, initialize completion with BOS token to avoid\n        # 
detokenization including a space at the beginning of the completion\n        completion_tokens: List[int] = [] if len(prompt) > 0 else [bos_token_id]\n        # Add blank space to start of prompt to match OG llama tokenizer\n        prefix_tokens: List[int] = (\n            [prefix_token_id] if prefix_token_id >= 0 and suffix is not None else []\n        ) + (\n            (\n                self.tokenize(\n                    prompt.encode(\"utf-8\"),\n                    add_bos=False,\n                    special=(prefix_token_id < 0 or suffix is None),\n                )\n                if prompt != \"\"\n                else []\n            )\n            if isinstance(prompt, str)\n            else prompt\n        )\n        suffix_tokens: List[int] = (\n            (\n                [suffix_token_id]\n                + (\n                    self.tokenize(suffix.encode(\"utf-8\"), add_bos=False, special=False)[\n                        suffix_space_prefix:\n                    ]\n                    if suffix\n                    else []\n                )\n            )\n            if suffix_token_id >= 0 and suffix is not None\n            else []\n        )\n        middle_tokens: List[int] = (\n            [middle_token_id] if middle_token_id >= 0 and suffix is not None else []\n        )\n        prompt_tokens: List[int] = (\n            bos_tokens\n            + (\n                (suffix_tokens + prefix_tokens + middle_tokens)\n                if self.spm_infill\n                else (prefix_tokens + suffix_tokens + middle_tokens)\n            )\n            + eos_tokens\n        )\n        text: bytes = b\"\"\n        returned_tokens: int = 0\n        stop = (\n            stop if isinstance(stop, list) else [stop] if isinstance(stop, str) else []\n        )\n        model_name: str = model if model is not None else self.model_path\n\n        if prompt_tokens[:2] == [self.token_bos()] * 2:\n            warnings.warn(\n                f'Detected 
duplicate leading \"{self._model.token_get_text(self.token_bos())}\" in prompt, this will likely reduce response quality, consider removing it...',\n                RuntimeWarning,\n            )\n\n        # NOTE: This likely doesn't work correctly for the first token in the prompt\n        # because of the extra space added to the start of the prompt_tokens\n        if logit_bias is not None:\n            logit_bias_map = {int(k): float(v) for k, v in logit_bias.items()}\n\n            def logit_bias_processor(\n                input_ids: npt.NDArray[np.intc],\n                scores: npt.NDArray[np.single],\n            ) -> npt.NDArray[np.single]:\n                new_scores = np.copy(\n                    scores\n                )  # Does it make sense to copy the whole array or can we just overwrite the original one?\n                for input_id, score in logit_bias_map.items():\n                    new_scores[input_id] = score + scores[input_id]\n                return new_scores\n\n            _logit_bias_processor = LogitsProcessorList([logit_bias_processor])\n            if logits_processor is None:\n                logits_processor = _logit_bias_processor\n            else:\n                logits_processor = logits_processor.extend(_logit_bias_processor)\n\n        if self.verbose:\n            self._ctx.reset_timings()\n\n        if len(prompt_tokens) >= self._n_ctx:\n            raise ValueError(\n                f\"Requested tokens ({len(prompt_tokens)}) exceed context window of {llama_cpp.llama_n_ctx(self.ctx)}\"\n            )\n\n        if max_tokens is None or max_tokens <= 0:\n            # Unlimited, depending on n_ctx.\n            max_tokens = self._n_ctx - len(prompt_tokens)\n\n        # Truncate max_tokens if requested tokens would exceed the context window\n        max_tokens = (\n            max_tokens\n            if max_tokens + len(prompt_tokens) < self._n_ctx\n            else (self._n_ctx - len(prompt_tokens))\n        )\n\n        
if stop != []:\n            stop_sequences = [s.encode(\"utf-8\") for s in stop]\n        else:\n            stop_sequences = []\n\n        if logprobs is not None and self._logits_all is False:\n            raise ValueError(\n                \"logprobs is not supported for models created with logits_all=False\"\n            )\n\n        if self.cache:\n            try:\n                cache_item = self.cache[prompt_tokens]\n                cache_prefix_len = Llama.longest_token_prefix(\n                    cache_item.input_ids.tolist(), prompt_tokens\n                )\n                eval_prefix_len = Llama.longest_token_prefix(\n                    self._input_ids.tolist(), prompt_tokens\n                )\n                if cache_prefix_len > eval_prefix_len:\n                    self.load_state(cache_item)\n                    if self.verbose:\n                        print(\"Llama._create_completion: cache hit\", file=sys.stderr)\n            except KeyError:\n                if self.verbose:\n                    print(\"Llama._create_completion: cache miss\", file=sys.stderr)\n\n        if seed is not None:\n            self.set_seed(seed)\n        else:\n            self.set_seed(random.Random(self._seed).randint(0, 2 ** 32))\n\n        finish_reason = \"length\"\n        multibyte_fix = 0\n        for token in self.generate(\n            prompt_tokens,\n            top_k=top_k,\n            top_p=top_p,\n            min_p=min_p,\n            typical_p=typical_p,\n            temp=temperature,\n            tfs_z=tfs_z,\n            mirostat_mode=mirostat_mode,\n            mirostat_tau=mirostat_tau,\n            mirostat_eta=mirostat_eta,\n            frequency_penalty=frequency_penalty,\n            presence_penalty=presence_penalty,\n            repeat_penalty=repeat_penalty,\n            stopping_criteria=stopping_criteria,\n            logits_processor=logits_processor,\n            grammar=grammar,\n        ):\n            if 
llama_cpp.llama_token_is_eog(self._model.vocab, token):\n                text = self.detokenize(completion_tokens, prev_tokens=prompt_tokens)\n                finish_reason = \"stop\"\n                break\n\n            completion_tokens.append(token)\n\n            all_text = self.detokenize(completion_tokens, prev_tokens=prompt_tokens)\n\n            # Contains multi-byte UTF8\n            for k, char in enumerate(all_text[-3:]):\n                k = 3 - k\n                for num, pattern in [(2, 192), (3, 224), (4, 240)]:\n                    # Bitwise AND check\n                    if num > k and pattern & char == pattern:\n                        multibyte_fix = num - k\n\n            # Stop incomplete bytes from passing\n            if multibyte_fix > 0:\n                multibyte_fix -= 1\n                continue\n\n            any_stop = [s for s in stop_sequences if s in all_text]\n            if len(any_stop) > 0:\n                first_stop = any_stop[0]\n                text = all_text[: all_text.index(first_stop)]\n                finish_reason = \"stop\"\n                break\n\n            if stream:\n                remaining_tokens = completion_tokens[returned_tokens:]\n                remaining_text = self.detokenize(\n                    remaining_tokens,\n                    prev_tokens=prompt_tokens + completion_tokens[:returned_tokens],\n                )\n                remaining_length = len(remaining_text)\n\n                # We want to avoid yielding any characters from\n                # the generated text if they are part of a stop\n                # sequence.\n                first_stop_position = 0\n                for s in stop_sequences:\n                    for i in range(min(len(s), remaining_length), 0, -1):\n                        if remaining_text.endswith(s[:i]):\n                            if i > first_stop_position:\n                                first_stop_position = i\n                            break\n\n        
        token_end_position = 0\n\n                if logprobs is not None:\n                    # not sure how to handle this branch when dealing\n                    # with CJK output, so keep it unchanged\n                    for token in remaining_tokens:\n                        if token == bos_token_id:\n                            continue\n                        token_end_position += len(\n                            self.detokenize(\n                                [token],\n                                prev_tokens=prompt_tokens\n                                + completion_tokens[:returned_tokens],\n                            )\n                        )\n                        # Check if stop sequence is in the token\n                        if token_end_position > (\n                            remaining_length - first_stop_position\n                        ):\n                            break\n                        token_str = self.detokenize(\n                            [token],\n                            prev_tokens=prompt_tokens\n                            + completion_tokens[:returned_tokens],\n                        ).decode(\"utf-8\", errors=\"ignore\")\n                        text_offset = len(prompt) + len(\n                            self.detokenize(\n                                completion_tokens[:returned_tokens],\n                                prev_tokens=prompt_tokens\n                                + completion_tokens[:returned_tokens],\n                            ).decode(\"utf-8\", errors=\"ignore\")\n                        )\n                        token_offset = len(prompt_tokens) + returned_tokens\n                        logits = self._scores[token_offset - 1, :]\n                        current_logprobs = Llama.logits_to_logprobs(logits).tolist()\n                        sorted_logprobs = list(\n                            sorted(\n                                zip(current_logprobs, 
range(len(current_logprobs))),\n                                reverse=True,\n                            )\n                        )\n                        top_logprob = {\n                            self.detokenize([i]).decode(\n                                \"utf-8\", errors=\"ignore\"\n                            ): logprob\n                            for logprob, i in sorted_logprobs[:logprobs]\n                        }\n                        top_logprob.update({token_str: current_logprobs[int(token)]})\n                        logprobs_or_none = {\n                            \"tokens\": [\n                                self.detokenize(\n                                    [token],\n                                    prev_tokens=prompt_tokens\n                                    + completion_tokens[:returned_tokens],\n                                ).decode(\"utf-8\", errors=\"ignore\")\n                            ],\n                            \"text_offset\": [text_offset],\n                            \"token_logprobs\": [current_logprobs[int(token)]],\n                            \"top_logprobs\": [top_logprob],\n                        }\n                        returned_tokens += 1\n                        yield {\n                            \"id\": completion_id,\n                            \"object\": \"text_completion\",\n                            \"created\": created,\n                            \"model\": model_name,\n                            \"choices\": [\n                                {\n                                    \"text\": self.detokenize(\n                                        [token],\n                                        prev_tokens=prompt_tokens\n                                        + completion_tokens[:returned_tokens],\n                                    ).decode(\"utf-8\", errors=\"ignore\"),\n                                    \"index\": 0,\n                                    \"logprobs\": 
logprobs_or_none,\n                                    \"finish_reason\": None,\n                                }\n                            ],\n                        }\n                else:\n                    while len(remaining_tokens) > 0:\n                        decode_success = False\n                        for i in range(1, len(remaining_tokens) + 1):\n                            try:\n                                bs = self.detokenize(\n                                    remaining_tokens[:i],\n                                    prev_tokens=prompt_tokens\n                                    + completion_tokens[:returned_tokens],\n                                )\n                                ts = bs.decode(\"utf-8\")\n                                decode_success = True\n                                break\n                            except UnicodeError:\n                                pass\n                        else:\n                            break\n                        if not decode_success:\n                            # all remaining tokens cannot be decoded to a UTF-8 character\n                            break\n                        token_end_position += len(bs)\n                        if token_end_position > (\n                            remaining_length - first_stop_position\n                        ):\n                            break\n                        remaining_tokens = remaining_tokens[i:]\n                        returned_tokens += i\n\n                        yield {\n                            \"id\": completion_id,\n                            \"object\": \"text_completion\",\n                            \"created\": created,\n                            \"model\": model_name,\n                            \"choices\": [\n                                {\n                                    \"text\": ts,\n                                    \"index\": 0,\n                                    
\"logprobs\": None,\n                                    \"finish_reason\": None,\n                                }\n                            ],\n                        }\n\n            if len(completion_tokens) >= max_tokens:\n                text = self.detokenize(completion_tokens, prev_tokens=prompt_tokens)\n                finish_reason = \"length\"\n                break\n\n        if stopping_criteria is not None and stopping_criteria(\n            self._input_ids, self._scores[-1, :]\n        ):\n            text = self.detokenize(completion_tokens, prev_tokens=prompt_tokens)\n            finish_reason = \"stop\"\n\n        if self.verbose:\n            self._ctx.print_timings()\n\n        if stream:\n            remaining_tokens = completion_tokens[returned_tokens:]\n            remaining_text = self.detokenize(\n                remaining_tokens,\n                prev_tokens=prompt_tokens + completion_tokens[:returned_tokens],\n            )\n            any_stop = [s for s in stop_sequences if s in remaining_text]\n            if len(any_stop) > 0:\n                end = min(remaining_text.index(stop) for stop in any_stop)\n            else:\n                end = len(remaining_text)\n\n            token_end_position = 0\n            for token in remaining_tokens:\n                token_end_position += len(\n                    self.detokenize(\n                        [token],\n                        prev_tokens=prompt_tokens + completion_tokens[:returned_tokens],\n                    )\n                )\n\n                logprobs_or_none: Optional[CompletionLogprobs] = None\n                if logprobs is not None:\n                    if token == bos_token_id:\n                        continue\n                    token_str = self.detokenize([token]).decode(\n                        \"utf-8\", errors=\"ignore\"\n                    )\n                    text_offset = len(prompt) + len(\n                        self.detokenize(\n                
            completion_tokens[:returned_tokens],\n                            prev_tokens=prompt_tokens\n                            + completion_tokens[:returned_tokens],\n                        )\n                    )\n                    token_offset = len(prompt_tokens) + returned_tokens - 1\n                    logits = self._scores[token_offset, :]\n                    current_logprobs = Llama.logits_to_logprobs(logits).tolist()\n                    sorted_logprobs = list(\n                        sorted(\n                            zip(current_logprobs, range(len(current_logprobs))),\n                            reverse=True,\n                        )\n                    )\n                    top_logprob = {\n                        self.detokenize([i]).decode(\"utf-8\", errors=\"ignore\"): logprob\n                        for logprob, i in sorted_logprobs[:logprobs]\n                    }\n                    top_logprob.update({token_str: current_logprobs[int(token)]})\n                    logprobs_or_none = {\n                        \"tokens\": [\n                            self.detokenize([token]).decode(\"utf-8\", errors=\"ignore\")\n                        ],\n                        \"text_offset\": [text_offset],\n                        \"token_logprobs\": [current_logprobs[int(token)]],\n                        \"top_logprobs\": [top_logprob],\n                    }\n\n                if token_end_position >= end:\n                    last_text = self.detokenize([token])\n                    if token_end_position == end - 1:\n                        break\n                    returned_tokens += 1\n                    yield {\n                        \"id\": completion_id,\n                        \"object\": \"text_completion\",\n                        \"created\": created,\n                        \"model\": model_name,\n                        \"choices\": [\n                            {\n                                \"text\": 
last_text[\n                                    : len(last_text) - (token_end_position - end)\n                                ].decode(\"utf-8\", errors=\"ignore\"),\n                                \"index\": 0,\n                                \"logprobs\": logprobs_or_none,\n                                \"finish_reason\": None,\n                            }\n                        ],\n                    }\n                    break\n                returned_tokens += 1\n                yield {\n                    \"id\": completion_id,\n                    \"object\": \"text_completion\",\n                    \"created\": created,\n                    \"model\": model_name,\n                    \"choices\": [\n                        {\n                            \"text\": self.detokenize([token]).decode(\n                                \"utf-8\", errors=\"ignore\"\n                            ),\n                            \"index\": 0,\n                            \"logprobs\": logprobs_or_none,\n                            \"finish_reason\": None,\n                        }\n                    ],\n                }\n            yield {\n                \"id\": completion_id,\n                \"object\": \"text_completion\",\n                \"created\": created,\n                \"model\": model_name,\n                \"choices\": [\n                    {\n                        \"text\": \"\",\n                        \"index\": 0,\n                        \"logprobs\": None,\n                        \"finish_reason\": finish_reason,\n                    }\n                ],\n            }\n            if self.cache:\n                if self.verbose:\n                    print(\"Llama._create_completion: cache save\", file=sys.stderr)\n                self.cache[prompt_tokens + completion_tokens] = self.save_state()\n                if self.verbose:\n                    print(\"Llama._create_completion: cache saved\", file=sys.stderr)\n         
   return\n\n        if self.cache:\n            if self.verbose:\n                print(\"Llama._create_completion: cache save\", file=sys.stderr)\n            self.cache[prompt_tokens + completion_tokens] = self.save_state()\n\n        text_str = text.decode(\"utf-8\", errors=\"ignore\")\n\n        if echo:\n            text_str = prompt + text_str\n\n        if suffix_token_id < 0 and suffix is not None:\n            text_str = text_str + suffix\n\n        logprobs_or_none: Optional[CompletionLogprobs] = None\n        if logprobs is not None:\n            text_offset = 0 if echo else len(prompt)\n            token_offset = 0 if echo else len(prompt_tokens[1:])\n            text_offsets: List[int] = []\n            token_logprobs: List[Optional[float]] = []\n            tokens: List[str] = []\n            top_logprobs: List[Optional[Dict[str, float]]] = []\n\n            if echo:\n                # Remove leading BOS token if exists\n                all_tokens = (\n                    prompt_tokens[1 if prompt_tokens[0] == self.token_bos() else 0 :]\n                    + completion_tokens\n                )\n            else:\n                all_tokens = completion_tokens\n\n            all_token_strs = [\n                self.detokenize([token], prev_tokens=all_tokens[:i]).decode(\n                    \"utf-8\", errors=\"ignore\"\n                )\n                for i, token in enumerate(all_tokens)\n            ]\n            all_logprobs = Llama.logits_to_logprobs(self._scores)[token_offset:]\n            # TODO: may be able to change this loop to use np.take_along_dim\n            for idx, (token, token_str, logprobs_token) in enumerate(\n                zip(all_tokens, all_token_strs, all_logprobs)\n            ):\n                if token == bos_token_id:\n                    continue\n                text_offsets.append(\n                    text_offset\n                    + len(\n                        self.detokenize(all_tokens[:idx]).decode(\n    
                        \"utf-8\", errors=\"ignore\"\n                        )\n                    )\n                )\n                tokens.append(token_str)\n                sorted_logprobs = list(\n                    sorted(\n                        zip(logprobs_token, range(len(logprobs_token))), reverse=True\n                    )\n                )\n                token_logprobs.append(logprobs_token[int(token)])\n                top_logprob: Optional[Dict[str, float]] = {\n                    self.detokenize([i], prev_tokens=all_tokens[:idx]).decode(\n                        \"utf-8\", errors=\"ignore\"\n                    ): logprob\n                    for logprob, i in sorted_logprobs[:logprobs]\n                }\n                top_logprob.update({token_str: logprobs_token[int(token)]})\n                top_logprobs.append(top_logprob)\n            # Weird idosincracy of the OpenAI API where\n            # token_logprobs and top_logprobs are null for\n            # the first token.\n            if echo and len(all_tokens) > 0:\n                token_logprobs[0] = None\n                top_logprobs[0] = None\n            logprobs_or_none = {\n                \"tokens\": tokens,\n                \"text_offset\": text_offsets,\n                \"token_logprobs\": token_logprobs,\n                \"top_logprobs\": top_logprobs,\n            }\n\n        yield {\n            \"id\": completion_id,\n            \"object\": \"text_completion\",\n            \"created\": created,\n            \"model\": model_name,\n            \"choices\": [\n                {\n                    \"text\": text_str,\n                    \"index\": 0,\n                    \"logprobs\": logprobs_or_none,\n                    \"finish_reason\": finish_reason,\n                }\n            ],\n            \"usage\": {\n                \"prompt_tokens\": len(prompt_tokens),\n                \"completion_tokens\": len(completion_tokens),\n                
\"total_tokens\": len(prompt_tokens) + len(completion_tokens),\n            },\n        }\n\n    def create_completion(\n        self,\n        prompt: Union[str, List[int]],\n        suffix: Optional[str] = None,\n        max_tokens: Optional[int] = 16,\n        temperature: float = 0.8,\n        top_p: float = 0.95,\n        min_p: float = 0.05,\n        typical_p: float = 1.0,\n        logprobs: Optional[int] = None,\n        echo: bool = False,\n        stop: Optional[Union[str, List[str]]] = [],\n        frequency_penalty: float = 0.0,\n        presence_penalty: float = 0.0,\n        repeat_penalty: float = 1.0,\n        top_k: int = 40,\n        stream: bool = False,\n        seed: Optional[int] = None,\n        tfs_z: float = 1.0,\n        mirostat_mode: int = 0,\n        mirostat_tau: float = 5.0,\n        mirostat_eta: float = 0.1,\n        model: Optional[str] = None,\n        stopping_criteria: Optional[StoppingCriteriaList] = None,\n        logits_processor: Optional[LogitsProcessorList] = None,\n        grammar: Optional[LlamaGrammar] = None,\n        logit_bias: Optional[Dict[int, float]] = None,\n    ) -> Union[CreateCompletionResponse, Iterator[CreateCompletionStreamResponse]]:\n        \"\"\"Generate text from a prompt.\n\n        Args:\n            prompt: The prompt to generate text from.\n            suffix: A suffix to append to the generated text. If None, no suffix is appended.\n            max_tokens: The maximum number of tokens to generate. If max_tokens <= 0 or None, the maximum number of tokens to generate is unlimited and depends on n_ctx.\n            temperature: The temperature to use for sampling.\n            top_p: The top-p value to use for nucleus sampling. Nucleus sampling described in academic paper \"The Curious Case of Neural Text Degeneration\" https://arxiv.org/abs/1904.09751\n            min_p: The min-p value to use for minimum p sampling. 
Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841\n            typical_p: The typical-p value to use for sampling. Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.\n            logprobs: The number of logprobs to return. If None, no logprobs are returned.\n            echo: Whether to echo the prompt.\n            stop: A list of strings to stop generation when encountered.\n            frequency_penalty: The penalty to apply to tokens based on their frequency in the prompt.\n            presence_penalty: The penalty to apply to tokens based on their presence in the prompt.\n            repeat_penalty: The penalty to apply to repeated tokens.\n            top_k: The top-k value to use for sampling. Top-K sampling described in academic paper \"The Curious Case of Neural Text Degeneration\" https://arxiv.org/abs/1904.09751\n            stream: Whether to stream the results.\n            seed: The seed to use for sampling.\n            tfs_z: The tail-free sampling parameter. Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.\n            mirostat_mode: The mirostat sampling mode.\n            mirostat_tau: The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.\n            mirostat_eta: The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. 
A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.\n            model: The name to use for the model in the completion object.\n            stopping_criteria: A list of stopping criteria to use.\n            logits_processor: A list of logits processors to use.\n            grammar: A grammar to use for constrained sampling.\n            logit_bias: A logit bias to use.\n\n        Raises:\n            ValueError: If the requested tokens exceed the context window.\n            RuntimeError: If the prompt fails to tokenize or the model fails to evaluate the prompt.\n\n        Returns:\n            Response object containing the generated text.\n        \"\"\"\n        completion_or_chunks = self._create_completion(\n            prompt=prompt,\n            suffix=suffix,\n            max_tokens=-1 if max_tokens is None else max_tokens,\n            temperature=temperature,\n            top_p=top_p,\n            min_p=min_p,\n            typical_p=typical_p,\n            logprobs=logprobs,\n            echo=echo,\n            stop=stop,\n            frequency_penalty=frequency_penalty,\n            presence_penalty=presence_penalty,\n            repeat_penalty=repeat_penalty,\n            top_k=top_k,\n            stream=stream,\n            seed=seed,\n            tfs_z=tfs_z,\n            mirostat_mode=mirostat_mode,\n            mirostat_tau=mirostat_tau,\n            mirostat_eta=mirostat_eta,\n            model=model,\n            stopping_criteria=stopping_criteria,\n            logits_processor=logits_processor,\n            grammar=grammar,\n            logit_bias=logit_bias,\n        )\n        if stream:\n            chunks: Iterator[CreateCompletionStreamResponse] = completion_or_chunks\n            return chunks\n        completion: Completion = next(completion_or_chunks)  # type: ignore\n        return completion\n\n    def __call__(\n        self,\n        prompt: str,\n        
suffix: Optional[str] = None,\n        max_tokens: Optional[int] = 16,\n        temperature: float = 0.8,\n        top_p: float = 0.95,\n        min_p: float = 0.05,\n        typical_p: float = 1.0,\n        logprobs: Optional[int] = None,\n        echo: bool = False,\n        stop: Optional[Union[str, List[str]]] = [],\n        frequency_penalty: float = 0.0,\n        presence_penalty: float = 0.0,\n        repeat_penalty: float = 1.0,\n        top_k: int = 40,\n        stream: bool = False,\n        seed: Optional[int] = None,\n        tfs_z: float = 1.0,\n        mirostat_mode: int = 0,\n        mirostat_tau: float = 5.0,\n        mirostat_eta: float = 0.1,\n        model: Optional[str] = None,\n        stopping_criteria: Optional[StoppingCriteriaList] = None,\n        logits_processor: Optional[LogitsProcessorList] = None,\n        grammar: Optional[LlamaGrammar] = None,\n        logit_bias: Optional[Dict[int, float]] = None,\n    ) -> Union[CreateCompletionResponse, Iterator[CreateCompletionStreamResponse]]:\n        \"\"\"Generate text from a prompt.\n\n        Args:\n            prompt: The prompt to generate text from.\n            suffix: A suffix to append to the generated text. If None, no suffix is appended.\n            max_tokens: The maximum number of tokens to generate. If max_tokens <= 0 or None, the maximum number of tokens to generate is unlimited and depends on n_ctx.\n            temperature: The temperature to use for sampling.\n            top_p: The top-p value to use for nucleus sampling. Nucleus sampling described in academic paper \"The Curious Case of Neural Text Degeneration\" https://arxiv.org/abs/1904.09751\n            min_p: The min-p value to use for minimum p sampling. Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841\n            typical_p: The typical-p value to use for sampling. 
Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.\n            logprobs: The number of logprobs to return. If None, no logprobs are returned.\n            echo: Whether to echo the prompt.\n            stop: A list of strings to stop generation when encountered.\n            frequency_penalty: The penalty to apply to tokens based on their frequency in the prompt.\n            presence_penalty: The penalty to apply to tokens based on their presence in the prompt.\n            repeat_penalty: The penalty to apply to repeated tokens.\n            top_k: The top-k value to use for sampling. Top-K sampling described in academic paper \"The Curious Case of Neural Text Degeneration\" https://arxiv.org/abs/1904.09751\n            stream: Whether to stream the results.\n            seed: The seed to use for sampling.\n            tfs_z: The tail-free sampling parameter. Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.\n            mirostat_mode: The mirostat sampling mode.\n            mirostat_tau: The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.\n            mirostat_eta: The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. 
A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.\n            model: The name to use for the model in the completion object.\n            stopping_criteria: A list of stopping criteria to use.\n            logits_processor: A list of logits processors to use.\n            grammar: A grammar to use for constrained sampling.\n            logit_bias: A logit bias to use.\n\n        Raises:\n            ValueError: If the requested tokens exceed the context window.\n            RuntimeError: If the prompt fails to tokenize or the model fails to evaluate the prompt.\n\n        Returns:\n            Response object containing the generated text.\n        \"\"\"\n        return self.create_completion(\n            prompt=prompt,\n            suffix=suffix,\n            max_tokens=max_tokens,\n            temperature=temperature,\n            top_p=top_p,\n            min_p=min_p,\n            typical_p=typical_p,\n            logprobs=logprobs,\n            echo=echo,\n            stop=stop,\n            frequency_penalty=frequency_penalty,\n            presence_penalty=presence_penalty,\n            repeat_penalty=repeat_penalty,\n            top_k=top_k,\n            stream=stream,\n            seed=seed,\n            tfs_z=tfs_z,\n            mirostat_mode=mirostat_mode,\n            mirostat_tau=mirostat_tau,\n            mirostat_eta=mirostat_eta,\n            model=model,\n            stopping_criteria=stopping_criteria,\n            logits_processor=logits_processor,\n            grammar=grammar,\n            logit_bias=logit_bias,\n        )\n\n    def create_chat_completion(\n        self,\n        messages: List[ChatCompletionRequestMessage],\n        functions: Optional[List[ChatCompletionFunction]] = None,\n        function_call: Optional[ChatCompletionRequestFunctionCall] = None,\n        tools: Optional[List[ChatCompletionTool]] = None,\n        tool_choice: 
Optional[ChatCompletionToolChoiceOption] = None,\n        temperature: float = 0.2,\n        top_p: float = 0.95,\n        top_k: int = 40,\n        min_p: float = 0.05,\n        typical_p: float = 1.0,\n        stream: bool = False,\n        stop: Optional[Union[str, List[str]]] = [],\n        seed: Optional[int] = None,\n        response_format: Optional[ChatCompletionRequestResponseFormat] = None,\n        max_tokens: Optional[int] = None,\n        presence_penalty: float = 0.0,\n        frequency_penalty: float = 0.0,\n        repeat_penalty: float = 1.0,\n        tfs_z: float = 1.0,\n        mirostat_mode: int = 0,\n        mirostat_tau: float = 5.0,\n        mirostat_eta: float = 0.1,\n        model: Optional[str] = None,\n        logits_processor: Optional[LogitsProcessorList] = None,\n        grammar: Optional[LlamaGrammar] = None,\n        logit_bias: Optional[Dict[int, float]] = None,\n        logprobs: Optional[bool] = None,\n        top_logprobs: Optional[int] = None,\n    ) -> Union[\n        CreateChatCompletionResponse, Iterator[CreateChatCompletionStreamResponse]\n    ]:\n        \"\"\"Generate a chat completion from a list of messages.\n\n        Args:\n            messages: A list of messages to generate a response for.\n            functions: A list of functions to use for the chat completion.\n            function_call: A function call to use for the chat completion.\n            tools: A list of tools to use for the chat completion.\n            tool_choice: A tool choice to use for the chat completion.\n            temperature: The temperature to use for sampling.\n            top_p: The top-p value to use for nucleus sampling. Nucleus sampling described in academic paper \"The Curious Case of Neural Text Degeneration\" https://arxiv.org/abs/1904.09751\n            top_k: The top-k value to use for sampling. 
Top-K sampling described in academic paper \"The Curious Case of Neural Text Degeneration\" https://arxiv.org/abs/1904.09751\n            min_p: The min-p value to use for minimum p sampling. Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841\n            typical_p: The typical-p value to use for sampling. Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.\n            stream: Whether to stream the results.\n            stop: A list of strings to stop generation when encountered.\n            seed: The seed to use for sampling.\n            response_format: The response format to use for the chat completion. Use { \"type\": \"json_object\" } to constrain output to only valid JSON.\n            max_tokens: The maximum number of tokens to generate. If max_tokens <= 0 or None, the maximum number of tokens to generate is unlimited and depends on n_ctx.\n            presence_penalty: The penalty to apply to tokens based on their presence in the prompt.\n            frequency_penalty: The penalty to apply to tokens based on their frequency in the prompt.\n            repeat_penalty: The penalty to apply to repeated tokens.\n            tfs_z: The tail-free sampling parameter.\n            mirostat_mode: The mirostat sampling mode.\n            mirostat_tau: The mirostat sampling tau parameter.\n            mirostat_eta: The mirostat sampling eta parameter.\n            model: The name to use for the model in the completion object.\n            logits_processor: A list of logits processors to use.\n            grammar: A grammar to use.\n            logit_bias: A logit bias to use.\n\n        Returns:\n            Generated chat completion or a stream of chat completion chunks.\n        \"\"\"\n        handler = (\n            self.chat_handler\n            or self._chat_handlers.get(self.chat_format)\n            or llama_chat_format.get_chat_completion_handler(self.chat_format)\n        )\n  
      return handler(\n            llama=self,\n            messages=messages,\n            functions=functions,\n            function_call=function_call,\n            tools=tools,\n            tool_choice=tool_choice,\n            temperature=temperature,\n            top_p=top_p,\n            top_k=top_k,\n            min_p=min_p,\n            typical_p=typical_p,\n            logprobs=logprobs,\n            top_logprobs=top_logprobs,\n            stream=stream,\n            stop=stop,\n            seed=seed,\n            response_format=response_format,\n            max_tokens=max_tokens,\n            presence_penalty=presence_penalty,\n            frequency_penalty=frequency_penalty,\n            repeat_penalty=repeat_penalty,\n            tfs_z=tfs_z,\n            mirostat_mode=mirostat_mode,\n            mirostat_tau=mirostat_tau,\n            mirostat_eta=mirostat_eta,\n            model=model,\n            logits_processor=logits_processor,\n            grammar=grammar,\n            logit_bias=logit_bias,\n        )\n\n    def create_chat_completion_openai_v1(\n        self,\n        *args: Any,\n        **kwargs: Any,\n    ):\n        \"\"\"Generate a chat completion with return type based on the OpenAI v1 API.\n\n        OpenAI python package is required to use this method.\n\n        You can install it with `pip install openai`.\n\n        Args:\n            *args: Positional arguments to pass to create_chat_completion.\n            **kwargs: Keyword arguments to pass to create_chat_completion.\n\n        Returns:\n            Generated chat completion or a stream of chat completion chunks.\n        \"\"\"\n        try:\n            from openai.types.chat import ChatCompletion, ChatCompletionChunk\n\n            stream = kwargs.get(\"stream\", False)  # type: ignore\n            assert isinstance(stream, bool)\n            if stream:\n                return (ChatCompletionChunk(**chunk) for chunk in self.create_chat_completion(*args, **kwargs))  # 
type: ignore\n            else:\n                return ChatCompletion(**self.create_chat_completion(*args, **kwargs))  # type: ignore\n        except ImportError:\n            raise ImportError(\n                \"To use create_chat_completion_openai_v1, you must install the openai package. \"\n                \"You can install it with `pip install openai`.\"\n            )\n\n    def __getstate__(self):\n        return dict(\n            model_path=self.model_path,\n            # Model Params\n            n_gpu_layers=self.model_params.n_gpu_layers,\n            split_mode=self.model_params.split_mode,\n            main_gpu=self.model_params.main_gpu,\n            tensor_split=self.tensor_split,\n            vocab_only=self.model_params.vocab_only,\n            use_mmap=self.model_params.use_mmap,\n            use_mlock=self.model_params.use_mlock,\n            kv_overrides=self.kv_overrides,\n            # Context Params\n            seed=self._seed,\n            n_ctx=self.context_params.n_ctx,\n            n_batch=self.n_batch,\n            n_ubatch=self.context_params.n_ubatch,\n            n_threads=self.context_params.n_threads,\n            n_threads_batch=self.context_params.n_threads_batch,\n            rope_scaling_type=self.context_params.rope_scaling_type,\n            pooling_type=self.context_params.pooling_type,\n            rope_freq_base=self.context_params.rope_freq_base,\n            rope_freq_scale=self.context_params.rope_freq_scale,\n            yarn_ext_factor=self.context_params.yarn_ext_factor,\n            yarn_attn_factor=self.context_params.yarn_attn_factor,\n            yarn_beta_fast=self.context_params.yarn_beta_fast,\n            yarn_beta_slow=self.context_params.yarn_beta_slow,\n            yarn_orig_ctx=self.context_params.yarn_orig_ctx,\n            logits_all=self._logits_all,\n            embedding=self.context_params.embeddings,\n            offload_kqv=self.context_params.offload_kqv,\n            
flash_attn=self.context_params.flash_attn,\n            op_offload=self.context_params.op_offload,\n            swa_full=self.context_params.swa_full,\n            # Sampling Params\n            no_perf=self.context_params.no_perf,\n            last_n_tokens_size=self.last_n_tokens_size,\n            # LoRA Params\n            lora_base=self.lora_base,\n            lora_scale=self.lora_scale,\n            lora_path=self.lora_path,\n            # Backend Params\n            numa=self.numa,\n            # Chat Format Params\n            chat_format=self.chat_format,\n            chat_handler=self.chat_handler,\n            # Speculative Decoding\n            draft_model=self.draft_model,\n            # KV cache quantization\n            type_k=self.context_params.type_k,\n            type_v=self.context_params.type_v,\n            # Misc\n            spm_infill=self.spm_infill,\n            verbose=self.verbose,\n        )\n\n    def __setstate__(self, state):\n        self.__init__(**state)\n\n    def save_state(self) -> LlamaState:\n        if self.verbose:\n            print(\"Llama.save_state: saving llama state\", file=sys.stderr)\n        state_size = llama_cpp.llama_get_state_size(self._ctx.ctx)\n        if self.verbose:\n            print(f\"Llama.save_state: got state size: {state_size}\", file=sys.stderr)\n        llama_state = (ctypes.c_uint8 * int(state_size))()\n        if self.verbose:\n            print(\"Llama.save_state: allocated state\", file=sys.stderr)\n        n_bytes = llama_cpp.llama_copy_state_data(self._ctx.ctx, llama_state)\n        if self.verbose:\n            print(f\"Llama.save_state: copied llama state: {n_bytes}\", file=sys.stderr)\n        if int(n_bytes) > int(state_size):\n            raise RuntimeError(\"Failed to copy llama state data\")\n        llama_state_compact = (ctypes.c_uint8 * int(n_bytes))()\n        llama_cpp.ctypes.memmove(llama_state_compact, llama_state, int(n_bytes))\n        if self.verbose:\n            print(\n   
             f\"Llama.save_state: saving {n_bytes} bytes of llama state\",\n                file=sys.stderr,\n            )\n        return LlamaState(\n            scores=self._scores.copy(),\n            input_ids=self.input_ids.copy(),\n            n_tokens=self.n_tokens,\n            llama_state=bytes(llama_state_compact),\n            llama_state_size=n_bytes,\n            seed=self._seed,\n        )\n\n    def load_state(self, state: LlamaState) -> None:\n        # Only filling in up to `n_tokens` and then zero-ing out the rest\n        self.scores[: state.n_tokens, :] = state.scores.copy()\n        rest = self.scores[state.n_tokens :, :]\n        rest[rest > 0] = 0.0\n        self.input_ids = state.input_ids.copy()\n        self.n_tokens = state.n_tokens\n        self._seed = state.seed\n        state_size = state.llama_state_size\n        LLamaStateArrayType = ctypes.c_uint8 * state_size\n        llama_state = LLamaStateArrayType.from_buffer_copy(state.llama_state)\n\n        if llama_cpp.llama_set_state_data(self._ctx.ctx, llama_state) != state_size:\n            raise RuntimeError(\"Failed to set llama state data\")\n\n    def n_ctx(self) -> int:\n        \"\"\"Return the context window size.\"\"\"\n        return self._ctx.n_ctx()\n\n    def n_embd(self) -> int:\n        \"\"\"Return the embedding size.\"\"\"\n        return self._model.n_embd()\n\n    def n_vocab(self) -> int:\n        \"\"\"Return the vocabulary size.\"\"\"\n        return self._model.n_vocab()\n\n    def tokenizer(self) -> LlamaTokenizer:\n        \"\"\"Return the llama tokenizer for this model.\"\"\"\n        return LlamaTokenizer(self)\n\n    def token_eos(self) -> int:\n        \"\"\"Return the end-of-sequence token.\"\"\"\n        return self._model.token_eos()\n\n    def token_bos(self) -> int:\n        \"\"\"Return the beginning-of-sequence token.\"\"\"\n        return self._model.token_bos()\n\n    def token_nl(self) -> int:\n        \"\"\"Return the newline token.\"\"\"\n      
  return self._model.token_nl()\n\n    def pooling_type(self) -> str:\n        \"\"\"Return the pooling type.\"\"\"\n        return self._ctx.pooling_type()\n\n    def close(self) -> None:\n        \"\"\"Explicitly free the model from memory.\"\"\"\n        self._stack.close()\n\n    def __del__(self) -> None:\n        self.close()\n\n    @staticmethod\n    def logits_to_logprobs(\n        logits: Union[npt.NDArray[np.single], List], axis: int = -1\n    ) -> npt.NDArray[np.single]:\n        # https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.log_softmax.html\n        logits_maxs: np.ndarray = np.amax(logits, axis=axis, keepdims=True)\n        if logits_maxs.ndim > 0:\n            logits_maxs[~np.isfinite(logits_maxs)] = 0\n        elif not np.isfinite(logits_maxs):\n            logits_maxs = 0\n        subtract_maxs = np.subtract(logits, logits_maxs, dtype=np.single)\n        exp = np.exp(subtract_maxs)\n        # Suppress warnings about log of zero\n        with np.errstate(divide=\"ignore\"):\n            summed = np.sum(exp, axis=axis, keepdims=True)\n            out = np.log(summed)\n        return subtract_maxs - out\n\n    @staticmethod\n    def longest_token_prefix(a: Sequence[int], b: Sequence[int]):\n        longest_prefix = 0\n        for _a, _b in zip(a, b):\n            if _a == _b:\n                longest_prefix += 1\n            else:\n                break\n        return longest_prefix\n\n    @classmethod\n    def from_pretrained(\n        cls,\n        repo_id: str,\n        filename: Optional[str],\n        additional_files: Optional[List] = None,\n        local_dir: Optional[Union[str, os.PathLike[str]]] = None,\n        local_dir_use_symlinks: Union[bool, Literal[\"auto\"]] = \"auto\",\n        cache_dir: Optional[Union[str, os.PathLike[str]]] = None,\n        **kwargs: Any,\n    ) -> \"Llama\":\n        \"\"\"Create a Llama model from a pretrained model name or path.\n        This method requires the huggingface-hub package.\n 
       You can install it with `pip install huggingface-hub`.\n\n        Args:\n            repo_id: The model repo id.\n            filename: A filename or glob pattern to match the model file in the repo.\n            additional_files: A list of filenames or glob patterns to match additional model files in the repo.\n            local_dir: The local directory to save the model to.\n            local_dir_use_symlinks: Whether to use symlinks when downloading the model.\n            cache_dir: The directory in which to cache downloaded files.\n            **kwargs: Additional keyword arguments to pass to the Llama constructor.\n\n        Returns:\n            A Llama model.\"\"\"\n        try:\n            from huggingface_hub import hf_hub_download, HfFileSystem\n            from huggingface_hub.utils import validate_repo_id\n        except ImportError:\n            raise ImportError(\n                \"Llama.from_pretrained requires the huggingface-hub package. \"\n                \"You can install it with `pip install huggingface-hub`.\"\n            )\n\n        validate_repo_id(repo_id)\n\n        hffs = HfFileSystem()\n\n        files = [\n            file[\"name\"] if isinstance(file, dict) else file\n            for file in hffs.ls(repo_id, recursive=True)\n        ]\n\n        # make each file path relative to the repo root\n        file_list: List[str] = []\n        for file in files:\n            rel_path = Path(file).relative_to(repo_id)\n            file_list.append(str(rel_path))\n\n        # find the only/first shard file:\n        matching_files = [file for file in file_list if fnmatch.fnmatch(file, filename)]  # type: ignore\n\n        if len(matching_files) == 0:\n            raise ValueError(\n                f\"No file found in {repo_id} that matches {filename}\\n\\n\"\n                f\"Available Files:\\n{json.dumps(file_list)}\"\n            )\n\n        if len(matching_files) > 1:\n            raise ValueError(\n                f\"Multiple files found in {repo_id} matching {filename}\\n\\n\"\n                
f\"Available Files:\\n{json.dumps(files)}\"\n            )\n\n        (matching_file,) = matching_files\n\n        subfolder = str(Path(matching_file).parent)\n        filename = Path(matching_file).name\n\n        # download the file\n        hf_hub_download(\n            repo_id=repo_id,\n            filename=filename,\n            subfolder=subfolder,\n            local_dir=local_dir,\n            local_dir_use_symlinks=local_dir_use_symlinks,\n            cache_dir=cache_dir,\n        )\n\n        if additional_files:\n            for additonal_file_name in additional_files:\n                # find the additional shard file:\n                matching_additional_files = [file for file in file_list if fnmatch.fnmatch(file, additonal_file_name)]\n\n                if len(matching_additional_files) == 0:\n                    raise ValueError(\n                        f\"No file found in {repo_id} that match {additonal_file_name}\\n\\n\"\n                        f\"Available Files:\\n{json.dumps(file_list)}\"\n                    )\n\n                if len(matching_additional_files) > 1:\n                    raise ValueError(\n                        f\"Multiple files found in {repo_id} matching {additonal_file_name}\\n\\n\"\n                        f\"Available Files:\\n{json.dumps(files)}\"\n                    )\n\n                (matching_additional_file,) = matching_additional_files\n\n                # download the additional file\n                hf_hub_download(\n                    repo_id=repo_id,\n                    filename=matching_additional_file,\n                    subfolder=subfolder,\n                    local_dir=local_dir,\n                    local_dir_use_symlinks=local_dir_use_symlinks,\n                    cache_dir=cache_dir,\n                )\n\n        if local_dir is None:\n            model_path = hf_hub_download(\n                repo_id=repo_id,\n                filename=filename,\n                subfolder=subfolder,\n            
    local_dir=local_dir,\n                local_dir_use_symlinks=local_dir_use_symlinks,\n                cache_dir=cache_dir,\n                local_files_only=True,\n            )\n        else:\n            model_path = os.path.join(local_dir, filename)\n\n        # loading the first file of a sharded GGUF loads all remaining shard files in the subfolder\n        return cls(\n            model_path=model_path,\n            **kwargs,\n        )\n\n\nclass LlamaState:\n    def __init__(\n        self,\n        input_ids: npt.NDArray[np.intc],\n        scores: npt.NDArray[np.single],\n        n_tokens: int,\n        llama_state: bytes,\n        llama_state_size: int,\n        seed: int,\n    ):\n        self.input_ids = input_ids\n        self.scores = scores\n        self.n_tokens = n_tokens\n        self.llama_state = llama_state\n        self.llama_state_size = llama_state_size\n        self.seed = seed\n\n\nLogitsProcessor = Callable[\n    [npt.NDArray[np.intc], npt.NDArray[np.single]], npt.NDArray[np.single]\n]\n\n\nclass LogitsProcessorList(List[LogitsProcessor]):\n    def __call__(\n        self, input_ids: npt.NDArray[np.intc], scores: npt.NDArray[np.single]\n    ) -> npt.NDArray[np.single]:\n        for processor in self:\n            scores = processor(input_ids, scores)\n        return scores\n\n\nStoppingCriteria = Callable[[npt.NDArray[np.intc], npt.NDArray[np.single]], bool]\n\n\nclass StoppingCriteriaList(List[StoppingCriteria]):\n    def __call__(\n        self, input_ids: npt.NDArray[np.intc], logits: npt.NDArray[np.single]\n    ) -> bool:\n        return any([stopping_criteria(input_ids, logits) for stopping_criteria in self])\n\n\nclass MinTokensLogitsProcessor(LogitsProcessor):\n    def __init__(self, min_tokens: int, token_eos: int):\n        self.min_tokens = min_tokens\n        self.token_eos = token_eos\n        self.prompt_tokens = None\n\n    def __call__(\n        self, input_ids: npt.NDArray[np.intc], scores: npt.NDArray[np.single]\n    
) -> npt.NDArray[np.single]:\n        if self.prompt_tokens is None:\n            self.prompt_tokens = len(input_ids)\n        if len(input_ids) - self.prompt_tokens < self.min_tokens:\n            scores[self.token_eos] = -np.inf\n        return scores\n"
  },
  {
    "path": "llama_cpp/llama_cache.py",
    "content": "import sys\nfrom abc import ABC, abstractmethod\nfrom typing import (\n    Optional,\n    Sequence,\n    Tuple,\n)\nfrom collections import OrderedDict\n\nimport diskcache\n\nimport llama_cpp.llama\n\nfrom .llama_types import *\n\n\nclass BaseLlamaCache(ABC):\n    \"\"\"Base cache class for a llama.cpp model.\"\"\"\n\n    def __init__(self, capacity_bytes: int = (2 << 30)):\n        self.capacity_bytes = capacity_bytes\n\n    @property\n    @abstractmethod\n    def cache_size(self) -> int:\n        raise NotImplementedError\n\n    def _find_longest_prefix_key(\n        self,\n        key: Tuple[int, ...],\n    ) -> Optional[Tuple[int, ...]]:\n        pass\n\n    @abstractmethod\n    def __getitem__(self, key: Sequence[int]) -> \"llama_cpp.llama.LlamaState\":\n        raise NotImplementedError\n\n    @abstractmethod\n    def __contains__(self, key: Sequence[int]) -> bool:\n        raise NotImplementedError\n\n    @abstractmethod\n    def __setitem__(\n        self, key: Sequence[int], value: \"llama_cpp.llama.LlamaState\"\n    ) -> None:\n        raise NotImplementedError\n\n\nclass LlamaRAMCache(BaseLlamaCache):\n    \"\"\"Cache for a llama.cpp model using RAM.\"\"\"\n\n    def __init__(self, capacity_bytes: int = (2 << 30)):\n        super().__init__(capacity_bytes)\n        self.capacity_bytes = capacity_bytes\n        self.cache_state: OrderedDict[\n            Tuple[int, ...], \"llama_cpp.llama.LlamaState\"\n        ] = OrderedDict()\n\n    @property\n    def cache_size(self):\n        return sum([state.llama_state_size for state in self.cache_state.values()])\n\n    def _find_longest_prefix_key(\n        self,\n        key: Tuple[int, ...],\n    ) -> Optional[Tuple[int, ...]]:\n        min_len = 0\n        min_key = None\n        keys = (\n            (k, llama_cpp.llama.Llama.longest_token_prefix(k, key))\n            for k in self.cache_state.keys()\n        )\n        for k, prefix_len in keys:\n            if prefix_len > min_len:\n         
       min_len = prefix_len\n                min_key = k\n        return min_key\n\n    def __getitem__(self, key: Sequence[int]) -> \"llama_cpp.llama.LlamaState\":\n        key = tuple(key)\n        _key = self._find_longest_prefix_key(key)\n        if _key is None:\n            raise KeyError(\"Key not found\")\n        value = self.cache_state[_key]\n        self.cache_state.move_to_end(_key)\n        return value\n\n    def __contains__(self, key: Sequence[int]) -> bool:\n        return self._find_longest_prefix_key(tuple(key)) is not None\n\n    def __setitem__(self, key: Sequence[int], value: \"llama_cpp.llama.LlamaState\"):\n        key = tuple(key)\n        if key in self.cache_state:\n            del self.cache_state[key]\n        self.cache_state[key] = value\n        while self.cache_size > self.capacity_bytes and len(self.cache_state) > 0:\n            self.cache_state.popitem(last=False)\n\n\n# Alias for backwards compatibility\nLlamaCache = LlamaRAMCache\n\n\nclass LlamaDiskCache(BaseLlamaCache):\n    \"\"\"Cache for a llama.cpp model using disk.\"\"\"\n\n    def __init__(\n        self, cache_dir: str = \".cache/llama_cache\", capacity_bytes: int = (2 << 30)\n    ):\n        super().__init__(capacity_bytes)\n        self.cache = diskcache.Cache(cache_dir)\n\n    @property\n    def cache_size(self):\n        return int(self.cache.volume())  # type: ignore\n\n    def _find_longest_prefix_key(\n        self,\n        key: Tuple[int, ...],\n    ) -> Optional[Tuple[int, ...]]:\n        min_len = 0\n        min_key: Optional[Tuple[int, ...]] = None\n        for k in self.cache.iterkeys():  # type: ignore\n            prefix_len = llama_cpp.llama.Llama.longest_token_prefix(k, key)\n            if prefix_len > min_len:\n                min_len = prefix_len\n                min_key = k  # type: ignore\n        return min_key\n\n    def __getitem__(self, key: Sequence[int]) -> \"llama_cpp.llama.LlamaState\":\n        key = tuple(key)\n        _key = 
self._find_longest_prefix_key(key)\n        if _key is None:\n            raise KeyError(\"Key not found\")\n        value: \"llama_cpp.llama.LlamaState\" = self.cache.pop(_key)  # type: ignore\n        # NOTE: This puts an integer as key in cache, which breaks,\n        # Llama.longest_token_prefix(k, key) above since k is not a tuple of ints/tokens\n        # self.cache.push(_key, side=\"front\")  # type: ignore\n        return value\n\n    def __contains__(self, key: Sequence[int]) -> bool:\n        return self._find_longest_prefix_key(tuple(key)) is not None\n\n    def __setitem__(self, key: Sequence[int], value: \"llama_cpp.llama.LlamaState\"):\n        print(\"LlamaDiskCache.__setitem__: called\", file=sys.stderr)\n        key = tuple(key)\n        if key in self.cache:\n            print(\"LlamaDiskCache.__setitem__: delete\", file=sys.stderr)\n            del self.cache[key]\n        self.cache[key] = value\n        print(\"LlamaDiskCache.__setitem__: set\", file=sys.stderr)\n        while self.cache_size > self.capacity_bytes and len(self.cache) > 0:\n            key_to_remove = next(iter(self.cache))\n            del self.cache[key_to_remove]\n        print(\"LlamaDiskCache.__setitem__: trim\", file=sys.stderr)\n"
  },
  {
    "path": "llama_cpp/llama_chat_format.py",
    "content": "from __future__ import annotations\n\nimport os\nimport sys\nimport json\nimport ctypes\nimport dataclasses\nimport random\nimport string\n\nfrom datetime import datetime\nfrom contextlib import ExitStack\nfrom typing import (\n    Any,\n    Dict,\n    Iterator,\n    List,\n    Literal,\n    Optional,\n    Tuple,\n    Union,\n    Protocol,\n    cast,\n)\n\nimport jinja2\nfrom jinja2.sandbox import ImmutableSandboxedEnvironment\n\nimport numpy as np\nimport numpy.typing as npt\n\nimport llama_cpp.llama_cpp as llama_cpp\nimport llama_cpp.llama as llama\nimport llama_cpp.llama_types as llama_types\nimport llama_cpp.llama_grammar as llama_grammar\n\nfrom ._logger import logger\nfrom ._utils import suppress_stdout_stderr, Singleton\n\n### Common Chat Templates and Special Tokens ###\n\n# Source: https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B/blob/main/tokenizer_config.json\nCHATML_CHAT_TEMPLATE = \"{% for message in messages %}{{'<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\\n' }}{% endif %}\"\nCHATML_BOS_TOKEN = \"<s>\"\nCHATML_EOS_TOKEN = \"<|im_end|>\"\n\n# Source: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/blob/main/tokenizer_config.json\nMISTRAL_INSTRUCT_CHAT_TEMPLATE = \"{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token + ' ' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}\"\nMISTRAL_INSTRUCT_BOS_TOKEN = \"<s>\"\nMISTRAL_INSTRUCT_EOS_TOKEN = \"</s>\"\n\n# Source: 
https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/blob/main/tokenizer_config.json\nMIXTRAL_INSTRUCT_CHAT_TEMPLATE = \"{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}\"\n\n# Source: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json\nLLAMA3_INSTRUCT_CHAT_TEMPLATE = \"{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}{% endif %}\"\n\n### Chat Completion Handler ###\n\n\nclass LlamaChatCompletionHandler(Protocol):\n    \"\"\"Base Protocol for a llama chat completion handler.\n\n    Very generic protocol that can be used to implement any chat format.\n    The only hard requirement is that it must return a ChatCompletion when\n    stream=False and an iterator of ChatCompletionChunks when stream=True.\"\"\"\n\n    def __call__(\n        self,\n        *,\n        # llama.cpp instance\n        llama: llama.Llama,\n        # openai api parameters\n        messages: List[llama_types.ChatCompletionRequestMessage],\n        functions: Optional[List[llama_types.ChatCompletionFunction]] = None,\n        function_call: Optional[llama_types.ChatCompletionRequestFunctionCall] = None,\n        tools: Optional[List[llama_types.ChatCompletionTool]] = None,\n        tool_choice: 
Optional[llama_types.ChatCompletionToolChoiceOption] = None,\n        temperature: float = 0.2,\n        top_p: float = 0.95,\n        top_k: int = 40,\n        stream: bool = False,\n        stop: Optional[Union[str, List[str]]] = [],\n        seed: Optional[int] = None,\n        response_format: Optional[\n            llama_types.ChatCompletionRequestResponseFormat\n        ] = None,\n        max_tokens: Optional[int] = None,\n        presence_penalty: float = 0.0,\n        frequency_penalty: float = 0.0,\n        repeat_penalty: float = 1.1,\n        model: Optional[str] = None,\n        logit_bias: Optional[Dict[str, float]] = None,\n        # llama.cpp parameters\n        min_p: float = 0.05,\n        typical_p: float = 1.0,\n        tfs_z: float = 1.0,\n        mirostat_mode: int = 0,\n        mirostat_tau: float = 5.0,\n        mirostat_eta: float = 0.1,\n        logits_processor: Optional[llama.LogitsProcessorList] = None,\n        grammar: Optional[llama.LlamaGrammar] = None,\n        logprobs: Optional[bool] = None,\n        top_logprobs: Optional[int] = None,\n        **kwargs,  # type: ignore\n    ) -> Union[\n        llama_types.CreateChatCompletionResponse,\n        Iterator[llama_types.CreateChatCompletionStreamResponse],\n    ]: ...\n\n\nclass LlamaChatCompletionHandlerNotFoundException(Exception):\n    pass\n\n\nclass LlamaChatCompletionHandlerRegistry(Singleton):\n    _chat_handlers: Dict[str, LlamaChatCompletionHandler] = {}\n\n    def register_chat_completion_handler(\n        self,\n        name: str,\n        chat_handler: LlamaChatCompletionHandler,\n        overwrite: bool = False,\n    ):\n        if not overwrite and name in self._chat_handlers:\n            raise ValueError(\n                f\"Formatter with name '{name}' is already registered. 
Use `overwrite=True` to overwrite it.\"\n            )\n        self._chat_handlers[name] = chat_handler\n\n    def unregister_chat_handler(self, name: str):\n        if name in self._chat_handlers:\n            del self._chat_handlers[name]\n        else:\n            raise ValueError(f\"No formatter registered under the name '{name}'.\")\n\n    def get_chat_completion_handler_by_name(\n        self, name: str\n    ) -> LlamaChatCompletionHandler:\n        try:\n            chat_handler = self._chat_handlers[name]\n            return chat_handler\n        except KeyError:\n            raise LlamaChatCompletionHandlerNotFoundException(\n                f\"Invalid chat handler: {name} (valid formats: {list(self._chat_handlers.keys())})\"\n            )\n\n\ndef get_chat_completion_handler(name: str) -> LlamaChatCompletionHandler:\n    return LlamaChatCompletionHandlerRegistry().get_chat_completion_handler_by_name(\n        name\n    )\n\n\ndef register_chat_completion_handler(name: str):\n    def decorator(f: LlamaChatCompletionHandler):\n        LlamaChatCompletionHandlerRegistry().register_chat_completion_handler(name, f)\n        return f\n\n    return decorator\n\n\n### Chat Formatter ###\n\n\n@dataclasses.dataclass\nclass ChatFormatterResponse:\n    \"\"\"Dataclass that stores completion parameters for a given chat format and\n    create_chat_completion request.\n\n    prompt contains the formatted prompt generated from the chat format and messages.\n    stop contains the stop token or list of stop tokens to use for the chat format.\"\"\"\n\n    prompt: str\n    stop: Optional[Union[str, List[str]]] = None\n    stopping_criteria: Optional[llama.StoppingCriteriaList] = None\n    added_special: bool = False\n\n\nclass ChatFormatter(Protocol):\n    \"\"\"Base Protocol for a chat formatter. A chat formatter is a function that\n    takes a list of messages and returns a chat format response which can be used\n    to generate a completion. 
The response can also include a stop token or list\n    of stop tokens to use for the completion.\"\"\"\n\n    def __call__(\n        self,\n        *,\n        messages: List[llama_types.ChatCompletionRequestMessage],\n        **kwargs: Any,\n    ) -> ChatFormatterResponse: ...\n\n\nclass Jinja2ChatFormatter(ChatFormatter):\n    def __init__(\n        self,\n        template: str,\n        eos_token: str,\n        bos_token: str,\n        add_generation_prompt: bool = True,\n        stop_token_ids: Optional[List[int]] = None,\n    ):\n        \"\"\"A chat formatter that uses jinja2 templates to format the prompt.\"\"\"\n        self.template = template\n        self.eos_token = eos_token\n        self.bos_token = bos_token\n        self.add_generation_prompt = add_generation_prompt\n        self.stop_token_ids = (\n            set(stop_token_ids) if stop_token_ids is not None else None\n        )\n\n        self._environment = ImmutableSandboxedEnvironment(\n            loader=jinja2.BaseLoader(),\n            trim_blocks=True,\n            lstrip_blocks=True,\n        ).from_string(self.template)\n\n    @staticmethod\n    def strftime_now(f: str) -> str:\n        return datetime.now().strftime(f)\n\n    def __call__(\n        self,\n        *,\n        messages: List[llama_types.ChatCompletionRequestMessage],\n        functions: Optional[List[llama_types.ChatCompletionFunction]] = None,\n        function_call: Optional[llama_types.ChatCompletionRequestFunctionCall] = None,\n        tools: Optional[List[llama_types.ChatCompletionTool]] = None,\n        tool_choice: Optional[llama_types.ChatCompletionToolChoiceOption] = None,\n        **kwargs: Any,\n    ) -> ChatFormatterResponse:\n        def raise_exception(message: str):\n            raise ValueError(message)\n\n        prompt = self._environment.render(\n            messages=messages,\n            eos_token=self.eos_token,\n            bos_token=self.bos_token,\n            raise_exception=raise_exception,\n   
         add_generation_prompt=self.add_generation_prompt,\n            functions=functions,\n            function_call=function_call,\n            tools=tools,\n            tool_choice=tool_choice,\n            strftime_now=self.strftime_now,\n        )\n\n        stopping_criteria = None\n        if self.stop_token_ids is not None:\n\n            def stop_on_last_token(\n                tokens: npt.NDArray[np.intc], logits: npt.NDArray[np.single]\n            ) -> bool:\n                return tokens[-1] in self.stop_token_ids\n\n            stopping_criteria = llama.StoppingCriteriaList([stop_on_last_token])\n\n        return ChatFormatterResponse(\n            prompt=prompt,\n            stop=[self.eos_token],\n            stopping_criteria=stopping_criteria,\n            added_special=True,\n        )\n\n    def to_chat_handler(self) -> LlamaChatCompletionHandler:\n        return chat_formatter_to_chat_completion_handler(self)\n\n\ndef _convert_text_completion_logprobs_to_chat(\n    logprobs: Optional[llama_types.CompletionLogprobs],\n) -> llama_types.ChatCompletionLogprobs:\n    if logprobs is None:\n        return None\n\n    return {\n        \"content\": [\n            {\n                \"token\": token,\n                \"bytes\": None,\n                \"logprob\": logprob,\n                \"top_logprobs\": [\n                    {\n                        \"token\": top_token,\n                        \"logprob\": top_logprob,\n                        \"bytes\": None,\n                    }\n                    for top_token, top_logprob in top_logprobs.items()\n                ],\n            } for (token, logprob, top_logprobs) in zip(logprobs[\"tokens\"], logprobs[\"token_logprobs\"], logprobs[\"top_logprobs\"])\n        ],\n        \"refusal\": None,\n    }\n\ndef _convert_text_completion_to_chat(\n    completion: llama_types.Completion,\n) -> llama_types.ChatCompletion:\n    assert \"usage\" in completion\n    return {\n        \"id\": \"chat\" + 
completion[\"id\"],\n        \"object\": \"chat.completion\",\n        \"created\": completion[\"created\"],\n        \"model\": completion[\"model\"],\n        \"choices\": [\n            {\n                \"index\": 0,\n                \"message\": {\n                    \"role\": \"assistant\",\n                    \"content\": completion[\"choices\"][0][\"text\"],\n                },\n                \"logprobs\": _convert_text_completion_logprobs_to_chat(completion[\"choices\"][0][\"logprobs\"]),\n                \"finish_reason\": completion[\"choices\"][0][\"finish_reason\"],\n            }\n        ],\n        \"usage\": completion[\"usage\"],\n    }\n\n\ndef _convert_text_completion_chunks_to_chat(\n    chunks: Iterator[llama_types.CreateCompletionStreamResponse],\n) -> Iterator[llama_types.ChatCompletionChunk]:\n    for i, chunk in enumerate(chunks):\n        if i == 0:\n            yield {\n                \"id\": \"chat\" + chunk[\"id\"],\n                \"model\": chunk[\"model\"],\n                \"created\": chunk[\"created\"],\n                \"object\": \"chat.completion.chunk\",\n                \"choices\": [\n                    {\n                        \"index\": 0,\n                        \"delta\": {\n                            \"role\": \"assistant\",\n                        },\n                        \"logprobs\": None,\n                        \"finish_reason\": None,\n                    }\n                ],\n            }\n        yield {\n            \"id\": \"chat\" + chunk[\"id\"],\n            \"model\": chunk[\"model\"],\n            \"created\": chunk[\"created\"],\n            \"object\": \"chat.completion.chunk\",\n            \"choices\": [\n                {\n                    \"index\": 0,\n                    \"delta\": (\n                        {\n                            \"content\": chunk[\"choices\"][0][\"text\"],\n                        }\n                        if 
chunk[\"choices\"][0][\"finish_reason\"] is None\n                        else {}\n                    ),\n                    \"logprobs\": _convert_text_completion_logprobs_to_chat(chunk[\"choices\"][0][\"logprobs\"]),\n                    \"finish_reason\": chunk[\"choices\"][0][\"finish_reason\"],\n                }\n            ],\n        }\n\n\ndef _convert_completion_to_chat(\n    completion_or_chunks: Union[\n        llama_types.CreateCompletionResponse,\n        Iterator[llama_types.CreateCompletionStreamResponse],\n    ],\n    stream: bool = False,\n) -> Union[\n    llama_types.CreateChatCompletionResponse, Iterator[llama_types.ChatCompletionChunk]\n]:\n    if stream:\n        chunks: Iterator[llama_types.CreateCompletionStreamResponse] = completion_or_chunks  # type: ignore\n        return _convert_text_completion_chunks_to_chat(chunks)\n    else:\n        completion: llama_types.Completion = completion_or_chunks  # type: ignore\n        return _convert_text_completion_to_chat(completion)\n\n\ndef _convert_completion_to_chat_function(\n    tool_name: str,\n    completion_or_chunks: Union[\n        llama_types.CreateCompletionResponse,\n        Iterator[llama_types.CreateCompletionStreamResponse],\n    ],\n    stream: bool,\n):\n    if not stream:\n        completion: llama_types.CreateCompletionResponse = completion_or_chunks  # type: ignore\n        assert \"usage\" in completion\n        tool_id = \"call_\" + \"_0_\" + tool_name + \"_\" + completion[\"id\"]\n        # TODO: Fix for legacy function calls\n        chat_completion: llama_types.CreateChatCompletionResponse = {\n            \"id\": \"chat\" + completion[\"id\"],\n            \"object\": \"chat.completion\",\n            \"created\": completion[\"created\"],\n            \"model\": completion[\"model\"],\n            \"choices\": [\n                {\n                    \"index\": 0,\n                    \"message\": {\n                        \"role\": \"assistant\",\n                     
   \"content\": None,\n                        \"function_call\": {\n                            \"name\": tool_name,\n                            \"arguments\": completion[\"choices\"][0][\"text\"],\n                        },\n                        \"tool_calls\": [\n                            {\n                                \"id\": tool_id,\n                                \"type\": \"function\",\n                                \"function\": {\n                                    \"name\": tool_name,\n                                    \"arguments\": completion[\"choices\"][0][\"text\"],\n                                },\n                            }\n                        ],\n                    },\n                    \"logprobs\": _convert_text_completion_logprobs_to_chat(completion[\"choices\"][0][\"logprobs\"]),\n                    \"finish_reason\": \"tool_calls\",\n                }\n            ],\n            \"usage\": completion[\"usage\"],\n        }\n        return chat_completion\n    else:\n        chunks: Iterator[llama_types.CreateCompletionStreamResponse] = completion_or_chunks  # type: ignore\n\n        def _stream_response_to_function_stream(\n            chunks: Iterator[llama_types.CreateCompletionStreamResponse],\n        ) -> Iterator[llama_types.CreateChatCompletionStreamResponse]:\n            # blank first message\n            first = True\n            id_ = None\n            created = None\n            model = None\n            tool_id = None\n            for chunk in chunks:\n                if first:\n                    id_ = \"chat\" + chunk[\"id\"]\n                    created = chunk[\"created\"]\n                    model = chunk[\"model\"]\n                    tool_id = \"call_\" + \"_0_\" + tool_name + \"_\" + chunk[\"id\"]\n                    yield {\n                        \"id\": id_,\n                        \"object\": \"chat.completion.chunk\",\n                        \"created\": created,\n             
           \"model\": model,\n                        \"choices\": [\n                            {\n                                \"index\": 0,\n                                \"finish_reason\": None,\n                                \"logprobs\": None,\n                                \"delta\": {\n                                    \"role\": \"assistant\",\n                                    \"content\": None,\n                                    \"function_call\": None,\n                                    \"tool_calls\": None,\n                                },\n                            }\n                        ],\n                    }\n                    yield {\n                        \"id\": \"chat\" + chunk[\"id\"],\n                        \"object\": \"chat.completion.chunk\",\n                        \"created\": chunk[\"created\"],\n                        \"model\": chunk[\"model\"],\n                        \"choices\": [\n                            {\n                                \"index\": 0,\n                                \"finish_reason\": None,\n                                \"logprobs\": _convert_text_completion_logprobs_to_chat(chunk[\"choices\"][0][\"logprobs\"]),\n                                \"delta\": {\n                                    \"role\": None,\n                                    \"content\": None,\n                                    \"function_call\": {\n                                        \"name\": tool_name,\n                                        \"arguments\": chunk[\"choices\"][0][\"text\"],\n                                    },\n                                    \"tool_calls\": [\n                                        {\n                                            \"index\": 0,\n                                            \"id\": tool_id,\n                                            \"type\": \"function\",\n                                            \"function\": {\n                  
                              \"name\": tool_name,\n                                                \"arguments\": chunk[\"choices\"][0][\n                                                    \"text\"\n                                                ],\n                                            },\n                                        }\n                                    ],\n                                },\n                            }\n                        ],\n                    }\n                    first = False\n                    continue\n                assert tool_id is not None\n                yield {\n                    \"id\": \"chat\" + chunk[\"id\"],\n                    \"object\": \"chat.completion.chunk\",\n                    \"created\": chunk[\"created\"],\n                    \"model\": chunk[\"model\"],\n                    \"choices\": [\n                        {\n                            \"index\": 0,\n                            \"finish_reason\": None,\n                            \"logprobs\": _convert_text_completion_logprobs_to_chat(chunk[\"choices\"][0][\"logprobs\"]),\n                            \"delta\": {\n                                \"role\": None,\n                                \"content\": None,\n                                \"function_call\": {\n                                    \"name\": tool_name,\n                                    \"arguments\": chunk[\"choices\"][0][\"text\"],\n                                },\n                                \"tool_calls\": [\n                                    {\n                                        \"index\": 0,\n                                        \"id\": tool_id,\n                                        \"type\": \"function\",\n                                        \"function\": {\n                                            \"name\": tool_name,\n                                            \"arguments\": chunk[\"choices\"][0][\"text\"],\n  
                                      },\n                                    }\n                                ],\n                            },\n                        }\n                    ],\n                }\n\n            if id_ is not None and created is not None and model is not None:\n                yield {\n                    \"id\": id_,\n                    \"object\": \"chat.completion.chunk\",\n                    \"created\": created,\n                    \"model\": model,\n                    \"choices\": [\n                        {\n                            \"index\": 0,\n                            \"finish_reason\": \"tool_calls\",\n                            \"logprobs\": None,\n                            \"delta\": {\n                                \"role\": None,\n                                \"content\": None,\n                                \"function_call\": None,\n                                \"tool_calls\": None,\n                            },\n                        }\n                    ],\n                }\n\n        return _stream_response_to_function_stream(chunks)\n\n\ndef chat_formatter_to_chat_completion_handler(\n    chat_formatter: ChatFormatter,\n) -> LlamaChatCompletionHandler:\n    def chat_completion_handler(\n        *,\n        llama: llama.Llama,\n        messages: List[llama_types.ChatCompletionRequestMessage],\n        functions: Optional[List[llama_types.ChatCompletionFunction]] = None,\n        function_call: Optional[llama_types.ChatCompletionRequestFunctionCall] = None,\n        tools: Optional[List[llama_types.ChatCompletionTool]] = None,\n        tool_choice: Optional[llama_types.ChatCompletionToolChoiceOption] = None,\n        temperature: float = 0.2,\n        top_p: float = 0.95,\n        top_k: int = 40,\n        min_p: float = 0.05,\n        typical_p: float = 1.0,\n        stream: bool = False,\n        stop: Optional[Union[str, List[str]]] = [],\n        seed: Optional[int] = 
None,\n        response_format: Optional[\n            llama_types.ChatCompletionRequestResponseFormat\n        ] = None,\n        max_tokens: Optional[int] = None,\n        presence_penalty: float = 0.0,\n        frequency_penalty: float = 0.0,\n        repeat_penalty: float = 1.1,\n        tfs_z: float = 1.0,\n        mirostat_mode: int = 0,\n        mirostat_tau: float = 5.0,\n        mirostat_eta: float = 0.1,\n        model: Optional[str] = None,\n        logits_processor: Optional[llama.LogitsProcessorList] = None,\n        grammar: Optional[llama.LlamaGrammar] = None,\n        logit_bias: Optional[Dict[str, float]] = None,\n        logprobs: Optional[bool] = None,\n        top_logprobs: Optional[int] = None,\n        **kwargs,  # type: ignore\n    ) -> Union[\n        llama_types.CreateChatCompletionResponse,\n        Iterator[llama_types.CreateChatCompletionStreamResponse],\n    ]:\n        result = chat_formatter(\n            messages=messages,\n            functions=functions,\n            function_call=function_call,\n            tools=tools,\n            tool_choice=tool_choice,\n        )\n        prompt = llama.tokenize(\n            result.prompt.encode(\"utf-8\"),\n            add_bos=not result.added_special,\n            special=True,\n        )\n        if result.stop is not None:\n            stop = [] if stop is None else [stop] if isinstance(stop, str) else stop\n            rstop = result.stop if isinstance(result.stop, list) else [result.stop]\n            stop = stop + rstop\n\n        stopping_criteria = None\n        if result.stopping_criteria is not None:\n            stopping_criteria = result.stopping_criteria\n\n        if response_format is not None and response_format[\"type\"] == \"json_object\":\n            grammar = _grammar_for_response_format(\n                response_format, verbose=llama.verbose\n            )\n\n        # Convert legacy functions to tools\n        if functions is not None:\n            tools = [\n        
        {\n                    \"type\": \"function\",\n                    \"function\": function,\n                }\n                for function in functions\n            ]\n\n        # Convert legacy function_call to tool_choice\n        if function_call is not None:\n            if isinstance(function_call, str) and (\n                function_call == \"none\" or function_call == \"auto\"\n            ):\n                tool_choice = function_call\n            if isinstance(function_call, dict) and \"name\" in function_call:\n                tool_choice = {\n                    \"type\": \"function\",\n                    \"function\": {\n                        \"name\": function_call[\"name\"],\n                    },\n                }\n\n        tool = None\n        if (\n            tool_choice is not None\n            and isinstance(tool_choice, dict)\n            and tools is not None\n        ):\n            name = tool_choice[\"function\"][\"name\"]\n            tool = next((t for t in tools if t[\"function\"][\"name\"] == name), None)\n            if tool is None:\n                raise ValueError(f\"Tool choice '{name}' not found in tools.\")\n            schema = tool[\"function\"][\"parameters\"]\n            try:\n                # create grammar from json schema\n                grammar = llama_grammar.LlamaGrammar.from_json_schema(\n                    json.dumps(schema), verbose=llama.verbose\n                )\n            except Exception as e:\n                if llama.verbose:\n                    print(str(e), file=sys.stderr)\n                grammar = llama_grammar.LlamaGrammar.from_string(\n                    llama_grammar.JSON_GBNF, verbose=llama.verbose\n                )\n\n        completion_or_chunks = llama.create_completion(\n            prompt=prompt,\n            temperature=temperature,\n            top_p=top_p,\n            top_k=top_k,\n            min_p=min_p,\n            typical_p=typical_p,\n            
logprobs=top_logprobs if logprobs else None,\n            stream=stream,\n            stop=stop,\n            seed=seed,\n            max_tokens=max_tokens,\n            presence_penalty=presence_penalty,\n            frequency_penalty=frequency_penalty,\n            repeat_penalty=repeat_penalty,\n            tfs_z=tfs_z,\n            mirostat_mode=mirostat_mode,\n            mirostat_tau=mirostat_tau,\n            mirostat_eta=mirostat_eta,\n            model=model,\n            logits_processor=logits_processor,\n            stopping_criteria=stopping_criteria,\n            grammar=grammar,\n            logit_bias=logit_bias,\n        )\n        if tool is not None:\n            tool_name = tool[\"function\"][\"name\"]\n            return _convert_completion_to_chat_function(\n                tool_name, completion_or_chunks, stream\n            )\n        return _convert_completion_to_chat(completion_or_chunks, stream=stream)\n\n    return chat_completion_handler\n\n\ndef hf_autotokenizer_to_chat_formatter(\n    pretrained_model_name_or_path: Union[str, os.PathLike[str]]\n) -> ChatFormatter:\n    # https://huggingface.co/docs/transformers/main/chat_templating\n    # https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1#instruction-format\n    # https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/blob/main/tokenizer_config.json\n    from transformers import AutoTokenizer  # type: ignore\n\n    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)  # type: ignore\n\n    def format_autotokenizer(\n        messages: List[llama_types.ChatCompletionRequestMessage],\n        **kwargs: Any,\n    ) -> ChatFormatterResponse:\n        tokenizer.use_default_system_prompt = False  # type: ignore\n        prompt: str = tokenizer.apply_chat_template(messages, tokenize=False)  # type: ignore\n        assert isinstance(prompt, str)\n        # Return formatted prompt and eos token by default\n        return ChatFormatterResponse(\n            
prompt=prompt, stop=tokenizer.eos_token, added_special=True\n        )\n\n    return format_autotokenizer\n\n\ndef hf_autotokenizer_to_chat_completion_handler(\n    pretrained_model_name_or_path: Union[str, os.PathLike[str]]\n) -> LlamaChatCompletionHandler:\n    chat_formatter = hf_autotokenizer_to_chat_formatter(pretrained_model_name_or_path)\n    return chat_formatter_to_chat_completion_handler(chat_formatter)\n\n\ndef hf_tokenizer_config_to_chat_formatter(\n    tokenizer_config: Dict[str, Any],\n    add_generation_prompt: bool = True,\n) -> ChatFormatter:\n    assert isinstance(tokenizer_config, dict)\n\n    assert \"chat_template\" in tokenizer_config\n    assert isinstance(tokenizer_config[\"chat_template\"], str)\n    chat_template = tokenizer_config[\"chat_template\"]\n\n    assert \"bos_token\" in tokenizer_config\n    assert isinstance(tokenizer_config[\"bos_token\"], str)\n    bos_token = tokenizer_config[\"bos_token\"]\n\n    assert \"eos_token\" in tokenizer_config\n    assert isinstance(tokenizer_config[\"eos_token\"], str)\n    eos_token = tokenizer_config[\"eos_token\"]\n\n    env = ImmutableSandboxedEnvironment(\n        trim_blocks=True,\n        lstrip_blocks=True,\n    ).from_string(chat_template)\n\n    def format_tokenizer_config(\n        messages: List[llama_types.ChatCompletionRequestMessage],\n        **kwargs: Any,\n    ) -> ChatFormatterResponse:\n        # TODO: verify this is correct\n        # Add a blank assistant message to the end of the messages to prompt the model to generate a response\n        if add_generation_prompt:\n            messages = [\n                *messages,\n                llama_types.ChatCompletionRequestAssistantMessage(\n                    role=\"assistant\", content=\"\"\n                ),\n            ]\n        prompt = env.render(\n            messages=messages,\n            bos_token=bos_token,\n            eos_token=eos_token,\n        )\n        return ChatFormatterResponse(\n            
prompt=prompt, stop=[eos_token, bos_token], added_special=True\n        )\n\n    return format_tokenizer_config\n\n\ndef hf_tokenizer_config_to_chat_completion_handler(\n    tokenizer_config: Dict[str, Any],\n    add_generation_prompt: bool = True,\n) -> LlamaChatCompletionHandler:\n    chat_formatter = hf_tokenizer_config_to_chat_formatter(\n        tokenizer_config, add_generation_prompt=add_generation_prompt\n    )\n    return chat_formatter_to_chat_completion_handler(chat_formatter)\n\n\ndef guess_chat_format_from_gguf_metadata(metadata: Dict[str, str]) -> Optional[str]:\n    if \"tokenizer.chat_template\" not in metadata:\n        return None\n\n    if metadata[\"tokenizer.chat_template\"] == CHATML_CHAT_TEMPLATE:\n        return \"chatml\"\n\n    if (\n        metadata[\"tokenizer.chat_template\"] == MISTRAL_INSTRUCT_CHAT_TEMPLATE\n        or metadata[\"tokenizer.chat_template\"] == MIXTRAL_INSTRUCT_CHAT_TEMPLATE\n    ):\n        return \"mistral-instruct\"\n\n    if metadata[\"tokenizer.chat_template\"] == LLAMA3_INSTRUCT_CHAT_TEMPLATE:\n        return \"llama-3\"\n\n    return None\n\n\n### Utility functions for formatting chat prompts ###\n# TODO: Replace these with jinja2 templates\n\n\ndef _get_system_message(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n) -> str:\n    \"\"\"Get the first system message.\"\"\"\n    for message in messages:\n        if message[\"role\"] == \"system\":\n            return message[\"content\"] or \"\"\n    return \"\"\n\n\ndef _map_roles(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    role_map: Dict[str, str],\n) -> List[Tuple[str, Optional[str]]]:\n    \"\"\"Map the message roles.\"\"\"\n    output: List[Tuple[str, Optional[str]]] = []\n    for message in messages:\n        role = message[\"role\"]\n        if role in role_map:\n            content: str | None = (\n                message[\"content\"] if isinstance(message[\"content\"], str) else None\n            )\n            
output.append((role_map[role], content))\n    return output\n\n\ndef _format_llama2(\n    system_message: str, messages: List[Tuple[str, Optional[str]]], sep: str, sep2: str\n) -> str:\n    \"\"\"Format the prompt with the llama2 style.\"\"\"\n    seps = [sep, sep2]\n    ret = system_message + sep\n    for i, (role, message) in enumerate(messages):\n        if system_message and i == 0:\n            m = message or \"\"\n            ret += m + seps[i % 2]\n        elif message:\n            ret += role + message + \" \" + seps[i % 2]\n        else:\n            ret += role + \" \"\n    return ret\n\n\ndef _format_add_colon_single(\n    system_message: str, messages: List[Tuple[str, Optional[str]]], sep: str\n) -> str:\n    \"\"\"Format the prompt with the add-colon-single style.\"\"\"\n    ret = system_message + sep\n    for role, message in messages:\n        if message:\n            ret += role + \": \" + message + sep\n        else:\n            ret += role + \":\"\n    return ret\n\n\ndef _format_add_colon_two(\n    system_message: str, messages: List[Tuple[str, Optional[str]]], sep: str, sep2: str\n) -> str:\n    \"\"\"Format the prompt with the add-colon-two style.\"\"\"\n    seps = [sep, sep2]\n    ret = system_message + seps[0]\n    for i, (role, message) in enumerate(messages):\n        if message:\n            ret += role + \": \" + message + seps[i % 2]\n        else:\n            ret += role + \":\"\n    return ret\n\n\ndef _format_no_colon_single(\n    system_message: str, messages: List[Tuple[str, Optional[str]]], sep: str\n) -> str:\n    \"\"\"Format the prompt with the no-colon-single style.\"\"\"\n    ret = system_message\n    for role, message in messages:\n        if message:\n            ret += role + message + sep\n        else:\n            ret += role\n    return ret\n\n\ndef _format_add_colon_space_single(\n    system_message: str, messages: List[Tuple[str, Optional[str]]], sep: str\n) -> str:\n    \"\"\"Format the prompt with the 
add-colon-space-single style.\"\"\"\n    ret = system_message + sep\n    for role, message in messages:\n        if message:\n            ret += role + \": \" + message + sep\n        else:\n            ret += role + \": \"  # must end with a space\n    return ret\n\n\ndef _format_chatml(\n    system_message: str, messages: List[Tuple[str, Optional[str]]], sep: str\n) -> str:\n    \"\"\"Format the prompt with the chatml style.\"\"\"\n    ret = \"\" if system_message == \"\" else system_message + sep + \"\\n\"\n    for role, message in messages:\n        if message:\n            ret += role + \"\\n\" + message + sep + \"\\n\"\n        else:\n            ret += role + \"\\n\"\n    return ret\n\n\ndef _format_chatglm3(\n    system_message: str, messages: List[Tuple[str, Optional[str]]], sep: str\n) -> str:\n    \"\"\"Format the prompt with the chatglm3 style.\"\"\"\n    ret = \"\"\n    if system_message:\n        ret += system_message\n    for role, message in messages:\n        if message:\n            ret += role + \"\\n\" + \" \" + message\n        else:\n            ret += role\n    return ret\n\n\ndef _grammar_for_json(verbose: bool = False):\n    return llama_grammar.LlamaGrammar.from_string(\n        llama_grammar.JSON_GBNF, verbose=verbose\n    )\n\n\ndef _grammar_for_json_schema(\n    schema: str, verbose: bool = False, fallback_to_json: bool = True\n):\n    try:\n        return llama_grammar.LlamaGrammar.from_json_schema(schema, verbose=verbose)\n    except Exception as e:\n        if fallback_to_json:\n            return _grammar_for_json(verbose=verbose)\n        else:\n            raise e\n\n\ndef _grammar_for_response_format(\n    response_format: llama_types.ChatCompletionRequestResponseFormat,\n    verbose: bool = False,\n):\n    if response_format[\"type\"] != \"json_object\":\n        return None\n\n    if \"schema\" in response_format:\n        return _grammar_for_json_schema(\n            json.dumps(response_format[\"schema\"]), 
verbose=verbose\n        )\n    else:\n        return _grammar_for_json(verbose=verbose)\n\n\n### Chat Formats ###\n\n\ndef register_chat_format(name: str):\n    def decorator(f: ChatFormatter):\n        chat_completion_handler = chat_formatter_to_chat_completion_handler(f)\n        LlamaChatCompletionHandlerRegistry().register_chat_completion_handler(\n            name, chat_completion_handler\n        )\n        return f\n\n    return decorator\n\n\n# see https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py\n# system prompt is \"embedded\" in the first message\n@register_chat_format(\"llama-2\")\ndef format_llama2(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _system_template = \"[INST] <<SYS>>\\n{system_message}\\n<</SYS>>\"\n    _roles = dict(user=\"<s>[INST]\", assistant=\"[/INST]\")\n    _messages = _map_roles(messages, _roles)\n    system_message = _get_system_message(messages)\n    if system_message:\n        system_message = _system_template.format(system_message=system_message)\n    _prompt = _format_llama2(system_message, _messages, \" \", \"</s>\") + \"[/INST]\"\n    return ChatFormatterResponse(prompt=_prompt)\n\n\n# Chat format for Llama-3 models, see more details at:\n# https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L202-L229\n@register_chat_format(\"llama-3\")\ndef format_llama3(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _roles = dict(\n        system=\"<|start_header_id|>system<|end_header_id|>\\n\\n\",\n        user=\"<|start_header_id|>user<|end_header_id|>\\n\\n\",\n        assistant=\"<|start_header_id|>assistant<|end_header_id|>\\n\\n\",\n    )\n    _sep = \"<|eot_id|>\"\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_no_colon_single(\"\", _messages, 
_sep)\n    return ChatFormatterResponse(prompt=_prompt, stop=_sep)\n\n\n@register_chat_format(\"alpaca\")\ndef format_alpaca(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _roles = dict(user=\"### Instruction\", assistant=\"### Response\")\n    _sep = \"\\n\\n\"\n    _sep2 = \"</s>\"\n    system_message = _get_system_message(messages)\n    _messages = _map_roles(messages, _roles)\n    _prompt = _format_add_colon_two(system_message, _messages, _sep, _sep2)\n    return ChatFormatterResponse(prompt=_prompt)\n\n\n@register_chat_format(\"qwen\")\ndef format_qwen(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _roles = dict(user=\"<|im_start|>user\", assistant=\"<|im_start|>assistant\")\n    system_message = _get_system_message(messages) or \"You are a helpful assistant.\"\n    system_template = \"<|im_start|>system\\n{system_message}\"\n    system_message = system_template.format(system_message=system_message)\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _sep = \"<|im_end|>\"\n    _prompt = _format_chatml(system_message, _messages, _sep)\n    _sep2 = \"<|endoftext|>\"\n    return ChatFormatterResponse(prompt=_prompt, stop=_sep2)\n\n\n@register_chat_format(\"vicuna\")\ndef format(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _system_message = \"A chat between a curious user and an artificial intelligence assistant. 
The assistant gives helpful, detailed, and polite answers to the user's questions.\"\n    _roles = dict(user=\"USER\", assistant=\"ASSISTANT\")\n    _sep = \" \"\n    _sep2 = \"</s>\"\n    system_message = _system_message\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_add_colon_two(system_message, _messages, _sep, _sep2)\n    return ChatFormatterResponse(prompt=_prompt)\n\n\n@register_chat_format(\"oasst_llama\")\ndef format_oasst_llama(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _system_template = \"[INST] <<SYS>>\\n{system_message}\\n<</SYS>>\\n\\n\"\n    _roles = dict(user=\"<|prompter|>\", assistant=\"<|assistant|>\")\n    _sep = \"</s>\"\n    system_message = _get_system_message(messages)\n    system_message = _system_template.format(system_message=system_message)\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_no_colon_single(system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt)\n\n\n@register_chat_format(\"baichuan-2\")\ndef format_baichuan2(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _system_template = \"{system_message}\"\n    _roles = dict(user=\"<reserved_106>\", assistant=\"<reserved_107>\")\n    _sep = \"\"\n    system_message = _get_system_message(messages)\n    system_message = _system_template.format(system_message=system_message)\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_no_colon_single(system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt)\n\n\n@register_chat_format(\"baichuan\")\ndef format_baichuan(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    
_system_template = \"{system_message}\"\n    _roles = dict(user=\"<reserved_102>\", assistant=\"<reserved_103>\")\n    _sep = \"\"\n    system_message = _get_system_message(messages)\n    system_message = _system_template.format(system_message=system_message)\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_no_colon_single(system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt)\n\n\n@register_chat_format(\"openbuddy\")\ndef format_openbuddy(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _system_message = \"\"\"You are a helpful, respectful and honest INTP-T AI Assistant named Buddy. You are talking to a human User.\nAlways answer as helpfully and logically as possible, while being safe. Your answers should not include any harmful, political, religious, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.\nYou can speak fluently in many languages, for example: English, Chinese.\nYou cannot access the internet, but you have vast knowledge, cutoff: 2021-09.\nYou are trained by OpenBuddy team, (https://openbuddy.ai, https://github.com/OpenBuddy/OpenBuddy), you are based on LLaMA and Falcon transformers model, not related to GPT or OpenAI.\n\n\"\"\"\n    _roles = dict(user=\"User\", assistant=\"Assistant\")\n    _sep = \"\\n\"\n    system_message = _system_message\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_add_colon_single(system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt)\n\n\n@register_chat_format(\"redpajama-incite\")\ndef format_redpajama_incite(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _system_message = _get_system_message(messages)\n    _roles = dict(user=\"<human>\", assistant=\"<bot>\")\n    _sep = \"\\n\"\n    _stop = \"<human>\"\n    system_message = _system_message\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_add_colon_single(system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt, stop=_stop)\n\n\n@register_chat_format(\"snoozy\")\ndef format_snoozy(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    system_template = \"### Instruction:\\n{system_message}\"\n    default_system_message = \"The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.\"\n    _system_message = _get_system_message(messages)\n    _system_message = (\n        _system_message if _system_message != \"\" else default_system_message\n    )\n    system_message = 
system_template.format(system_message=_system_message)\n    _roles = dict(user=\"### Prompt\", assistant=\"### Response\")\n    _sep = \"\\n\"\n    _stop = \"###\"\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_add_colon_single(system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt, stop=_stop)\n\n\n@register_chat_format(\"phind\")\ndef format_phind(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _roles = dict(user=\"### User Message\", assistant=\"### Assistant\")\n    _sep = \"\\n\\n\"\n    _system_message = \"### System Prompt\\nYou are an intelligent programming assistant.\"\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_add_colon_single(_system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt)\n\n\n@register_chat_format(\"intel\")\ndef format_intel(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _roles = dict(user=\"### User:\", assistant=\"### Assistant:\")\n    _sep = \"\\n\"\n    _system_message = \"### System:\\n{system_message}\".format(\n        system_message=_get_system_message(messages)\n    )\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_add_colon_single(_system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt)\n\n\n@register_chat_format(\"open-orca\")\ndef format_open_orca(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    system_template = \"{system_message}\"\n    system_message = (\n        \"You are a helpful assistant. Please answer truthfully and write out your \"\n        \"thinking step by step to be sure you get the right answer. 
If you make a mistake or encounter \"\n        \"an error in your thinking, say so out loud and attempt to correct it. If you don't know or \"\n        \"aren't sure about something, say so clearly. You will act as a professional logician, mathematician, \"\n        \"and physicist. You will also act as the most appropriate type of expert to answer any particular \"\n        \"question or solve the relevant problem; state which expert type your are, if so. Also think of \"\n        \"any particular named expert that would be ideal to answer the relevant question or solve the \"\n        \"relevant problem; name and act as them, if appropriate.\"\n    )\n    roles = (\"User\", \"Assistant\")\n    sep = \"<|end_of_turn|>\\n\"\n    # stop_token_ids=[32000, 32001],  # \"<|end_of_turn|>\"\n    stop_str = \"User\"\n    system_message = system_template.format(system_message=system_message)\n    _messages = _map_roles(messages, dict(zip(roles, roles)))\n    _messages.append((roles[1], None))\n    _prompt = _format_add_colon_space_single(system_message, _messages, sep)\n    return ChatFormatterResponse(prompt=_prompt, stop=stop_str)\n\n\n@register_chat_format(\"mistrallite\")\ndef format_mistrallite(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _roles = dict(user=\"<|prompter|>\", assistant=\"</s>\\n<|assistant|>\")\n    _sep = \" \"\n    system_template = \"\"\"<|system|>{system_message}</s>\"\"\"\n    system_message = _get_system_message(messages)\n    system_message = system_template.format(system_message=system_message)\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_no_colon_single(system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt)\n\n\n@register_chat_format(\"zephyr\")\ndef format_zephyr(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> 
ChatFormatterResponse:\n    system_template = \"\"\"<|system|>\n{system_message}\"\"\"\n    system_message = _get_system_message(messages)\n    system_message = system_template.format(system_message=system_message)\n    _roles = dict(user=\"<|user|>\\n\", assistant=\"<|assistant|>\\n\")\n    _sep = \"</s>\"\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_chatml(system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt, stop=_sep)\n\n\n@register_chat_format(\"pygmalion\")\ndef format_pygmalion(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    system_template = \"\"\"<|system|>{system_message}\"\"\"\n    system_message = _get_system_message(messages)\n    system_message = system_template.format(system_message=system_message)\n    _roles = dict(user=\"<|user|>\", assistant=\"<|model|>\")\n    _sep = \"\\n\"\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_chatml(system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt, stop=_sep)\n\n\n@register_chat_format(\"chatml\")\ndef format_chatml(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    system_template = \"\"\"<|im_start|>system\n{system_message}\"\"\"\n    system_message = _get_system_message(messages)\n    system_message = system_template.format(system_message=system_message)\n    _roles = dict(user=\"<|im_start|>user\", assistant=\"<|im_start|>assistant\")\n    _sep = \"<|im_end|>\"\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_chatml(system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt, stop=_sep)\n\n\n@register_chat_format(\"mistral-instruct\")\ndef format_mistral_instruct(\n    
messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    eos = \"</s>\"\n    stop = eos\n    prompt = \"\"\n    for message in messages:\n        if (\n            message[\"role\"] == \"user\"\n            and message[\"content\"] is not None\n            and isinstance(message[\"content\"], str)\n        ):\n            prompt += \"[INST] \" + message[\"content\"]\n        elif message[\"role\"] == \"assistant\" and message[\"content\"] is not None:\n            prompt += \" [/INST]\" + message[\"content\"] + eos\n    prompt += \" [/INST]\"\n    return ChatFormatterResponse(prompt=prompt, stop=stop)\n\n\n@register_chat_format(\"chatglm3\")\ndef format_chatglm3(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    system_template = \"\"\"<|system|>\n{system_message}\"\"\"\n    system_message = _get_system_message(messages)\n    system_message = system_template.format(system_message=system_message)\n    _roles = dict(user=\"<|user|>\", assistant=\"<|assistant|>\")\n    _sep = \"</s>\"\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_chatglm3(system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt, stop=_sep)\n\n\n@register_chat_format(\"openchat\")\ndef format_openchat(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    system_template = \"{system_message}<|end_of_turn|>\"\n    system_message = _get_system_message(messages)\n    system_message = system_template.format(system_message=system_message)\n    _roles = dict(\n        user=\"GPT4 Correct User: \", assistant=\"<|end_of_turn|>GPT4 Correct Assistant: \"\n    )\n    _sep = \"<|end_of_turn|>\"\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = 
_format_chatml(system_message, _messages, _sep)\n    return ChatFormatterResponse(prompt=_prompt, stop=_sep)\n\n\n# Chat format for Saiga models, see more details and available models:\n# https://huggingface.co/collections/IlyaGusev/saiga2-saigamistral-6505d4ccc3d1e53166b636cd\n@register_chat_format(\"saiga\")\ndef format_saiga(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    _message_template = \"<s>{role}\\n{content}</s>\"\n    _roles = dict(user=\"user\", bot=\"bot\", system=\"system\")\n    _messages = _map_roles(messages, _roles)\n\n    _prompt = \"\"\n    for role, content in _messages:\n        if content:\n            _prompt += _message_template.format(role=role, content=content)\n        else:\n            _prompt += f\"<s>{role}\\n\"\n    # Response template\n    _prompt += \"<s>bot\"\n    return ChatFormatterResponse(prompt=_prompt.strip())\n\n\n# Chat format for Google's Gemma models, see more details and available models:\n# https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b\n@register_chat_format(\"gemma\")\ndef format_gemma(\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    **kwargs: Any,\n) -> ChatFormatterResponse:\n    system_message = _get_system_message(messages)\n    if system_message != \"\":\n        logger.debug(\n            \"`role='system'` messages are not allowed on Google's Gemma models.\"\n        )\n    _roles = dict(user=\"<start_of_turn>user\\n\", assistant=\"<start_of_turn>model\\n\")\n    _sep = \"<end_of_turn>\\n\"\n    _messages = _map_roles(messages, _roles)\n    _messages.append((_roles[\"assistant\"], None))\n    _prompt = _format_no_colon_single(system_message=\"\", messages=_messages, sep=_sep)\n    return ChatFormatterResponse(prompt=_prompt, stop=_sep)\n\n\n# Tricky chat formats that require custom chat handlers\n\n\n@register_chat_completion_handler(\"functionary\")\ndef functionary_chat_handler(\n    
llama: llama.Llama,\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    functions: Optional[List[llama_types.ChatCompletionFunction]] = None,\n    function_call: Optional[llama_types.ChatCompletionRequestFunctionCall] = None,\n    tools: Optional[List[llama_types.ChatCompletionTool]] = None,\n    tool_choice: Optional[llama_types.ChatCompletionToolChoiceOption] = None,\n    temperature: float = 0.2,\n    top_p: float = 0.95,\n    top_k: int = 40,\n    min_p: float = 0.05,\n    typical_p: float = 1.0,\n    stream: bool = False,\n    stop: Optional[Union[str, List[str]]] = [],\n    response_format: Optional[llama_types.ChatCompletionRequestResponseFormat] = None,\n    max_tokens: Optional[int] = None,\n    presence_penalty: float = 0.0,\n    frequency_penalty: float = 0.0,\n    repeat_penalty: float = 1.1,\n    tfs_z: float = 1.0,\n    mirostat_mode: int = 0,\n    mirostat_tau: float = 5.0,\n    mirostat_eta: float = 0.1,\n    model: Optional[str] = None,\n    logits_processor: Optional[llama.LogitsProcessorList] = None,\n    grammar: Optional[llama.LlamaGrammar] = None,\n    **kwargs,  # type: ignore\n) -> Union[llama_types.ChatCompletion, Iterator[llama_types.ChatCompletionChunk]]:\n    SYSTEM_MESSAGE = \"\"\"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. 
The assistant calls functions with appropriate input when necessary\"\"\"\n\n    def generate_type_definition(\n        param: Dict[str, llama_types.JsonType], indent_level: int, shared_defs\n    ) -> str:\n        indent = \"  \" * indent_level\n        if \"$ref\" in param:\n            # Reference to a shared definition\n            ref_name = param[\"$ref\"].split(\"/\")[\n                -1\n            ]  # Extract the type name from the reference\n            return ref_name\n        elif param.get(\"type\") == \"array\":\n            items = param.get(\"items\", {})\n            item_type = generate_type_definition(items, indent_level + 1, shared_defs)\n            return f\"Array<{item_type}>\"\n        elif param.get(\"type\") == \"object\":\n            properties = param.get(\"properties\", {})\n            nested_schema = \"{\\n\"\n            for nested_param_name, nested_param in properties.items():\n                nested_param_type = generate_type_definition(\n                    nested_param, indent_level + 1, shared_defs\n                )\n                nested_schema += (\n                    f\"{indent}  {nested_param_name}: {nested_param_type},\\n\"\n                )\n            nested_schema += indent + \"}\"\n            return nested_schema\n        elif \"enum\" in param:\n            # Enum type\n            return \" | \".join([f'\"{enum_value}\"' for enum_value in param[\"enum\"]])\n        else:\n            # Simple type\n            return param.get(\"type\", \"any\")\n\n    def generate_shared_definitions(shared_defs, indent_level: int) -> str:\n        indent = \"  \" * indent_level\n        shared_definitions = \"\"\n        for def_name, def_properties in shared_defs.items():\n            shared_definitions += f\"{indent}type {def_name} = \"\n            if def_properties.get(\"type\") == \"object\":\n                shared_definitions += generate_type_definition(\n                    def_properties, indent_level, 
shared_defs\n                )\n            elif \"enum\" in def_properties:\n                # Enum type\n                shared_definitions += \" | \".join(\n                    [f'\"{enum_value}\"' for enum_value in def_properties[\"enum\"]]\n                )\n            shared_definitions += \";\\n\"\n        return shared_definitions\n\n    def generate_schema_from_functions(functions, namespace=\"functions\") -> str:\n        schema = (\n            \"// Supported function definitions that should be called when necessary.\\n\"\n        )\n        schema += f\"namespace {namespace} {{\\n\\n\"\n\n        # Generate shared definitions\n        shared_definitions = {}\n        for function in functions:\n            parameters = function.get(\"parameters\", {})\n            shared_definitions.update(parameters.get(\"$defs\", {}))\n\n        schema += generate_shared_definitions(shared_definitions, 1)\n\n        for function in functions:\n            function_name = function[\"name\"]\n            description = function.get(\"description\", \"\")\n            parameters = function.get(\"parameters\", {})\n            required_params = parameters.get(\"required\", [])\n\n            schema += f\"  // {description}\\n\"\n            schema += f\"  type {function_name} = (_: {{\\n\"\n\n            for param_name, param in parameters.get(\"properties\", {}).items():\n                param_description = param.get(\"description\", \"\")\n                param_type = generate_type_definition(param, 2, shared_definitions)\n                optional_indicator = \"\" if param_name in required_params else \"?\"\n                schema += f\"    // {param_description}\\n\"\n                schema += f\"    {param_name}{optional_indicator}: {param_type},\\n\"\n            schema += \"  }) => any;\\n\\n\"\n\n        schema += \"}} // namespace {}\\n\".format(namespace)\n        return schema\n\n    def prepare_messages_for_inference(\n        messages: 
List[llama_types.ChatCompletionRequestMessage],\n        functions: Optional[List[llama_types.ChatCompletionFunctions]] = None,\n        tools: Optional[List[llama_types.ChatCompletionTool]] = None,\n    ):\n        all_messages: List[llama_types.ChatCompletionRequestMessage] = []\n        if functions is not None:\n            all_messages.append(\n                llama_types.ChatCompletionRequestSystemMessage(\n                    role=\"system\", content=generate_schema_from_functions(functions)\n                )\n            )\n\n        if tools is not None:\n            all_messages.append(\n                llama_types.ChatCompletionRequestSystemMessage(\n                    role=\"system\",\n                    content=generate_schema_from_functions(\n                        [\n                            tool[\"function\"]\n                            for tool in tools\n                            if tool[\"type\"] == \"function\"\n                        ]\n                    ),\n                )\n            )\n\n        all_messages.append(\n            llama_types.ChatCompletionRequestSystemMessage(\n                role=\"system\", content=SYSTEM_MESSAGE\n            )\n        )\n\n        for message in messages:\n            # Function call responses\n            if message[\"role\"] == \"function\" and \"name\" in message:\n                message[\"name\"] = f\"functions.{message['name']}\"\n            # Function call requests by assistant\n            if \"function_call\" in message:\n                message[\"function_call\"][\n                    \"name\"\n                ] = f\"functions.{message['function_call']['name']}\"\n            all_messages.append(message)\n\n        all_messages.append(\n            llama_types.ChatCompletionRequestAssistantMessage(\n                role=\"assistant\", content=None\n            )\n        )\n\n        def message_to_str(msg: llama_types.ChatCompletionRequestMessage):\n            if msg[\"role\"] 
== \"system\":\n                return f\"system:\\n{msg['content']}\\n\"\n\n            elif msg[\"role\"] == \"function\" and \"name\" in msg:\n                return f\"function name={msg['name']}:\\n{msg['content']}\\n\"\n            elif msg[\"role\"] == \"function\" and \"function_call\" in msg:\n                return f\"function name={msg['function_call']['name']}:\\n{msg['function_call']['arguments']}\\n\"\n            elif msg[\"role\"] == \"tool\":\n                if msg[\"content\"] is not None:\n                    return f\"function name={msg['tool_call_id']}:\\n{msg['content']}\\n\"\n                else:\n                    return f\"function name={msg['tool_call_id']}\\n\"\n            elif msg[\"role\"] == \"user\":\n                if msg[\"content\"] is None:\n                    return \"user:\\n</s></s>\\n\"\n                else:\n                    return f\"user:\\n</s>{msg['content']}</s>\\n\"\n            elif msg[\"role\"] == \"assistant\":\n                if msg[\"content\"] is not None and \"function_call\" in msg:\n                    return f\"assistant:\\n{msg['content']}\\nassistant to={msg['function_call']['name']}:\\n{msg['function_call']['arguments']}</s>\\n\"\n                elif \"function_call\" in msg:\n                    return f\"assistant to={msg['function_call']['name']}:\\n{msg['function_call']['arguments']}</s>\\n\"\n                elif \"tool_calls\" in msg and len(msg[\"tool_calls\"]) > 0:\n                    for tool_call in msg[\n                        \"tool_calls\"\n                    ]:  # NOTE: probably doesn't work with the functionary model\n                        return f\"assistant to={tool_call['id']}:\\n{tool_call['function']['arguments']}</s>\\n\"\n                elif msg[\"content\"] is None:\n                    return \"assistant\"\n                else:\n                    return f\"assistant:\\n{msg['content']}\\n\"\n            else:\n                raise ValueError(f\"Unsupported 
role: {msg['role']}\")\n\n        return \"\".join([message_to_str(msg) for msg in all_messages])\n\n    if tools is not None:\n        functions = [tool[\"function\"] for tool in tools if tool[\"type\"] == \"function\"]\n\n    if tool_choice is not None:\n        function_call = (\n            tool_choice if isinstance(tool_choice, str) else tool_choice[\"function\"]\n        )\n\n    prompt = prepare_messages_for_inference(messages, functions, tools)\n\n    if function_call is None and (functions is None or len(functions) == 0):\n        completion_or_completion_chunks = llama.create_completion(\n            prompt=prompt + \":\\n\",\n            temperature=temperature,\n            top_p=top_p,\n            top_k=top_k,\n            min_p=min_p,\n            typical_p=typical_p,\n            stream=stream,\n            stop=[\"user:\", \"</s>\"],\n            max_tokens=max_tokens,\n            presence_penalty=presence_penalty,\n            frequency_penalty=frequency_penalty,\n            repeat_penalty=repeat_penalty,\n            tfs_z=tfs_z,\n            mirostat_mode=mirostat_mode,\n            mirostat_tau=mirostat_tau,\n            mirostat_eta=mirostat_eta,\n            model=model,\n            logits_processor=logits_processor,\n            grammar=grammar,\n        )\n        return _convert_completion_to_chat(completion_or_completion_chunks, stream=stream)  # type: ignore\n\n    if function_call is None or (\n        isinstance(function_call, str) and function_call == \"auto\"\n    ):\n        stop = \"\\n\"\n        completion: llama_types.Completion = llama.create_completion(\n            prompt=prompt, stop=stop, stream=False\n        )  # type: ignore\n        completion_text = completion[\"choices\"][0][\"text\"]\n        # strip \" to=functions.\" and ending \":\"\n        function_call = completion_text.split(\".\")[-1][:-1]\n        new_prompt = prompt + completion_text + stop\n    elif isinstance(function_call, str) and function_call != 
\"none\":\n        new_prompt = prompt + \":\\n\"\n    elif isinstance(function_call, dict):\n        new_prompt = prompt + f\" to=functions.{function_call['name']}:\\n\"\n        function_call = function_call[\"name\"]\n    else:\n        new_prompt = prompt + \":\\n\"\n\n    function_body = None\n    for function in functions or []:\n        if function[\"name\"] == function_call:\n            function_body = function[\"parameters\"]\n            break\n    for tool in tools or []:\n        if tool[\"type\"] == \"function\" and tool[\"function\"][\"name\"] == function_call:\n            function_body = tool[\"function\"][\"parameters\"]\n            break\n\n    if function_body is not None:\n        try:\n            with suppress_stdout_stderr(disable=llama.verbose):\n                grammar_text = llama_grammar.json_schema_to_gbnf(\n                    json.dumps(function_body)\n                )\n                grammar = llama_grammar.LlamaGrammar.from_string(\n                    grammar_text,\n                    verbose=llama.verbose,\n                )\n                print(grammar_text)\n        except Exception as e:\n            if llama.verbose:\n                print(\n                    \"Failed to parse function body as JSON schema, falling back to default grammar\"\n                )\n                print(e)\n            with suppress_stdout_stderr(disable=llama.verbose):\n                grammar = llama_grammar.LlamaGrammar.from_string(\n                    llama_grammar.JSON_GBNF,\n                    verbose=llama.verbose,\n                )\n    else:\n        with suppress_stdout_stderr(disable=llama.verbose):\n            grammar = llama_grammar.LlamaGrammar.from_string(\n                llama_grammar.JSON_GBNF, verbose=llama.verbose\n            )\n\n    completion: llama_types.Completion = llama.create_completion(\n        prompt=new_prompt,\n        stop=[\"user:\", \"</s>\"],\n        
stream=False,\n        grammar=grammar,\n        max_tokens=max_tokens,\n        temperature=temperature,\n        top_p=top_p,\n        top_k=top_k,\n        min_p=min_p,\n        typical_p=typical_p,\n        presence_penalty=presence_penalty,\n        frequency_penalty=frequency_penalty,\n        repeat_penalty=repeat_penalty,\n        tfs_z=tfs_z,\n        mirostat_mode=mirostat_mode,\n        mirostat_tau=mirostat_tau,\n        mirostat_eta=mirostat_eta,\n        model=model,\n        logits_processor=logits_processor,\n    )  # type: ignore\n\n    assert \"usage\" in completion\n    assert isinstance(function_call, str)\n    assert stream is False  # TODO: support stream mode\n\n    if llama.verbose:\n        print(new_prompt)\n        print(completion[\"choices\"][0][\"text\"])\n\n    # TODO: support stream mode\n    return llama_types.CreateChatCompletionResponse(\n        id=\"chat\" + completion[\"id\"],\n        object=\"chat.completion\",\n        created=completion[\"created\"],\n        model=completion[\"model\"],\n        choices=[\n            {\n                \"index\": 0,\n                \"message\": {\n                    \"role\": \"assistant\",\n                    \"content\": None,\n                    \"function_call\": {\n                        \"name\": function_call,\n                        \"arguments\": completion[\"choices\"][0][\"text\"],\n                    },\n                    \"tool_calls\": [\n                        {\n                            \"id\": function_call,\n                            \"type\": \"function\",\n                            \"function\": {\n                                \"name\": function_call,\n                                \"arguments\": completion[\"choices\"][0][\"text\"],\n                            },\n                        }\n                    ],\n                },\n                \"logprobs\": 
_convert_text_completion_logprobs_to_chat(completion[\"choices\"][0][\"logprobs\"]),\n                \"finish_reason\": \"tool_calls\",\n            }\n        ],\n        usage=completion[\"usage\"],\n    )\n\n\n@register_chat_completion_handler(\"functionary-v1\")\n@register_chat_completion_handler(\"functionary-v2\")\ndef functionary_v1_v2_chat_handler(\n    llama: llama.Llama,\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    functions: Optional[List[llama_types.ChatCompletionFunction]] = None,\n    function_call: Optional[llama_types.ChatCompletionRequestFunctionCall] = None,\n    tools: Optional[List[llama_types.ChatCompletionTool]] = None,\n    tool_choice: Optional[llama_types.ChatCompletionToolChoiceOption] = None,\n    temperature: float = 0.2,\n    top_p: float = 0.95,\n    top_k: int = 40,\n    min_p: float = 0.05,\n    typical_p: float = 1.0,\n    stream: bool = False,\n    stop: Optional[Union[str, List[str]]] = [],\n    response_format: Optional[llama_types.ChatCompletionRequestResponseFormat] = None,\n    max_tokens: Optional[int] = None,\n    presence_penalty: float = 0.0,\n    frequency_penalty: float = 0.0,\n    repeat_penalty: float = 1.1,\n    tfs_z: float = 1.0,\n    mirostat_mode: int = 0,\n    mirostat_tau: float = 5.0,\n    mirostat_eta: float = 0.1,\n    model: Optional[str] = None,\n    logits_processor: Optional[llama.LogitsProcessorList] = None,\n    grammar: Optional[llama.LlamaGrammar] = None,\n    **kwargs,  # type: ignore\n) -> Union[llama_types.ChatCompletion, Iterator[llama_types.ChatCompletionChunk]]:\n    SYSTEM_MESSAGE = \"\"\"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. 
The assistant calls functions with appropriate input when necessary\"\"\"\n\n    tokenizer = llama.tokenizer_\n    assert hasattr(\n        tokenizer, \"hf_tokenizer\"\n    ), \"Please provide a valid hf_tokenizer_path from https://huggingface.co/meetkai when initializing the Llama class\"\n    from transformers import AutoTokenizer\n\n    if \"<|START_OF_FUNCTION_CALL|>\" in tokenizer.hf_tokenizer.additional_special_tokens:\n        version = \"v1\"\n        END_SYSTEM_TOKEN = \"<|END_OF_SYSTEM|>\"\n        END_USER_TOKEN = \"<|END_OF_USER|>\"\n        END_ASSISTANT_TOKEN = \"<|END_OF_ASSISTANT|>\"\n        END_FUNCTION_RESULT_TOKEN = \"<|END_OF_FUNCTION_RESULT|>\"\n        START_FUNCTION_CALL_TOKEN = \"<|START_OF_FUNCTION_CALL|>\"\n        END_FUNCTION_CALL_TOKEN = \"<|END_OF_FUNCTION_CALL|>\"\n    else:\n        version = \"v2\"\n        RECIPIENT_TOKEN = \"<|recipient|>\"\n        FROM_TOKEN = \"<|from|>\"\n        STOP_TOKEN = \"<|stop|>\"\n        CONTENT_TOKEN = \"<|content|>\"\n\n    def generate_type_definition(\n        param: Dict[str, llama_types.JsonType], indent_level: int, shared_defs\n    ) -> str:\n        indent = \"  \" * indent_level\n        if \"$ref\" in param:\n            # Reference to a shared definition\n            ref_name = param[\"$ref\"].split(\"/\")[\n                -1\n            ]  # Extract the type name from the reference\n            return ref_name\n        elif param.get(\"type\") == \"array\":\n            items = param.get(\"items\", {})\n            item_type = generate_type_definition(items, indent_level + 1, shared_defs)\n            return f\"Array<{item_type}>\"\n        elif param.get(\"type\") == \"object\":\n            properties = param.get(\"properties\", {})\n            nested_schema = \"{\\n\"\n            for nested_param_name, nested_param in properties.items():\n                nested_param_type = generate_type_definition(\n                    nested_param, indent_level + 1, shared_defs\n                
)\n                nested_schema += (\n                    f\"{indent}  {nested_param_name}: {nested_param_type},\\n\"\n                )\n            nested_schema += indent + \"}\"\n            return nested_schema\n        elif \"enum\" in param:\n            # Enum type\n            return \" | \".join([f'\"{enum_value}\"' for enum_value in param[\"enum\"]])\n        else:\n            # Simple type\n            return param.get(\"type\", \"any\")\n\n    def generate_shared_definitions(shared_defs, indent_level: int) -> str:\n        indent = \"  \" * indent_level\n        shared_definitions = \"\"\n        for def_name, def_properties in shared_defs.items():\n            shared_definitions += f\"{indent}type {def_name} = \"\n            if def_properties.get(\"type\") == \"object\":\n                shared_definitions += generate_type_definition(\n                    def_properties, indent_level, shared_defs\n                )\n            elif \"enum\" in def_properties:\n                # Enum type\n                shared_definitions += \" | \".join(\n                    [f'\"{enum_value}\"' for enum_value in def_properties[\"enum\"]]\n                )\n            shared_definitions += \";\\n\"\n        return shared_definitions\n\n    def generate_schema_from_functions(functions, namespace=\"functions\") -> str:\n        schema = (\n            \"// Supported function definitions that should be called when necessary.\\n\"\n        )\n        schema += f\"namespace {namespace} {{\\n\\n\"\n\n        # Generate shared definitions\n        shared_definitions = {}\n        for function in functions:\n            parameters = function.get(\"parameters\", {})\n            shared_definitions.update(parameters.get(\"$defs\", {}))\n\n        schema += generate_shared_definitions(shared_definitions, 1)\n\n        for function in functions:\n            function_name = function[\"name\"]\n            description = function.get(\"description\", \"\")\n            
parameters = function.get(\"parameters\", {})\n            required_params = parameters.get(\"required\", [])\n\n            schema += f\"// {description}\\n\"\n            schema += f\"type {function_name} = (_: {{\\n\"\n\n            for param_name, param in parameters.get(\"properties\", {}).items():\n                param_description = param.get(\"description\", \"\")\n                param_type = generate_type_definition(param, 2, shared_definitions)\n                optional_indicator = \"\" if param_name in required_params else \"?\"\n                schema += f\"// {param_description}\\n\"\n                schema += f\"{param_name}{optional_indicator}: {param_type},\\n\"\n            schema += \"}) => any;\\n\\n\"\n\n        schema += \"}} // namespace {}\".format(namespace)\n        return schema\n\n    def prepare_messages_for_inference(\n        messages: List[llama_types.ChatCompletionRequestMessage],\n        tokenizer: AutoTokenizer,\n        version: Literal[\"v1\", \"v2\"],\n        functions: Optional[List[llama_types.ChatCompletionFunctions]] = None,\n        tools: Optional[List[llama_types.ChatCompletionTool]] = None,\n        tool_choice: Union[Dict, str] = \"auto\",\n    ):\n        all_messages: List[llama_types.ChatCompletionRequestMessage] = []\n        if tool_choice == \"none\":\n            all_messages.append(\n                llama_types.ChatCompletionRequestSystemMessage(\n                    role=\"system\", content=generate_schema_from_functions([])\n                )\n            )\n        else:\n            if functions is not None:\n                all_messages.append(\n                    llama_types.ChatCompletionRequestSystemMessage(\n                        role=\"system\", content=generate_schema_from_functions(functions)\n                    )\n                )\n            elif tools is not None and tool_choice != \"none\":\n                all_messages.append(\n                    
llama_types.ChatCompletionRequestSystemMessage(\n                        role=\"system\",\n                        content=generate_schema_from_functions(\n                            [\n                                tool[\"function\"]\n                                for tool in tools\n                                if tool[\"type\"] == \"function\"\n                            ]\n                        ),\n                    )\n                )\n\n        all_messages.append(\n            llama_types.ChatCompletionRequestSystemMessage(\n                role=\"system\", content=SYSTEM_MESSAGE\n            )\n        )\n\n        for message in messages:\n            # Function call responses\n            if message[\"role\"] == \"function\" and \"name\" in message:\n                message[\"name\"] = f\"functions.{message['name']}\"\n            # Function call requests by assistant\n            if \"function_call\" in message:\n                message[\"function_call\"][\n                    \"name\"\n                ] = f\"functions.{message['function_call']['name']}\"\n            all_messages.append(message)\n\n        if version == \"v1\":\n            suffix = \"assistant:\\n\"\n        else:\n            suffix = \"<|from|>assistant\\n<|recipient|>\"\n\n        return (\n            tokenizer.hf_tokenizer.apply_chat_template(all_messages, tokenize=False)\n            + suffix\n        )\n\n    if tools is not None:\n        functions = [tool[\"function\"] for tool in tools if tool[\"type\"] == \"function\"]\n\n    if tool_choice is not None:\n        function_call = (\n            tool_choice if isinstance(tool_choice, str) else tool_choice[\"function\"]\n        )\n    elif function_call is not None:\n        pass\n    else:\n        function_call = \"auto\"\n\n    prompt = prepare_messages_for_inference(\n        messages, tokenizer, version, functions, tools, function_call\n    )\n\n    # If no tools/functions are provided\n    if function_call == 
\"none\" or functions is None or len(functions) == 0:\n        if version == \"v1\":\n            stop = END_ASSISTANT_TOKEN\n        else:\n            stop = STOP_TOKEN\n            prompt += \"all\\n<|content|>\"\n\n        completion_or_completion_chunks = llama.create_completion(\n            prompt=prompt,\n            temperature=temperature,\n            top_p=top_p,\n            top_k=top_k,\n            min_p=min_p,\n            typical_p=typical_p,\n            stream=stream,\n            stop=stop,\n            max_tokens=max_tokens,\n            presence_penalty=presence_penalty,\n            frequency_penalty=frequency_penalty,\n            repeat_penalty=repeat_penalty,\n            tfs_z=tfs_z,\n            mirostat_mode=mirostat_mode,\n            mirostat_tau=mirostat_tau,\n            mirostat_eta=mirostat_eta,\n            model=model,\n            logits_processor=logits_processor,\n            grammar=grammar,\n        )\n        if stream is False:\n            completion_or_completion_chunks[\"choices\"][0][\"text\"] = (\n                completion_or_completion_chunks[\"choices\"][0][\"text\"].lstrip()\n            )\n        return _convert_completion_to_chat(completion_or_completion_chunks, stream=stream)  # type: ignore\n\n    def get_grammar(function_call):\n        function_body = None\n        for function in functions or []:\n            if function[\"name\"] == function_call:\n                function_body = function[\"parameters\"]\n                break\n        for tool in tools or []:\n            if tool[\"type\"] == \"function\" and tool[\"function\"][\"name\"] == function_call:\n                function_body = tool[\"function\"][\"parameters\"]\n                break\n\n        try:\n            with suppress_stdout_stderr(disable=llama.verbose):\n                grammar_text = llama_grammar.json_schema_to_gbnf(\n                    json.dumps(function_body)\n                )\n                grammar = 
llama_grammar.LlamaGrammar.from_string(grammar_text)\n        except Exception as e:\n            if llama.verbose:\n                print(\n                    \"Failed to parse function body as JSON schema, falling back to default grammar\"\n                )\n                print(e)\n            with suppress_stdout_stderr(disable=llama.verbose):\n                grammar = llama_grammar.LlamaGrammar.from_string(\n                    llama_grammar.JSON_GBNF, verbose=llama.verbose\n                )\n\n        return grammar\n\n    def create_completion(prompt, stop, grammar):\n        completion = cast(\n            llama_types.Completion,\n            llama.create_completion(\n                prompt=prompt,\n                temperature=temperature,\n                top_p=top_p,\n                top_k=top_k,\n                min_p=min_p,\n                typical_p=typical_p,\n                stream=stream,\n                stop=stop,\n                max_tokens=max_tokens,\n                presence_penalty=presence_penalty,\n                frequency_penalty=frequency_penalty,\n                repeat_penalty=repeat_penalty,\n                tfs_z=tfs_z,\n                mirostat_mode=mirostat_mode,\n                mirostat_tau=mirostat_tau,\n                mirostat_eta=mirostat_eta,\n                model=model,\n                logits_processor=logits_processor,\n                grammar=grammar,\n            ),\n        )\n\n        return completion\n\n    content = \"\"\n    function_calls, function_bodies = [], []\n    completion_tokens = 0\n\n    def generate_streaming(tools, functions, function_call, prompt):\n        assert version == \"v2\", \"Streaming for v1 is not supported\"\n\n        chunk_id, chunk_created = None, None\n\n        # If tool_choice/function_call is provided\n        if isinstance(function_call, dict):\n      
      prompt += f\"{function_call['name']}\\n{CONTENT_TOKEN}\"\n            grammar = get_grammar(function_call[\"name\"])\n            stops = [STOP_TOKEN, FROM_TOKEN]\n            tool_id = \"\".join(\n                [random.choice(string.ascii_letters + string.digits) for _ in range(24)]\n            )\n            completion = create_completion(prompt=prompt, stop=stops, grammar=grammar)\n            completion_text = \"\"\n            first = True\n            for chunk in completion:\n                # Yield the tool/function name first\n                if first:\n                    if tools is not None:\n                        func_call_dict = {\n                            \"tool_calls\": [\n                                {\n                                    \"index\": 0,\n                                    \"id\": \"call_\" + tool_id,\n                                    \"type\": \"function\",\n                                    \"function\": {\n                                        \"name\": function_call[\"name\"],\n                                        \"arguments\": \"\",\n                                    },\n                                }\n                            ]\n                        }\n                    else:\n                        func_call_dict = {\n                            \"function_call\": {\n                                \"name\": function_call[\"name\"],\n                                \"arguments\": \"\",\n                            }\n                        }\n                    yield llama_types.CreateChatCompletionStreamResponse(\n                        id=\"chat\" + chunk[\"id\"],\n                        object=\"chat.completion.chunk\",\n                        created=chunk[\"created\"],\n                        model=chunk[\"model\"],\n                        choices=[\n                            {\n                                \"index\": 0,\n                                \"logprobs\": 
None,\n                                \"delta\": {\n                                    \"role\": None,\n                                    \"content\": None,\n                                    **func_call_dict,\n                                },\n                            }\n                        ],\n                    )\n                    first = False\n                if tools is not None:\n                    func_call_dict = {\n                        \"tool_calls\": [\n                            {\n                                \"index\": 0,\n                                \"id\": \"call_\" + tool_id,\n                                \"type\": \"function\",\n                                \"function\": {\n                                    \"name\": None,\n                                    \"arguments\": chunk[\"choices\"][0][\"text\"].rstrip(),\n                                },\n                            }\n                        ]\n                    }\n                else:\n                    func_call_dict = {\n                        \"function_call\": {\n                            \"name\": None,\n                            \"arguments\": chunk[\"choices\"][0][\"text\"].rstrip(),\n                        }\n                    }\n                if len(chunk[\"choices\"][0][\"text\"].rstrip()) > 0:\n                    yield llama_types.CreateChatCompletionStreamResponse(\n                        id=\"chat\" + chunk[\"id\"],\n                        object=\"chat.completion.chunk\",\n                        created=chunk[\"created\"],\n                        model=chunk[\"model\"],\n                        choices=[\n                            {\n                                \"index\": 0,\n                                \"logprobs\": _convert_text_completion_logprobs_to_chat(chunk[\"choices\"][0][\"logprobs\"]),\n                                \"delta\": {\n                                    \"role\": None,\n        
                            \"content\": None,\n                                    **func_call_dict,\n                                },\n                            }\n                        ],\n                    )\n            # Yield tool_call/function_call stop message\n            yield llama_types.CreateChatCompletionStreamResponse(\n                id=\"chat\" + chunk[\"id\"],\n                object=\"chat.completion.chunk\",\n                created=chunk[\"created\"],\n                model=chunk[\"model\"],\n                choices=[\n                    {\n                        \"index\": 0,\n                        \"finish_reason\": (\n                            \"tool_calls\" if tools is not None else \"function_call\"\n                        ),\n                        \"logprobs\": None,\n                        \"delta\": {\n                            \"role\": None,\n                            \"content\": None,\n                            \"function_call\": None,\n                            \"tool_calls\": None,\n                        },\n                    }\n                ],\n            )\n        # If \"auto\" or no tool_choice/function_call\n        elif isinstance(function_call, str) and function_call == \"auto\":\n            tool_index = 0\n            while True:\n                # Generate function name first\n                grammar = None\n                stops = CONTENT_TOKEN\n                completion = create_completion(\n                    prompt=prompt, stop=stops, grammar=grammar\n                )\n                completion_text = \"\"\n                for chunk in completion:\n                    completion_text += chunk[\"choices\"][0][\"text\"]\n                if chunk_id is None:\n                    chunk_id = chunk[\"id\"]\n                if chunk_created is None:\n                    chunk_created = chunk[\"created\"]\n                function_name = completion_text.strip()\n                if 
function_name == \"all\":\n                    prompt += \"all\\n<|content|>\"\n                    # Yield the first empty message for content\n                    yield llama_types.CreateChatCompletionStreamResponse(\n                        id=\"chat\" + chunk_id,\n                        model=chunk[\"model\"],\n                        created=chunk_created,\n                        object=\"chat.completion.chunk\",\n                        choices=[\n                            {\n                                \"index\": 0,\n                                \"delta\": {\"role\": \"assistant\", \"content\": \"\"},\n                                \"logprobs\": None,\n                                \"finish_reason\": None,\n                            }\n                        ],\n                    )\n                else:\n                    prompt += f\"{function_name}\\n<|content|>\"\n                    grammar = get_grammar(function_name)\n                    tool_id = \"\".join(\n                        [\n                            random.choice(string.ascii_letters + string.digits)\n                            for _ in range(24)\n                        ]\n                    )\n                    if tools is not None:\n                        func_call_dict = {\n                            \"tool_calls\": [\n                                {\n                                    \"index\": tool_index,\n                                    \"id\": \"call_\" + tool_id,\n                                    \"type\": \"function\",\n                                    \"function\": {\n                                        \"name\": function_name,\n                                        \"arguments\": \"\",\n                                    },\n                                }\n                            ]\n                        }\n                    else:\n                        func_call_dict = {\n                            
\"function_call\": {\"name\": function_name, \"arguments\": \"\"}\n                        }\n                    # Stream function name\n                    yield llama_types.CreateChatCompletionStreamResponse(\n                        id=\"chat\" + chunk_id,\n                        object=\"chat.completion.chunk\",\n                        created=chunk_created,\n                        model=chunk[\"model\"],\n                        choices=[\n                            {\n                                \"index\": 0,\n                                \"logprobs\": _convert_text_completion_logprobs_to_chat(chunk[\"choices\"][0][\"logprobs\"]),\n                                \"delta\": {\n                                    \"role\": \"assistant\",\n                                    \"content\": None,\n                                    **func_call_dict,\n                                },\n                            }\n                        ],\n                    )\n                # Generate content\n                stops = [RECIPIENT_TOKEN, STOP_TOKEN]\n                completion = create_completion(\n                    prompt=prompt, stop=stops, grammar=grammar\n                )\n                if function_name == \"all\":\n                    completion_text = \"\"\n                    stop_sequence, buffer, is_end = (\n                        \"\\n<|from|>assistant\\n<|recipient|>\",\n                        [],\n                        False,\n                    )\n                    for i, chunk in enumerate(completion):\n                        completion_text += chunk[\"choices\"][0][\"text\"]\n                        if is_end:\n                            buffer.append(chunk[\"choices\"][0][\"text\"].strip(\" \"))\n                            if stop_sequence.startswith(\"\".join(buffer)):\n                                continue\n                            else:\n                                buffer.pop()\n                         
       while len(buffer) > 0:\n                                    yield llama_types.CreateChatCompletionStreamResponse(\n                                        id=\"chat\" + chunk_id,\n                                        object=\"chat.completion.chunk\",\n                                        created=chunk_created,\n                                        model=chunk[\"model\"],\n                                        choices=[\n                                            {\n                                                \"index\": 0,\n                                                \"logprobs\": _convert_text_completion_logprobs_to_chat(chunk[\"choices\"][0][\"logprobs\"]),\n                                                \"delta\": {\n                                                    \"role\": \"assistant\",\n                                                    \"content\": buffer.pop(0),\n                                                },\n                                            }\n                                        ],\n                                    )\n                                is_end = False\n                        elif chunk[\"choices\"][0][\"text\"] == \"\\n\":\n                            is_end = True\n                            buffer.append(chunk[\"choices\"][0][\"text\"].strip(\" \"))\n                            continue\n\n                        if len(buffer) == 0 and len(chunk[\"choices\"][0][\"text\"]) > 0:\n                            yield llama_types.CreateChatCompletionStreamResponse(\n                                id=\"chat\" + chunk_id,\n                                object=\"chat.completion.chunk\",\n                                created=chunk_created,\n                                model=chunk[\"model\"],\n                                choices=[\n                                    {\n                                        \"index\": 0,\n                                        \"logprobs\": 
_convert_text_completion_logprobs_to_chat(chunk[\"choices\"][0][\"logprobs\"]),\n                                        \"delta\": {\n                                            \"role\": \"assistant\",\n                                            \"content\": (\n                                                chunk[\"choices\"][0][\"text\"]\n                                                if i > 0\n                                                else chunk[\"choices\"][0][\n                                                    \"text\"\n                                                ].lstrip()\n                                            ),\n                                        },\n                                    }\n                                ],\n                            )\n                    # Check whether the model wants to generate another turn\n                    if (\n                        \"<|from|> assistant\" in completion_text\n                        or \"<|from|>assistant\" in completion_text\n                    ):\n                        if completion_text.endswith(\"\\n<|from|>assistant\\n\"):\n                            cleaned_completion_text = completion_text[\n                                : -len(\"\\n<|from|>assistant\\n\")\n                            ].strip()\n                        elif completion_text.endswith(\"\\n<|from|> assistant\\n\"):\n                            cleaned_completion_text = completion_text[\n                                : -len(\"\\n<|from|> assistant\\n\")\n                            ].strip()\n                        else:\n                            cleaned_completion_text = completion_text.strip()\n                        prompt += f\"{cleaned_completion_text}\\n<|from|>assistant\\n<|recipient|>\"\n                    else:\n                        # Yield stop message\n                        yield llama_types.CreateChatCompletionStreamResponse(\n                            id=\"chat\" 
+ chunk_id,\n                            model=chunk[\"model\"],\n                            created=chunk_created,\n                            object=\"chat.completion.chunk\",\n                            choices=[\n                                {\n                                    \"index\": 0,\n                                    \"delta\": {},\n                                    \"logprobs\": None,\n                                    \"finish_reason\": \"stop\",\n                                }\n                            ],\n                        )\n                        break\n                else:\n                    # Check whether the model wants to generate another turn\n                    completion_text = \"\"\n                    for chunk in completion:\n                        completion_text += chunk[\"choices\"][0][\"text\"]\n                        if len(chunk[\"choices\"][0][\"text\"].rstrip()) > 0:\n                            if tools is not None:\n                                func_call_dict = {\n                                    \"tool_calls\": [\n                                        {\n                                            \"index\": tool_index,\n                                            \"id\": \"call_\" + tool_id,\n                                            \"type\": \"function\",\n                                            \"function\": {\n                                                \"name\": None,\n                                                \"arguments\": chunk[\"choices\"][0][\n                                                    \"text\"\n                                                ].rstrip(),\n                                            },\n                                        }\n                                    ]\n                                }\n                            else:\n                                func_call_dict = {\n                                    
\"function_call\": {\n                                        \"name\": None,\n                                        \"arguments\": chunk[\"choices\"][0][\n                                            \"text\"\n                                        ].rstrip(),\n                                    }\n                                }\n                            yield llama_types.CreateChatCompletionStreamResponse(\n                                id=\"chat\" + chunk_id,\n                                object=\"chat.completion.chunk\",\n                                created=chunk_created,\n                                model=chunk[\"model\"],\n                                choices=[\n                                    {\n                                        \"index\": 0,\n                                        \"logprobs\": _convert_text_completion_logprobs_to_chat(chunk[\"choices\"][0][\"logprobs\"]),\n                                        \"delta\": {\n                                            \"role\": None,\n                                            \"content\": None,\n                                            **func_call_dict,\n                                        },\n                                    }\n                                ],\n                            )\n                    prompt += completion_text.strip()\n                    grammar = None\n                    completion = create_completion(\n                        prompt=prompt, stop=stops, grammar=grammar\n                    )\n                    completion_text += \"\".join(\n                        [chunk[\"choices\"][0][\"text\"] for chunk in completion]\n                    )\n                    if (\n                        \"<|from|> assistant\" in completion_text\n                        or \"<|from|>assistant\" in completion_text\n                    ) and tools is not None:\n                        prompt += \"\\n<|from|>assistant\\n<|recipient|>\"\n 
                       tool_index += 1\n                    else:\n                        # Yield tool_call/function_call stop message\n                        yield llama_types.CreateChatCompletionStreamResponse(\n                            id=\"chat\" + chunk_id,\n                            object=\"chat.completion.chunk\",\n                            created=chunk_created,\n                            model=chunk[\"model\"],\n                            choices=[\n                                {\n                                    \"index\": 0,\n                                    \"finish_reason\": (\n                                        \"tool_calls\"\n                                        if tools is not None\n                                        else \"function_call\"\n                                    ),\n                                    \"logprobs\": None,\n                                    \"delta\": {\n                                        \"role\": None,\n                                        \"content\": None,\n                                        \"function_call\": None,\n                                        \"tool_calls\": None,\n                                    },\n                                }\n                            ],\n                        )\n                        break\n\n    if stream is not False:\n        return generate_streaming(\n            tools=tools, functions=functions, function_call=function_call, prompt=prompt\n        )\n    else:\n        if version == \"v1\":\n            # If no or \"auto\" tool_choice/function_call\n            if isinstance(function_call, str) and function_call == \"auto\":\n                stops = [\"\\n\", END_ASSISTANT_TOKEN]\n            # If tool_choice/function_call is provided\n            elif isinstance(function_call, dict):\n                prompt += f\"{START_FUNCTION_CALL_TOKEN}{function_call['name']}:\\n\"\n                stops = 
END_FUNCTION_CALL_TOKEN\n                function_call = function_call[\"name\"]\n                function_calls.append(function_call)\n                grammar = get_grammar(function_call)\n            else:\n                prompt = prompt\n                stops = [\"\\n\", END_ASSISTANT_TOKEN]\n\n            completion = create_completion(prompt=prompt, stop=stops, grammar=grammar)\n            completion_text = completion[\"choices\"][0][\"text\"]\n            completion_tokens += completion[\"usage\"][\"completion_tokens\"]\n\n            # If the generation does not involve a function call\n            if (\n                START_FUNCTION_CALL_TOKEN not in prompt\n                and START_FUNCTION_CALL_TOKEN not in completion_text\n            ):\n                completion[\"usage\"][\"completion_tokens\"] = completion_tokens\n                return _convert_completion_to_chat(completion, stream=stream)  # type: ignore\n            # If the generation involves a function call in completion, generate the parameters\n            elif (\n                START_FUNCTION_CALL_TOKEN not in prompt\n                and START_FUNCTION_CALL_TOKEN in completion_text\n            ):\n                prompt += (\n                    completion_text.replace(\n                        f\"{START_FUNCTION_CALL_TOKEN} \", START_FUNCTION_CALL_TOKEN\n                    )\n                    + \"\\n\"\n                )\n                function_calls.append(\n                    completion_text.split(START_FUNCTION_CALL_TOKEN)[-1][:-1].strip()\n                )\n                grammar = get_grammar(function_calls[-1])\n                completion = create_completion(\n                    prompt=prompt, stop=END_FUNCTION_CALL_TOKEN, grammar=grammar\n                )\n                completion_tokens += completion[\"usage\"][\"completion_tokens\"]\n                function_bodies.append(completion[\"choices\"][0][\"text\"].strip())\n            # If the prompt involves a 
function call, just append generated parameters to function_bodies\n            else:\n                function_bodies.append(completion_text.strip())\n        else:\n            # If tool_choice/function_call is provided\n            if isinstance(function_call, dict):\n                prompt += f\"{function_call['name']}\\n{CONTENT_TOKEN}\"\n                function_call = function_call[\"name\"]\n                function_calls.append(function_call)\n                grammar = get_grammar(function_call)\n                stops = [STOP_TOKEN, FROM_TOKEN]\n                completion = create_completion(\n                    prompt=prompt, stop=stops, grammar=grammar\n                )\n                completion_text = completion[\"choices\"][0][\"text\"]\n                completion_tokens += completion[\"usage\"][\"completion_tokens\"]\n                function_bodies.append(completion_text.strip())\n            # If \"auto\" or no tool_choice/function_call\n            elif isinstance(function_call, str) and function_call == \"auto\":\n                while True:\n                    # Generate function name first\n                    grammar = None\n                    stops = CONTENT_TOKEN\n                    completion = create_completion(\n                        prompt=prompt, stop=stops, grammar=grammar\n                    )\n                    completion_text = completion[\"choices\"][0][\"text\"]\n                    completion_tokens += completion[\"usage\"][\"completion_tokens\"]\n                    function_name = completion_text.strip()\n                    if function_name == \"all\":\n                        prompt += \"all\\n<|content|>\"\n                    else:\n                        function_call = completion_text.strip()\n                        prompt += f\"{function_call}\\n<|content|>\"\n                        function_calls.append(function_call)\n                        grammar = get_grammar(function_call)\n                    # 
Generate content\n                    stops = [RECIPIENT_TOKEN, STOP_TOKEN]\n                    completion = create_completion(\n                        prompt=prompt, stop=stops, grammar=grammar\n                    )\n                    completion_text = completion[\"choices\"][0][\"text\"]\n                    completion_tokens += completion[\"usage\"][\"completion_tokens\"]\n                    if function_name == \"all\":\n                        # Strip a trailing turn marker (with or without the space) exactly once\n                        if completion_text.endswith(\"\\n<|from|>assistant\\n\"):\n                            content += completion_text[: -len(\"\\n<|from|>assistant\\n\")]\n                        elif completion_text.endswith(\"\\n<|from|> assistant\\n\"):\n                            content += completion_text[: -len(\"\\n<|from|> assistant\\n\")]\n                        else:\n                            content += completion_text\n                        content = content.lstrip()\n                        # Check whether the model wants to generate another turn\n                        if (\n                            \"<|from|> assistant\" in completion_text\n                            or \"<|from|>assistant\" in completion_text\n                        ):\n                            if completion_text.endswith(\"\\n<|from|>assistant\\n\"):\n                                cleaned_completion_text = completion_text[\n                                    : -len(\"\\n<|from|>assistant\\n\")\n                                ].strip()\n                            elif completion_text.endswith(\"\\n<|from|> assistant\\n\"):\n                                cleaned_completion_text = completion_text[\n                                    : -len(\"\\n<|from|> assistant\\n\")\n                                ].strip()\n                            else:\n                                cleaned_completion_text = completion_text.strip()\n                            prompt += f\"{cleaned_completion_text}\\n<|from|>assistant\\n<|recipient|>\"\n       
                 else:\n                            break\n                    else:\n                        function_bodies.append(completion_text.strip())\n                        # Check whether the model wants to generate another turn\n                        prompt += completion_text.strip()\n                        grammar = None\n                        completion = create_completion(\n                            prompt=prompt, stop=stops, grammar=grammar\n                        )\n                        completion_tokens += completion[\"usage\"][\"completion_tokens\"]\n                        if (\n                            \"<|from|> assistant\" in completion[\"choices\"][0][\"text\"]\n                            or \"<|from|>assistant\" in completion[\"choices\"][0][\"text\"]\n                        ):\n                            prompt += \"\\n<|from|>assistant\\n<|recipient|>\"\n                        else:\n                            break\n\n        assert \"usage\" in completion\n        assert len(function_calls) == len(function_bodies)\n\n        tool_calls: List[llama_types.ChatCompletionMessageToolCall] = []\n        for function_call, function_body in zip(function_calls, function_bodies):\n            tool_calls.append(\n                {\n                    \"id\": \"call_\"\n                    + \"\".join(\n                        [\n                            random.choice(string.ascii_letters + string.digits)\n                            for _ in range(24)\n                        ]\n                    ),\n                    \"type\": \"function\",\n                    \"function\": {\n                        \"name\": function_call,\n                        \"arguments\": function_body,\n                    },\n                }\n            )\n\n        # TODO: support stream mode\n        function_call_dict: Union[\n            Dict[str, str],\n            Dict[\n                Literal[\"function_call\"],\n                
llama_types.ChatCompletionRequestAssistantMessageFunctionCall,\n            ],\n        ] = {}\n        if len(tool_calls) > 0:\n            if tools is not None:\n                function_call_dict[\"tool_calls\"] = tool_calls\n            else:\n                function_call_dict[\"function_call\"] = {\n                    \"name\": tool_calls[0][\"function\"][\"name\"],\n                    \"arguments\": tool_calls[0][\"function\"][\"arguments\"],\n                }\n        completion[\"usage\"][\"completion_tokens\"] = completion_tokens\n        return llama_types.CreateChatCompletionResponse(\n            id=\"chat\" + completion[\"id\"],\n            object=\"chat.completion\",\n            created=completion[\"created\"],\n            model=completion[\"model\"],\n            choices=[\n                {\n                    \"index\": 0,\n                    \"logprobs\": _convert_text_completion_logprobs_to_chat(completion[\"choices\"][0][\"logprobs\"]),\n                    \"message\": {\n                        \"role\": \"assistant\",\n                        \"content\": None if content == \"\" else content,\n                        **function_call_dict,\n                    },\n                    \"finish_reason\": \"tool_calls\" if len(tool_calls) > 0 else \"stop\",\n                }\n            ],\n            usage=completion[\"usage\"],\n        )\n\n\nclass Llava15ChatHandler:\n    DEFAULT_SYSTEM_MESSAGE: Optional[str] = (\n        \"A chat between a curious human and an artificial intelligence assistant. 
The assistant gives helpful, detailed, and polite answers to the human's questions.\"\n    )\n\n    CHAT_FORMAT = (\n        \"{% for message in messages %}\"\n        \"{% if message.role == 'system' %}\"\n        \"{{ message.content }}\"\n        \"{% endif %}\"\n        \"{% if message.role == 'user' %}\"\n        \"{% if message.content is string %}\"\n        \"\\nUSER: {{ message.content }}\"\n        \"{% endif %}\"\n        \"{% if message.content is iterable %}\"\n        \"\\nUSER: \"\n        \"{% for content in message.content %}\"\n        \"{% if content.type == 'image_url' and content.image_url is string %}\"\n        \"{{ content.image_url }}\"\n        \"{% endif %}\"\n        \"{% if content.type == 'image_url' and content.image_url is mapping %}\"\n        \"{{ content.image_url.url }}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"{% for content in message.content %}\"\n        \"{% if content.type == 'text' %}\"\n        \"{{ content.text }}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"{% endif %}\"\n        \"{% endif %}\"\n        \"{% if message.role == 'assistant' and message.content is not none %}\"\n        \"\\nASSISTANT: {{ message.content }}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"{% if add_generation_prompt %}\"\n        \"\\nASSISTANT: \"\n        \"{% endif %}\"\n    )\n\n    def __init__(self, clip_model_path: str, verbose: bool = True):\n        import llama_cpp.mtmd_cpp as mtmd_cpp\n\n        self.clip_model_path = clip_model_path\n        self.verbose = verbose\n        self._mtmd_cpp = mtmd_cpp\n        self._exit_stack = ExitStack()\n        self.mtmd_ctx: Optional[mtmd_cpp.mtmd_context_p] = None\n\n        if not os.path.exists(clip_model_path):\n            raise ValueError(f\"Clip model path does not exist: {clip_model_path}\")\n\n    def _init_mtmd_context(self, llama_model: llama.Llama):\n        \"\"\"Initialize mtmd context with the llama model.\"\"\"\n        
if self.mtmd_ctx is not None:\n            return  # Already initialized\n\n        with suppress_stdout_stderr(disable=self.verbose):\n            # Get default parameters\n            ctx_params = self._mtmd_cpp.mtmd_context_params_default()\n            ctx_params.use_gpu = True # TODO: Make this configurable\n            ctx_params.print_timings = self.verbose\n            ctx_params.n_threads = llama_model.n_threads\n            ctx_params.verbosity = 2 if self.verbose else 0  # GGML_LOG_LEVEL_INFO = 2\n\n            # Initialize mtmd context\n            self.mtmd_ctx = self._mtmd_cpp.mtmd_init_from_file(\n                self.clip_model_path.encode(),\n                llama_model.model,\n                ctx_params\n            )\n\n            if self.mtmd_ctx is None:\n                raise ValueError(f\"Failed to load mtmd context from: {self.clip_model_path}\")\n\n            # Check if vision is supported\n            if not self._mtmd_cpp.mtmd_support_vision(self.mtmd_ctx):\n                raise ValueError(\"Vision is not supported by this model\")\n\n            def mtmd_free():\n                with suppress_stdout_stderr(disable=self.verbose):\n                    if self.mtmd_ctx is not None:\n                        self._mtmd_cpp.mtmd_free(self.mtmd_ctx)\n                        self.mtmd_ctx = None\n\n            self._exit_stack.callback(mtmd_free)\n\n    def load_image(self, image_url: str) -> bytes:\n        return self._load_image(image_url)\n\n    def _create_bitmap_from_bytes(self, image_bytes: bytes):\n        \"\"\"Create mtmd_bitmap from image bytes.\"\"\"\n        if self.mtmd_ctx is None:\n            raise ValueError(\"mtmd context not initialized\")\n\n        with suppress_stdout_stderr(disable=self.verbose):\n            # Create bitmap from buffer using helper function\n            bitmap = self._mtmd_cpp.mtmd_helper_bitmap_init_from_buf(\n                self.mtmd_ctx,\n                (ctypes.c_uint8 * 
len(image_bytes)).from_buffer(bytearray(image_bytes)),\n                len(image_bytes)\n            )\n            \n            if bitmap is None:\n                raise ValueError(\"Failed to create bitmap from image bytes\")\n            \n            return bitmap\n\n    def __call__(\n        self,\n        *,\n        llama: llama.Llama,\n        messages: List[llama_types.ChatCompletionRequestMessage],\n        functions: Optional[List[llama_types.ChatCompletionFunction]] = None,\n        function_call: Optional[llama_types.ChatCompletionRequestFunctionCall] = None,\n        tools: Optional[List[llama_types.ChatCompletionTool]] = None,\n        tool_choice: Optional[llama_types.ChatCompletionToolChoiceOption] = None,\n        temperature: float = 0.2,\n        top_p: float = 0.95,\n        top_k: int = 40,\n        min_p: float = 0.05,\n        typical_p: float = 1.0,\n        stream: bool = False,\n        stop: Optional[Union[str, List[str]]] = [],\n        seed: Optional[int] = None,\n        response_format: Optional[\n            llama_types.ChatCompletionRequestResponseFormat\n        ] = None,\n        max_tokens: Optional[int] = None,\n        presence_penalty: float = 0.0,\n        frequency_penalty: float = 0.0,\n        repeat_penalty: float = 1.1,\n        tfs_z: float = 1.0,\n        mirostat_mode: int = 0,\n        mirostat_tau: float = 5.0,\n        mirostat_eta: float = 0.1,\n        model: Optional[str] = None,\n        logits_processor: Optional[llama.LogitsProcessorList] = None,\n        grammar: Optional[llama.LlamaGrammar] = None,\n        logit_bias: Optional[Dict[str, float]] = None,\n        logprobs: Optional[bool] = None,\n        top_logprobs: Optional[int] = None,\n        **kwargs,  # type: ignore\n    ) -> Union[\n        llama_types.CreateChatCompletionResponse,\n        Iterator[llama_types.CreateChatCompletionStreamResponse],\n    ]:\n        # Initialize mtmd context\n        self._init_mtmd_context(llama)\n        assert 
self.mtmd_ctx is not None\n\n        system_prompt = _get_system_message(messages)\n        if system_prompt == \"\" and self.DEFAULT_SYSTEM_MESSAGE is not None:\n            messages = [\n                llama_types.ChatCompletionRequestSystemMessage(\n                    role=\"system\", content=self.DEFAULT_SYSTEM_MESSAGE\n                )\n            ] + messages\n\n        image_urls = self.get_image_urls(messages)\n        template = ImmutableSandboxedEnvironment(\n            trim_blocks=True,\n            lstrip_blocks=True,\n        ).from_string(self.CHAT_FORMAT)\n        \n        # Get the default media marker\n        media_marker = self._mtmd_cpp.mtmd_default_marker().decode('utf-8')\n        \n        # Replace image URLs with media markers in the template\n        text = template.render(\n            messages=messages,\n            add_generation_prompt=True,\n            eos_token=llama.detokenize([llama.token_eos()]),\n            bos_token=llama.detokenize([llama.token_bos()]),\n        )\n        \n        # Replace image URLs in text with media markers\n        for image_url in image_urls:\n            text = text.replace(image_url, media_marker)\n\n        if self.verbose:\n            print(text, file=sys.stderr)\n\n        # Create bitmaps from images\n        bitmaps = []\n        bitmap_cleanup = []\n        try:\n            for image_url in image_urls:\n                image_bytes = self.load_image(image_url)\n                bitmap = self._create_bitmap_from_bytes(image_bytes)\n                bitmaps.append(bitmap)\n                bitmap_cleanup.append(bitmap)\n\n            # Create input text structure\n            input_text = self._mtmd_cpp.mtmd_input_text()\n            input_text.text = text.encode('utf-8')\n            input_text.add_special = True\n            input_text.parse_special = True\n\n            # Create input chunks\n            chunks = self._mtmd_cpp.mtmd_input_chunks_init()\n            if chunks is None:\n    
            raise ValueError(\"Failed to create input chunks\")\n\n            try:\n                # Tokenize text and images together\n                bitmap_array = (self._mtmd_cpp.mtmd_bitmap_p_ctypes * len(bitmaps))(*bitmaps)\n                result = self._mtmd_cpp.mtmd_tokenize(\n                    self.mtmd_ctx,\n                    chunks,\n                    ctypes.byref(input_text),\n                    bitmap_array,\n                    len(bitmaps)\n                )\n\n                if result != 0:\n                    raise ValueError(f\"Failed to tokenize input: error code {result}\")\n\n                # Reset llama context\n                llama.reset()\n                llama._ctx.kv_cache_clear()\n\n                # Process each chunk\n                n_past = llama_cpp.llama_pos(0)\n                n_chunks = self._mtmd_cpp.mtmd_input_chunks_size(chunks)\n                \n                for i in range(n_chunks):\n                    chunk = self._mtmd_cpp.mtmd_input_chunks_get(chunks, i)\n                    if chunk is None:\n                        continue\n\n                    chunk_type = self._mtmd_cpp.mtmd_input_chunk_get_type(chunk)\n                    \n                    if chunk_type == self._mtmd_cpp.MTMD_INPUT_CHUNK_TYPE_TEXT:\n                        # Handle text chunk\n                        n_tokens_out = ctypes.c_size_t()\n                        tokens_ptr = self._mtmd_cpp.mtmd_input_chunk_get_tokens_text(\n                            chunk, ctypes.byref(n_tokens_out)\n                        )\n                        \n                        if tokens_ptr and n_tokens_out.value > 0:\n                            # Convert ctypes array to Python list\n                            tokens = [tokens_ptr[j] for j in range(n_tokens_out.value)]\n                            \n                            if llama.n_tokens + len(tokens) > llama.n_ctx():\n                                raise ValueError(\n                    
                f\"Prompt exceeds n_ctx: {llama.n_tokens + len(tokens)} > {llama.n_ctx()}\"\n                                )\n                            llama.eval(tokens)\n                    \n                    elif chunk_type in [self._mtmd_cpp.MTMD_INPUT_CHUNK_TYPE_IMAGE, self._mtmd_cpp.MTMD_INPUT_CHUNK_TYPE_AUDIO]:\n                        # Handle image/audio chunk using helper\n                        chunk_n_tokens = self._mtmd_cpp.mtmd_input_chunk_get_n_tokens(chunk)\n                        \n                        if llama.n_tokens + chunk_n_tokens > llama.n_ctx():\n                            raise ValueError(\n                                f\"Prompt exceeds n_ctx: {llama.n_tokens + chunk_n_tokens} > {llama.n_ctx()}\"\n                            )\n                        \n                        new_n_past = llama_cpp.llama_pos(0)\n                        result = self._mtmd_cpp.mtmd_helper_eval_chunk_single(\n                            self.mtmd_ctx,\n                            llama._ctx.ctx,\n                            chunk,\n                            llama_cpp.llama_pos(llama.n_tokens),\n                            llama_cpp.llama_seq_id(0),\n                            llama.n_batch,\n                            False,  # logits_last\n                            ctypes.byref(new_n_past)\n                        )\n                        \n                        if result != 0:\n                            raise ValueError(f\"Failed to evaluate chunk: error code {result}\")\n                        \n                        # Update llama's token count\n                        llama.n_tokens = new_n_past.value\n\n                # Get prompt tokens to avoid a cache miss\n                prompt = llama.input_ids[: llama.n_tokens].tolist()\n\n            finally:\n                self._mtmd_cpp.mtmd_input_chunks_free(chunks)\n\n        finally:\n            # Cleanup bitmaps\n            for bitmap in bitmap_cleanup:\n               
 self._mtmd_cpp.mtmd_bitmap_free(bitmap)\n\n        # Handle response format and tools (same as before)\n        if response_format is not None and response_format[\"type\"] == \"json_object\":\n            grammar = _grammar_for_response_format(response_format)\n\n        # Convert legacy functions to tools\n        if functions is not None:\n            tools = [\n                {\n                    \"type\": \"function\",\n                    \"function\": function,\n                }\n                for function in functions\n            ]\n\n        # Convert legacy function_call to tool_choice\n        if function_call is not None:\n            if isinstance(function_call, str) and (\n                function_call == \"none\" or function_call == \"auto\"\n            ):\n                tool_choice = function_call\n            if isinstance(function_call, dict) and \"name\" in function_call:\n                tool_choice = {\n                    \"type\": \"function\",\n                    \"function\": {\n                        \"name\": function_call[\"name\"],\n                    },\n                }\n\n        tool = None\n        if (\n            tool_choice is not None\n            and isinstance(tool_choice, dict)\n            and tools is not None\n        ):\n            name = tool_choice[\"function\"][\"name\"]\n            tool = next((t for t in tools if t[\"function\"][\"name\"] == name), None)\n            if tool is None:\n                raise ValueError(f\"Tool choice '{name}' not found in tools.\")\n            schema = tool[\"function\"][\"parameters\"]\n            try:\n                # create grammar from json schema\n                grammar = llama_grammar.LlamaGrammar.from_json_schema(\n                    json.dumps(schema), verbose=llama.verbose\n                )\n            except Exception as e:\n                if llama.verbose:\n                    print(str(e), file=sys.stderr)\n                grammar = 
llama_grammar.LlamaGrammar.from_string(\n                    llama_grammar.JSON_GBNF, verbose=llama.verbose\n                )\n\n        completion_or_chunks = llama.create_completion(\n            prompt=prompt,\n            temperature=temperature,\n            top_p=top_p,\n            top_k=top_k,\n            min_p=min_p,\n            typical_p=typical_p,\n            logprobs=top_logprobs if logprobs else None,\n            stream=stream,\n            stop=stop,\n            seed=seed,\n            max_tokens=max_tokens,\n            presence_penalty=presence_penalty,\n            frequency_penalty=frequency_penalty,\n            repeat_penalty=repeat_penalty,\n            tfs_z=tfs_z,\n            mirostat_mode=mirostat_mode,\n            mirostat_tau=mirostat_tau,\n            mirostat_eta=mirostat_eta,\n            model=model,\n            logits_processor=logits_processor,\n            grammar=grammar,\n            logit_bias=logit_bias,\n        )\n        \n        if tool is not None:\n            tool_name = tool[\"function\"][\"name\"]\n            return _convert_completion_to_chat_function(\n                tool_name, completion_or_chunks, stream\n            )\n        return _convert_completion_to_chat(completion_or_chunks, stream=stream)\n\n    @staticmethod\n    def _load_image(image_url: str) -> bytes:\n        # TODO: Add Pillow support for other image formats beyond (jpg, png)\n        if image_url.startswith(\"data:\"):\n            import base64\n            image_bytes = base64.b64decode(image_url.split(\",\")[1])\n            return image_bytes\n        else:\n            import urllib.request\n            with urllib.request.urlopen(image_url) as f:\n                image_bytes = f.read()\n                return image_bytes\n\n    @staticmethod\n    def get_image_urls(messages: List[llama_types.ChatCompletionRequestMessage]):\n        image_urls: List[str] = []\n        for message in messages:\n            if message[\"role\"] == 
\"user\":\n                if message[\"content\"] is None:\n                    continue\n                for content in message[\"content\"]:\n                    if isinstance(content, dict) and \"type\" in content:\n                        if content[\"type\"] == \"image_url\":\n                            if (\n                                isinstance(content[\"image_url\"], dict)\n                                and \"url\" in content[\"image_url\"]\n                            ):\n                                image_urls.append(content[\"image_url\"][\"url\"])\n                            else:\n                                image_urls.append(content[\"image_url\"])\n        return image_urls\n\n    @staticmethod\n    def split_text_on_image_urls(text: str, image_urls: List[str]):\n        \"\"\"This method is no longer used in the new implementation.\"\"\"\n        def find_first(s: str, substrs: List[str]):\n            for i, substr in enumerate(substrs):\n                pos = s.find(substr)\n                if pos != -1:\n                    return pos, i\n            return None, None\n\n        split_text: List[Tuple[Literal[\"text\", \"image_url\"], str]] = []\n        remaining = text\n        while remaining:\n            # Find first image_url\n            pos, i = find_first(remaining, image_urls)\n            if pos is not None and i is not None:\n                if pos > 0:\n                    split_text.append((\"text\", remaining[:pos]))\n                split_text.append((\"image_url\", image_urls[i]))\n                remaining = remaining[pos + len(image_urls[i]) :]\n            else:\n                split_text.append((\"text\", remaining))\n                remaining = \"\"\n        return split_text\n\n    @classmethod\n    def from_pretrained(\n        cls,\n        repo_id: str,\n        filename: Optional[str],\n        local_dir: Optional[Union[str, os.PathLike[str]]] = None,\n        local_dir_use_symlinks: Union[bool, 
Literal[\"auto\"]] = \"auto\",\n        cache_dir: Optional[Union[str, os.PathLike[str]]] = None,\n        **kwargs: Any,\n    ) -> \"Llava15ChatHandler\":\n        import fnmatch\n        from pathlib import Path\n\n        try:\n            from huggingface_hub import hf_hub_download, HfFileSystem  # type: ignore\n            from huggingface_hub.utils import validate_repo_id  # type: ignore\n        except ImportError:\n            raise ImportError(\n                \"Llava15ChatHandler.from_pretrained requires the huggingface-hub package. \"\n                \"You can install it with `pip install huggingface-hub`.\"\n            )\n\n        validate_repo_id(repo_id)\n\n        hffs = HfFileSystem()\n\n        files = [\n            file[\"name\"] if isinstance(file, dict) else file\n            for file in hffs.ls(repo_id)  # type: ignore\n        ]\n\n        # strip the repo_id prefix to get each file's path relative to the repo root\n        file_list: List[str] = []\n        for file in files:\n            rel_path = Path(file).relative_to(repo_id)\n            file_list.append(str(rel_path))\n\n        matching_files = [file for file in file_list if fnmatch.fnmatch(file, filename)]  # type: ignore\n\n        if len(matching_files) == 0:\n            raise ValueError(\n                f\"No file found in {repo_id} that matches {filename}\\n\\n\"\n                f\"Available Files:\\n{json.dumps(file_list)}\"\n            )\n\n        if len(matching_files) > 1:\n            raise ValueError(\n                f\"Multiple files found in {repo_id} matching {filename}\\n\\n\"\n                f\"Available Files:\\n{json.dumps(file_list)}\"\n            )\n\n        (matching_file,) = matching_files\n\n        subfolder = str(Path(matching_file).parent)\n        filename = Path(matching_file).name\n\n        # download the file\n        hf_hub_download(\n            repo_id=repo_id,\n            filename=filename,\n            subfolder=subfolder,\n            local_dir=cast(Union[str, Path, None], 
local_dir),\n            local_dir_use_symlinks=local_dir_use_symlinks,\n            cache_dir=cast(Union[str, Path, None], cache_dir),\n        )\n\n        if local_dir is None:\n            model_path = hf_hub_download(\n                repo_id=repo_id,\n                filename=filename,\n                subfolder=subfolder,\n                local_dir=local_dir,\n                local_dir_use_symlinks=local_dir_use_symlinks,\n                cache_dir=cast(Union[str, Path, None], cache_dir),\n                local_files_only=True,\n            )\n        else:\n            model_path = os.path.join(local_dir, filename)\n\n        return cls(\n            clip_model_path=model_path,\n            **kwargs,\n        )\n\n\nclass ObsidianChatHandler(Llava15ChatHandler):\n    # Prompt Format\n    # The model follows the ChatML format, but with ### as the separator\n\n    # <|im_start|>user\n    # What is this sign about?\\n<image>\n    # ###\n    # <|im_start|>assistant\n    # The sign is about bullying, and it is placed on a black background with a red background.\n    # ###\n\n    CHAT_FORMAT = (\n        \"{% for message in messages %}\"\n        # System message\n        \"{% if message.role == 'system' %}\"\n        \"<|im_start|>system\\n\"\n        \"{{ message.content }}\\n\"\n        \"###\\n\"\n        \"{% endif %}\"\n        # User message\n        \"{% if message.role == 'user' %}\"\n        \"<|im_start|>user\\n\"\n        \"{% if message.content is string %}\"\n        \"{{ message.content }}\"\n        \"{% endif %}\"\n        \"{% if message.content is iterable %}\"\n        \"{% for content in message.content %}\"\n        \"{% if content.type == 'image_url' and content.image_url is string %}\"\n        \"{{ content.image_url }}\"\n        \"{% endif %}\"\n        \"{% if content.type == 'image_url' and content.image_url is mapping %}\"\n        \"{{ content.image_url.url }}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"{% for 
content in message.content %}\"\n        \"{% if content.type == 'text' %}\"\n        \"{{ content.text }}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"{% endif %}\"\n        \"###\\n\"\n        \"{% endif %}\"\n        # Assistant message\n        \"{% if message.role == 'assistant' %}\"\n        \"<|im_start|>assistant\\n\"\n        \"{{ message.content }}\"\n        \"###\\n\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        # Generation prompt\n        \"{% if add_generation_prompt %}\"\n        \"<|im_start|>assistant\\n\"\n        \"{% endif %}\"\n    )\n\n\nclass MoondreamChatHandler(Llava15ChatHandler):\n    # Chat Format:\n    # f\"<image>\\n\\n{chat_history}Question: {question}\\n\\nAnswer:\"\n    CHAT_FORMAT = (\n        \"{% for message in messages %}\"\n        \"{% if message.role == 'user' %}\"\n        \"{% if message.content is iterable %}\"\n        # <image>\n        \"{% for content in message.content %}\"\n        \"{% if content.type == 'image_url' %}\"\n        \"{% if content.image_url is string %}\"\n        \"{{ content.image_url }}\\n\\n\"\n        \"{% endif %}\"\n        \"{% if content.image_url is mapping %}\"\n        \"{{ content.image_url.url }}\\n\\n\"\n        \"{% endif %}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        # Question:\n        \"{% for content in message.content %}\"\n        \"{% if content.type == 'text' %}\"\n        \"Question: {{ content.text }}\\n\\n\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"{% endif %}\"\n        # Question:\n        \"{% if message.content is string %}\"\n        \"Question: {{ message.content }}\\n\\n\"\n        \"{% endif %}\"\n        \"{% endif %}\"\n        # Answer:\n        \"{% if message.role == 'assistant' %}\"\n        \"Answer:{{ message.content }}\\n\\n\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        # Generation prompt\n        \"{% if add_generation_prompt %}\"\n        \"Answer:\"\n        \"{% 
endif %}\"\n    )\n\n\nclass Llava16ChatHandler(Llava15ChatHandler):\n    DEFAULT_SYSTEM_MESSAGE = \"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. \"\n\n    # Example prompt\n    # \"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\\nWhat is shown in this image? ASSISTANT:\"\n\n    CHAT_FORMAT = (\n        \"{% for message in messages %}\"\n        \"{% if message.role == 'system' %}\"\n        \"{{ message.content }}\"\n        \"{% endif %}\"\n        \"{% if message.role == 'user' %}\"\n        \"{% if message.content is iterable %}\"\n        # <image>\n        \"{% for content in message.content %}\"\n        \"{% if content.type == 'image_url' %}\"\n        \"{% if content.image_url is string %}\"\n        \"{{ content.image_url }}\\n\"\n        \"{% endif %}\"\n        \"{% if content.image_url is mapping %}\"\n        \"{{ content.image_url.url }}\\n\"\n        \"{% endif %}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        # Question:\n        \"{% for content in message.content %}\"\n        \"{% if content.type == 'text' %}\"\n        \"{{ content.text }}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"{% endif %}\"\n        # Question:\n        \"{% if message.content is string %}\"\n        \"{{ message.content }}\"\n        \"{% endif %}\"\n        \"{% endif %}\"\n        # Answer:\n        \"{% if message.role == 'assistant' %}\"\n        \"{{ message.content }}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        # Generation prompt\n        \"{% if add_generation_prompt %}\"\n        \"Answer:\"\n        \"{% endif %}\"\n    )\n\n\nclass NanoLlavaChatHandler(Llava15ChatHandler):\n    # Prompt Format\n    # The model follows the ChatML standard, but without \\n at the end of 
<|im_end|>:\n\n    # <|im_start|>system\n    # Answer the question<|im_end|><|im_start|>user\n    # <image>\n    # What is the picture about?<|im_end|><|im_start|>assistant\n    DEFAULT_SYSTEM_MESSAGE = \"Answer the question\"\n\n    CHAT_FORMAT = (\n        \"{% for message in messages %}\"\n        # System message\n        \"{% if message.role == 'system' %}\"\n        \"<|im_start|>system\\n\"\n        \"{{ message.content }}\"\n        \"<|im_end|>\"\n        \"{% endif %}\"\n        # User message\n        \"{% if message.role == 'user' %}\"\n        \"<|im_start|>user\\n\"\n        \"{% if message.content is string %}\"\n        \"{{ message.content }}\"\n        \"{% endif %}\"\n        \"{% if message.content is iterable %}\"\n        \"{% for content in message.content %}\"\n        \"{% if content.type == 'image_url' and content.image_url is string %}\"\n        \"{{ content.image_url }}\"\n        \"{% endif %}\"\n        \"{% if content.type == 'image_url' and content.image_url is mapping %}\"\n        \"{{ content.image_url.url }}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"{% for content in message.content %}\"\n        \"{% if content.type == 'text' %}\"\n        \"{{ content.text }}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"{% endif %}\"\n        \"<|im_end|>\"\n        \"{% endif %}\"\n        # Assistant message\n        \"{% if message.role == 'assistant' %}\"\n        \"<|im_start|>assistant\\n\"\n        \"{{ message.content }}\"\n        \"<|im_end|>\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        # Generation prompt\n        \"{% if add_generation_prompt %}\"\n        \"<|im_start|>assistant\\n\"\n        \"{% endif %}\"\n    )\n\n\nclass Llama3VisionAlphaChatHandler(Llava15ChatHandler):\n    # question = \"<image>\" + q\n\n    # prompt = f\"<|start_header_id|>user<|end_header_id|>\\n\\n{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\\n\"\n    DEFAULT_SYSTEM_MESSAGE 
= None\n\n    CHAT_FORMAT = (\n        \"{% for message in messages %}\"\n        \"<|start_header_id|>\"\n        \"{% if message.role == 'user' %}\"\n        \"user<|end_header_id|>\\n\\n\"\n        \"{% if message.content is iterable %}\"\n        # <image>\n        \"{% for content in message.content %}\"\n        \"{% if content.type == 'image_url' %}\"\n        \"{% if content.image_url is string %}\"\n        \"{{ content.image_url }}\"\n        \"{% endif %}\"\n        \"{% if content.image_url is mapping %}\"\n        \"{{ content.image_url.url }}\"\n        \"{% endif %}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        # Question:\n        \"{% for content in message.content %}\"\n        \"{% if content.type == 'text' %}\"\n        \"{{ content.text }}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"{% endif %}\"\n        # Question:\n        \"{% if message.content is string %}\"\n        \"{{ message.content }}\"\n        \"{% endif %}\"\n        \"{% endif %}\"\n        # Answer:\n        \"{% if message.role == 'assistant' %}\"\n        \"assistant<|end_header_id|>\\n\\n\"\n        \"{{ message.content }}\"\n        \"{% endif %}\"\n        \"<|eot_id|>\"\n        \"{% endfor %}\"\n        # Generation prompt\n        \"{% if add_generation_prompt %}\"\n        \"<|start_header_id|>assistant<|end_header_id|>\\n\\n\"\n        \"{% endif %}\"\n    )\n\n\n# alias\nLlama3VisionAlpha = Llama3VisionAlphaChatHandler\n\n\nclass MiniCPMv26ChatHandler(Llava15ChatHandler):\n    DEFAULT_SYSTEM_MESSAGE = \"You are a helpful assistant.\"\n\n    CHAT_FORMAT = (\n        \"{% for message in messages %}\"\n        \"{% if loop.first and messages[0]['role'] != 'system' %}\"\n        \"<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n\"\n        \"{% endif %}\"\n        \"<|im_start|>{{ message['role'] }}\\n\"\n        \"{% if message['content'] is iterable %}\"\n        \"{% for content in message['content'] %}\"\n        \"{% 
if content.type == 'image_url' %}\"\n        \"{% if content.image_url is string %}\"\n        \"{{ content.image_url }}\"\n        \"{% endif %}\"\n        \"{% if content.image_url is mapping %}\"\n        \"{{ content.image_url.url }}\"\n        \"{% endif %}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n\n        \"{% for content in message['content'] %}\"\n        \"{% if content.type == 'text' %}\"\n        \"{{ content.text }}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"{% endif %}\"\n        \"{% if message['content'] is string %}\"\n        \"{{ message['content'] }}\"\n        \"{% endif %}\"\n        \"<|im_end|>\\n\"\n        \"{% endfor %}\"\n        \"{% if add_generation_prompt %}\"\n        \"<|im_start|>assistant\\n\"\n        \"{% endif %}\"\n    )\n\n\nclass Qwen25VLChatHandler(Llava15ChatHandler):\n    DEFAULT_SYSTEM_MESSAGE = \"You are a helpful assistant.\"\n\n    CHAT_FORMAT = (\n        #\"{% set image_count = namespace(value=0) %}\"\n        #\"{% set video_count = namespace(value=0) %}\"\n        \"{% for message in messages %}\"\n        \"{% if loop.first and message['role'] != 'system' %}\"\n        \"<|im_start|>system\\n\"\n        \"{{ self.DEFAULT_SYSTEM_MESSAGE }}<|im_end|>\\n\"\n        \"{% endif %}\"\n        \"<|im_start|>{{ message['role'] }}\\n\"\n        \"{% if message['content'] is string %}\"\n        \"{{ message['content'] }}<|im_end|>\\n\"\n        \"{% else %}\"\n        \"{% for content in message['content'] %}\"\n        \"{% if content['type'] == 'image_url' %}\"\n        \"{% if content.image_url is string %}\"\n        \"{{ content.image_url }}\"\n        \"{% else %}\"\n        \"{{ content.image_url.url }}\"\n        \"{% endif %}\"\n        #\"{% set image_count.value = image_count.value + 1 %}\"\n        \"{% elif content['type'] == 'text' %}\"\n        \"{{ content['text'] }}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"<|im_end|>\\n\"\n        \"{% endif %}\"\n 
       \"{% endfor %}\"\n        \"<|im_start|>assistant\\n\"\n    )\n\n    def __call__(self, **kwargs):\n        llama = kwargs['llama']\n\n        # Clear state for multiple runs\n        llama.reset()\n        llama._ctx.kv_cache_clear()\n        llama.n_tokens = 0\n\n        if hasattr(llama, 'input_ids'):\n            llama.input_ids.fill(0)\n\n        # Clear any handler state\n        if hasattr(self, '_last_image_embed'):\n            self._last_image_embed = None\n            self._last_image_hash = None\n\n        if self.verbose:\n            messages = kwargs.get('messages', [])\n            image_count = len(self.get_image_urls(messages))\n            print(f\"Minimal - Cleared state, processing {image_count} images\", file=sys.stderr)\n\n        # Use parent implementation\n        return super().__call__(**kwargs)\n\n\n@register_chat_completion_handler(\"chatml-function-calling\")\ndef chatml_function_calling(\n    llama: llama.Llama,\n    messages: List[llama_types.ChatCompletionRequestMessage],\n    functions: Optional[List[llama_types.ChatCompletionFunction]] = None,\n    function_call: Optional[llama_types.ChatCompletionRequestFunctionCall] = None,\n    tools: Optional[List[llama_types.ChatCompletionTool]] = None,\n    tool_choice: Optional[llama_types.ChatCompletionToolChoiceOption] = None,\n    temperature: float = 0.2,\n    top_p: float = 0.95,\n    top_k: int = 40,\n    min_p: float = 0.05,\n    typical_p: float = 1.0,\n    stream: bool = False,\n    stop: Optional[Union[str, List[str]]] = [],\n    response_format: Optional[llama_types.ChatCompletionRequestResponseFormat] = None,\n    max_tokens: Optional[int] = None,\n    presence_penalty: float = 0.0,\n    frequency_penalty: float = 0.0,\n    repeat_penalty: float = 1.1,\n    tfs_z: float = 1.0,\n    mirostat_mode: int = 0,\n    mirostat_tau: float = 5.0,\n    mirostat_eta: float = 0.1,\n    model: Optional[str] = None,\n    logits_processor: Optional[llama.LogitsProcessorList] = None,\n   
 grammar: Optional[llama.LlamaGrammar] = None,\n    logprobs: Optional[bool] = None,\n    top_logprobs: Optional[int] = None,\n    **kwargs,  # type: ignore\n) -> Union[\n    llama_types.CreateChatCompletionResponse,\n    Iterator[llama_types.CreateChatCompletionStreamResponse],\n]:\n    function_calling_template = (\n        \"{% for message in messages %}\"\n        \"<|im_start|>{{ message.role }}\\n\"\n        # System message\n        \"{% if message.role == 'system' %}\"\n        \"{{ message.content }}\"\n        \"{% if tool_calls %}\"\n        \"\\n\\nYou have access to the following functions:\\n\"\n        \"{% for tool in tools %}\"\n        \"\\nfunctions.{{ tool.function.name }}:\\n\"\n        \"{{ tool.function.parameters | tojson }}\"\n        \"\\n{% endfor %}\"\n        \"\\n\\nYou can respond to users messages with either a single message or one or more function calls.\"\n        \"\\n\\nTo respond with a message begin the message with 'message:', use the following format:\"\n        \"\\n\\nmessage:\"\n        \"\\n<message>\"\n        \"\\n\\nTo respond with one or more function calls begin the message with 'functions.<function_name>:', use the following format:\"\n        \"\\n\\nfunctions.<function_name>:\"\n        '\\n{ \"arg1\": \"value1\", \"arg2\": \"value2\" }'\n        \"\\nfunctions.<function_name>:\"\n        '\\n{ \"arg1\": \"value1\", \"arg2\": \"value2\" }'\n        \"{% endif %}\"\n        \"<|im_end|>\\n\"\n        \"{% endif %}\"\n        # User message\n        \"{% if message.role == 'user' %}\"\n        \"{{ message.content }}\"\n        \"<|im_end|>\\n\"\n        \"{% endif %}\"\n        # Assistant message\n        \"{% if message.role == 'assistant' %}\"\n        ## Regular message\n        \"{% if message.content and message.content | length > 0 %}\"\n        \"{% if tool_calls %}\"\n        \"message:\\n\"\n        \"{% endif %}\"\n        \"{{ message.content }}\"\n        \"<|im_end|>\\n\"\n        \"{% endif %}\"\n    
    ## Function calls\n        \"{% if 'tool_calls' in message %}\"\n        \"{% for tool_call in message.tool_calls %}\"\n        \"functions.{{ tool_call.function.name }}:\\n\"\n        \"{{ tool_call.function.arguments }}\"\n        \"{% endfor %}\"\n        \"<|im_end|>\\n\"\n        \"{% endif %}\"\n        \"{% endif %}\"\n        \"{% endfor %}\"\n        \"{% if add_generation_prompt %}<|im_start|>assistant\\n{% endif %}\"\n    )\n    template_renderer = ImmutableSandboxedEnvironment(\n        autoescape=jinja2.select_autoescape([\"html\", \"xml\"]),\n        undefined=jinja2.StrictUndefined,\n    ).from_string(function_calling_template)\n\n    # Convert legacy functions to tools\n    if functions is not None:\n        tools = [\n            {\n                \"type\": \"function\",\n                \"function\": function,\n            }\n            for function in functions\n        ]\n\n    # Convert legacy function_call to tool_choice\n    if function_call is not None:\n        if isinstance(function_call, str) and (\n            function_call == \"none\" or function_call == \"auto\"\n        ):\n            tool_choice = function_call\n        if isinstance(function_call, dict) and \"name\" in function_call:\n            tool_choice = {\n                \"type\": \"function\",\n                \"function\": {\n                    \"name\": function_call[\"name\"],\n                },\n            }\n\n    stop = (\n        [stop, \"<|im_end|>\"]\n        if isinstance(stop, str)\n        else stop + [\"<|im_end|>\"] if stop else [\"<|im_end|>\"]\n    )\n\n    # Case 1: No tool choice by user\n    if (\n        tool_choice is None\n        or (isinstance(tool_choice, str) and tool_choice == \"none\")\n        or tools is None\n        or len(tools) == 0\n    ):\n        prompt = template_renderer.render(\n            messages=messages,\n            tools=[],\n            tool_calls=None,\n            add_generation_prompt=True,\n        )\n\n        
if response_format is not None and response_format[\"type\"] == \"json_object\":\n            grammar = _grammar_for_response_format(response_format)\n\n        return _convert_completion_to_chat(\n            llama.create_completion(\n                prompt=prompt,\n                temperature=temperature,\n                top_p=top_p,\n                top_k=top_k,\n                min_p=min_p,\n                typical_p=typical_p,\n                stream=stream,\n                stop=stop,\n                max_tokens=max_tokens,\n                presence_penalty=presence_penalty,\n                frequency_penalty=frequency_penalty,\n                repeat_penalty=repeat_penalty,\n                tfs_z=tfs_z,\n                mirostat_mode=mirostat_mode,\n                mirostat_tau=mirostat_tau,\n                mirostat_eta=mirostat_eta,\n                model=model,\n                logits_processor=logits_processor,\n                grammar=grammar,\n                logprobs=top_logprobs if logprobs else None,\n            ),\n            stream=stream,\n        )\n\n    # Case 2: Tool choice by user\n    if isinstance(tool_choice, dict):\n        tool_name = tool_choice[\"function\"][\"name\"]\n        tool = next(\n            (tool for tool in tools if tool[\"function\"][\"name\"] == tool_name), None\n        )\n        if tool is None:\n            raise ValueError(f\"Tool with name '{tool_name}' not found in tools\")\n        prompt = template_renderer.render(\n            messages=messages,\n            tools=tools,\n            tool_calls=True,\n            add_generation_prompt=True,\n        )\n        prompt += f\"functions.{tool_name}:\\n\"\n        try:\n            grammar = llama_grammar.LlamaGrammar.from_json_schema(\n                json.dumps(tool[\"function\"][\"parameters\"]), verbose=llama.verbose\n            )\n        except Exception as e:\n            grammar = llama_grammar.LlamaGrammar.from_string(\n                
llama_grammar.JSON_GBNF, verbose=llama.verbose\n            )\n            if llama.verbose:\n                print(\n                    \"Failed to parse function body as JSON schema, falling back to default grammar\"\n                )\n                print(e)\n        completion_or_chunks = llama.create_completion(\n            prompt=prompt,\n            temperature=temperature,\n            top_p=top_p,\n            top_k=top_k,\n            min_p=min_p,\n            typical_p=typical_p,\n            stream=stream,\n            stop=stop,\n            max_tokens=max_tokens,\n            presence_penalty=presence_penalty,\n            frequency_penalty=frequency_penalty,\n            repeat_penalty=repeat_penalty,\n            tfs_z=tfs_z,\n            mirostat_mode=mirostat_mode,\n            mirostat_tau=mirostat_tau,\n            mirostat_eta=mirostat_eta,\n            model=model,\n            logits_processor=logits_processor,\n            grammar=grammar,\n        )\n        return _convert_completion_to_chat_function(\n            tool_name, completion_or_chunks, stream\n        )\n\n    # Case 3: Automatic tool choice\n    assert isinstance(tool_choice, str) and tool_choice == \"auto\"\n    function_names = \" | \".join(\n        [f'''\"functions.{tool['function']['name']}:\"''' for tool in tools]\n    )\n    initial_gbnf_tool_grammar = (\n        \"\"\"root   ::= functions | \"message:\"\\n\"\"\"\n        f\"\"\"functions ::= {function_names}\\n\"\"\"\n    )\n    follow_up_gbnf_tool_grammar = (\n        \"\"\"root   ::= functions | \"<|im_end|>\"\\n\"\"\"\n        f\"\"\"functions ::= {function_names}\\n\"\"\"\n    )\n    prompt = template_renderer.render(\n        messages=messages,\n        tools=tools,\n        tool_calls=True,\n        add_generation_prompt=True,\n    )\n    completion_or_chunks = llama.create_completion(\n        prompt=prompt,\n        temperature=0,\n        top_p=top_p,\n        top_k=top_k,\n        min_p=min_p,\n        
typical_p=typical_p,\n        stream=False,\n        stop=[\":\"],\n        max_tokens=None,\n        presence_penalty=presence_penalty,\n        frequency_penalty=frequency_penalty,\n        repeat_penalty=repeat_penalty,\n        tfs_z=tfs_z,\n        mirostat_mode=mirostat_mode,\n        mirostat_tau=mirostat_tau,\n        mirostat_eta=mirostat_eta,\n        model=model,\n        logits_processor=logits_processor,\n        grammar=llama_grammar.LlamaGrammar.from_string(\n            initial_gbnf_tool_grammar, verbose=llama.verbose\n        ),\n    )\n    completion: llama_types.CreateCompletionResponse = completion_or_chunks  # type: ignore\n    text = completion[\"choices\"][0][\"text\"]\n    if \"message\" in text:\n        return _convert_completion_to_chat(\n            llama.create_completion(\n                prompt=prompt + \"message:\\n\",\n                temperature=temperature,\n                top_p=top_p,\n                top_k=top_k,\n                min_p=min_p,\n                typical_p=typical_p,\n                stream=stream,\n                stop=[\"<|im_end|>\"],\n                logprobs=top_logprobs if logprobs else None,\n                max_tokens=None,\n                presence_penalty=presence_penalty,\n                frequency_penalty=frequency_penalty,\n                repeat_penalty=repeat_penalty,\n                tfs_z=tfs_z,\n                mirostat_mode=mirostat_mode,\n                mirostat_tau=mirostat_tau,\n                mirostat_eta=mirostat_eta,\n                model=model,\n                logits_processor=logits_processor,\n                grammar=llama_grammar.LlamaGrammar.from_string(\n                    follow_up_gbnf_tool_grammar, verbose=llama.verbose\n                ),\n            ),\n            stream=stream,\n        )\n\n    # One or more function calls\n    tool_name = text[len(\"functions.\") :]\n    tool = next((tool for tool in tools if tool[\"function\"][\"name\"] == tool_name), None)\n    if not 
stream:\n        completions: List[llama_types.CreateCompletionResponse] = []\n        completions_tool_name: List[str] = []\n        while tool is not None:\n            prompt += f\"functions.{tool_name}:\\n\"\n            try:\n                grammar = llama_grammar.LlamaGrammar.from_json_schema(\n                    json.dumps(tool[\"function\"][\"parameters\"]), verbose=llama.verbose\n                )\n            except Exception as e:\n                grammar = llama_grammar.LlamaGrammar.from_string(\n                    llama_grammar.JSON_GBNF, verbose=llama.verbose\n                )\n                if llama.verbose:\n                    print(\n                        \"Failed to parse function body as JSON schema, falling back to default grammar\"\n                    )\n                    print(e)\n            completion_or_chunks = llama.create_completion(\n                prompt=prompt,\n                temperature=temperature,\n                top_p=top_p,\n                top_k=top_k,\n                min_p=min_p,\n                typical_p=typical_p,\n                stream=False,\n                stop=stop,\n                max_tokens=None,\n                presence_penalty=presence_penalty,\n                frequency_penalty=frequency_penalty,\n                repeat_penalty=repeat_penalty,\n                tfs_z=tfs_z,\n                mirostat_mode=mirostat_mode,\n                mirostat_tau=mirostat_tau,\n                mirostat_eta=mirostat_eta,\n                model=model,\n                logits_processor=logits_processor,\n                grammar=grammar,\n            )\n            completion_or_chunks = cast(\n                llama_types.CreateCompletionResponse, completion_or_chunks\n            )\n            completions.append(completion_or_chunks)\n            completions_tool_name.append(tool_name)\n            prompt += completion_or_chunks[\"choices\"][0][\"text\"]\n            prompt += \"\\n\"\n\n            response = 
llama.create_completion(\n                prompt=prompt,\n                temperature=temperature,\n                top_p=top_p,\n                top_k=top_k,\n                min_p=min_p,\n                typical_p=typical_p,\n                stream=False,\n                stop=stop,\n                max_tokens=None,\n                presence_penalty=presence_penalty,\n                frequency_penalty=frequency_penalty,\n                repeat_penalty=repeat_penalty,\n                tfs_z=tfs_z,\n                mirostat_mode=mirostat_mode,\n                mirostat_tau=mirostat_tau,\n                mirostat_eta=mirostat_eta,\n                model=model,\n                logits_processor=logits_processor,\n                grammar=llama_grammar.LlamaGrammar.from_string(\n                    follow_up_gbnf_tool_grammar, verbose=llama.verbose\n                ),\n            )\n            response = cast(llama_types.CreateCompletionResponse, response)\n\n            tool_name = response[\"choices\"][0][\"text\"][len(\"functions.\") :]\n            tool = next(\n                (tool for tool in tools if tool[\"function\"][\"name\"] == tool_name), None\n            )\n\n        # Merge completions\n        function_call_dict: Union[\n            Dict[str, str],\n            Dict[\n                Literal[\"function_call\"],\n                llama_types.ChatCompletionRequestAssistantMessageFunctionCall,\n            ],\n        ] = (\n            {\n                \"function_call\": {\n                    \"name\": tool_name,\n                    \"arguments\": completions[0][\"choices\"][0][\"text\"],\n                }\n            }\n            if len(completions) == 1\n            else {}\n        )\n        return {\n            \"id\": \"chat\" + completion[\"id\"],\n            \"object\": \"chat.completion\",\n            \"created\": completion[\"created\"],\n            \"model\": completion[\"model\"],\n            \"choices\": [\n                {\n  
                  \"finish_reason\": \"tool_calls\",\n                    \"index\": 0,\n                    \"logprobs\": _convert_text_completion_logprobs_to_chat(completion[\"choices\"][0][\"logprobs\"]),\n                    \"message\": {\n                        \"role\": \"assistant\",\n                        \"content\": None,\n                        \"tool_calls\": [\n                            {\n                                \"id\": \"call_\"\n                                + f\"_{i}_\"\n                                + tool_name\n                                + \"_\"\n                                + completion[\"id\"],\n                                \"type\": \"function\",\n                                \"function\": {\n                                    \"name\": tool_name,\n                                    \"arguments\": completion[\"choices\"][0][\"text\"],\n                                },\n                            }\n                            for i, (tool_name, completion) in enumerate(\n                                zip(completions_tool_name, completions)\n                            )\n                        ],\n                        **function_call_dict,\n                    },\n                }\n            ],\n            \"usage\": {\n                \"completion_tokens\": sum(\n                    (\n                        completion[\"usage\"][\"completion_tokens\"]\n                        if \"usage\" in completion\n                        else 0\n                    )\n                    for completion in completions\n                ),\n                \"prompt_tokens\": sum(\n                    completion[\"usage\"][\"prompt_tokens\"] if \"usage\" in completion else 0\n                    for completion in completions\n                ),\n                \"total_tokens\": sum(\n                    completion[\"usage\"][\"total_tokens\"] if \"usage\" in completion else 0\n                    for 
completion in completions\n                ),\n            },\n        }\n\n    raise ValueError(\"Automatic streaming tool choice is not supported\")\n"
  },
  {
    "path": "llama_cpp/llama_cpp.py",
    "content": "from __future__ import annotations\n\nimport os\nimport ctypes\nimport pathlib\n\nfrom typing import (\n    Callable,\n    Union,\n    NewType,\n    Optional,\n    TYPE_CHECKING,\n)\n\nfrom llama_cpp._ctypes_extensions import (\n    load_shared_library,\n    byref,\n    ctypes_function_for_shared_library,\n)\n\nif TYPE_CHECKING:\n    from llama_cpp._ctypes_extensions import (\n        CtypesCData,\n        CtypesArray,\n        CtypesPointer,\n        CtypesVoidPointer,\n        CtypesRef,\n        CtypesPointerOrRef,\n        CtypesFuncPointer,\n    )\n\n\n# Specify the base name of the shared library to load\n_lib_base_name = \"llama\"\n_override_base_path = os.environ.get(\"LLAMA_CPP_LIB_PATH\")\n_base_path = pathlib.Path(os.path.abspath(os.path.dirname(__file__))) / \"lib\" if _override_base_path is None else pathlib.Path(_override_base_path)\n# Load the library\n_lib = load_shared_library(_lib_base_name, _base_path)\n\nctypes_function = ctypes_function_for_shared_library(_lib)\n\n\n# from ggml.h\n# // NOTE: always add types at the end of the enum to keep backward compatibility\n# enum ggml_type {\n#     GGML_TYPE_F32     = 0,\n#     GGML_TYPE_F16     = 1,\n#     GGML_TYPE_Q4_0    = 2,\n#     GGML_TYPE_Q4_1    = 3,\n#     // GGML_TYPE_Q4_2 = 4, support has been removed\n#     // GGML_TYPE_Q4_3 = 5, support has been removed\n#     GGML_TYPE_Q5_0    = 6,\n#     GGML_TYPE_Q5_1    = 7,\n#     GGML_TYPE_Q8_0    = 8,\n#     GGML_TYPE_Q8_1    = 9,\n#     GGML_TYPE_Q2_K    = 10,\n#     GGML_TYPE_Q3_K    = 11,\n#     GGML_TYPE_Q4_K    = 12,\n#     GGML_TYPE_Q5_K    = 13,\n#     GGML_TYPE_Q6_K    = 14,\n#     GGML_TYPE_Q8_K    = 15,\n#     GGML_TYPE_IQ2_XXS = 16,\n#     GGML_TYPE_IQ2_XS  = 17,\n#     GGML_TYPE_IQ3_XXS = 18,\n#     GGML_TYPE_IQ1_S   = 19,\n#     GGML_TYPE_IQ4_NL  = 20,\n#     GGML_TYPE_IQ3_S   = 21,\n#     GGML_TYPE_IQ2_S   = 22,\n#     GGML_TYPE_IQ4_XS  = 23,\n#     GGML_TYPE_I8      = 24,\n#     GGML_TYPE_I16     = 25,\n#     
GGML_TYPE_I32     = 26,\n#     GGML_TYPE_I64     = 27,\n#     GGML_TYPE_F64     = 28,\n#     GGML_TYPE_IQ1_M   = 29,\n#     GGML_TYPE_COUNT,\n# };\nGGML_TYPE_F32 = 0\nGGML_TYPE_F16 = 1\nGGML_TYPE_Q4_0 = 2\nGGML_TYPE_Q4_1 = 3\nGGML_TYPE_Q5_0 = 6\nGGML_TYPE_Q5_1 = 7\nGGML_TYPE_Q8_0 = 8\nGGML_TYPE_Q8_1 = 9\nGGML_TYPE_Q2_K = 10\nGGML_TYPE_Q3_K = 11\nGGML_TYPE_Q4_K = 12\nGGML_TYPE_Q5_K = 13\nGGML_TYPE_Q6_K = 14\nGGML_TYPE_Q8_K = 15\nGGML_TYPE_IQ2_XXS = 16\nGGML_TYPE_IQ2_XS = 17\nGGML_TYPE_IQ3_XXS = 18\nGGML_TYPE_IQ1_S = 19\nGGML_TYPE_IQ4_NL = 20\nGGML_TYPE_IQ3_S = 21\nGGML_TYPE_IQ2_S = 22\nGGML_TYPE_IQ4_XS = 23\nGGML_TYPE_I8 = 24\nGGML_TYPE_I16 = 25\nGGML_TYPE_I32 = 26\nGGML_TYPE_I64 = 27\nGGML_TYPE_F64 = 28\nGGML_TYPE_IQ1_M = 29\nGGML_TYPE_COUNT = 30\n\n# from ggml-backend.h\n# typedef bool (*ggml_backend_sched_eval_callback)(struct ggml_tensor * t, bool ask, void * user_data);\nggml_backend_sched_eval_callback = ctypes.CFUNCTYPE(\n    ctypes.c_bool, ctypes.c_void_p, ctypes.c_bool, ctypes.c_void_p\n)\n\n# // Abort callback\n# // If not NULL, called before ggml computation\n# // If it returns true, the computation is aborted\n# typedef bool (*ggml_abort_callback)(void * data);\nggml_abort_callback = ctypes.CFUNCTYPE(ctypes.c_bool, ctypes.c_void_p)\n\n# llama.h bindings\n\n_lib.llama_max_devices.argtypes = []\n_lib.llama_max_devices.restype = ctypes.c_size_t\n\nLLAMA_MAX_DEVICES = _lib.llama_max_devices()\n\n# define LLAMA_DEFAULT_SEED 0xFFFFFFFF\nLLAMA_DEFAULT_SEED = 0xFFFFFFFF\n\n# define LLAMA_TOKEN_NULL -1\nLLAMA_TOKEN_NULL = -1\n\n# define LLAMA_FILE_MAGIC_GGLA 0x67676c61u // 'ggla'\nLLAMA_FILE_MAGIC_GGLA = 0x67676C61\n\n# define LLAMA_FILE_MAGIC_GGSN 0x6767736eu // 'ggsn'\nLLAMA_FILE_MAGIC_GGSN = 0x6767736E\n\n# define LLAMA_FILE_MAGIC_GGSQ 0x67677371u // 'ggsq'\nLLAMA_FILE_MAGIC_GGSQ = 0x67677371\n\n# define LLAMA_SESSION_MAGIC   LLAMA_FILE_MAGIC_GGSN\nLLAMA_SESSION_MAGIC = LLAMA_FILE_MAGIC_GGSN\n# define LLAMA_SESSION_VERSION 9\nLLAMA_SESSION_VERSION = 9\n\n# 
define LLAMA_STATE_SEQ_MAGIC   LLAMA_FILE_MAGIC_GGSQ\nLLAMA_STATE_SEQ_MAGIC = LLAMA_FILE_MAGIC_GGSQ\n# define LLAMA_STATE_SEQ_VERSION 2\nLLAMA_STATE_SEQ_VERSION = 2\n\n# struct llama_vocab;\nllama_vocab_p = NewType(\"llama_vocab_p\", int)\nllama_vocab_p_ctypes = ctypes.c_void_p\n\n# struct llama_model;\nllama_model_p = NewType(\"llama_model_p\", int)\nllama_model_p_ctypes = ctypes.c_void_p\n\n# struct llama_context;\nllama_context_p = NewType(\"llama_context_p\", int)\nllama_context_p_ctypes = ctypes.c_void_p\n\n# typedef struct llama_memory_i * llama_memory_t;\nllama_memory_t = NewType(\"llama_memory_t\", int)\nllama_memory_t_ctypes = ctypes.c_void_p\n\n# struct llama_kv_cache; (DEPRECATED)\nllama_kv_cache_p = NewType(\"llama_kv_cache_p\", int)\nllama_kv_cache_p_ctypes = ctypes.c_void_p\n\n# typedef int32_t llama_pos;\nllama_pos = ctypes.c_int32\n# typedef int32_t llama_token;\nllama_token = ctypes.c_int32\nllama_token_p = ctypes.POINTER(llama_token)\n# typedef int32_t llama_seq_id;\nllama_seq_id = ctypes.c_int32\n\n\n# enum llama_vocab_type {\n#     LLAMA_VOCAB_TYPE_NONE   = 0, // For models without vocab\n#     LLAMA_VOCAB_TYPE_SPM    = 1, // LLaMA tokenizer based on byte-level BPE with byte fallback\n#     LLAMA_VOCAB_TYPE_BPE    = 2, // GPT-2 tokenizer based on byte-level BPE\n#     LLAMA_VOCAB_TYPE_WPM    = 3, // BERT tokenizer based on WordPiece\n#     LLAMA_VOCAB_TYPE_UGM    = 4, // T5 tokenizer based on Unigram\n#     LLAMA_VOCAB_TYPE_RWKV   = 5, // RWKV tokenizer based on greedy tokenization\n#     LLAMA_VOCAB_TYPE_PLAMO2 = 6, // PLaMo-2 tokenizer based on Aho-Corasick with dynamic programming\n# };\nLLAMA_VOCAB_TYPE_NONE = 0\n\"\"\"For models without vocab\"\"\"\nLLAMA_VOCAB_TYPE_SPM = 1\n\"\"\"LLaMA tokenizer based on byte-level BPE with byte fallback\"\"\"\nLLAMA_VOCAB_TYPE_BPE = 2\n\"\"\"GPT-2 tokenizer based on byte-level BPE\"\"\"\nLLAMA_VOCAB_TYPE_WPM = 3\n\"\"\"BERT tokenizer based on WordPiece\"\"\"\nLLAMA_VOCAB_TYPE_UGM = 4\n\"\"\"T5 tokenizer 
based on Unigram\"\"\"\nLLAMA_VOCAB_TYPE_RWKV = 5\n\"\"\"RWKV tokenizer based on greedy tokenization\"\"\"\nLLAMA_VOCAB_TYPE_PLAMO2 = 6\n\"\"\"PLaMo-2 tokenizer based on Aho-Corasick with dynamic programming\"\"\"\n\n\n# NOTE: Deprecated and will be removed in the future. (already gone in llama.cpp)\n# // pre-tokenization types\n# enum llama_vocab_pre_type {\n#     LLAMA_VOCAB_PRE_TYPE_DEFAULT        = 0,\n#     LLAMA_VOCAB_PRE_TYPE_LLAMA3         = 1,\n#     LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM   = 2,\n#     LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_CODER = 3,\n#     LLAMA_VOCAB_PRE_TYPE_FALCON         = 4,\n#     LLAMA_VOCAB_PRE_TYPE_MPT            = 5,\n#     LLAMA_VOCAB_PRE_TYPE_STARCODER      = 6,\n#     LLAMA_VOCAB_PRE_TYPE_GPT2           = 7,\n#     LLAMA_VOCAB_PRE_TYPE_REFACT         = 8,\n#     LLAMA_VOCAB_PRE_TYPE_COMMAND_R      = 9,\n#     LLAMA_VOCAB_PRE_TYPE_STABLELM2      = 10,\n#     LLAMA_VOCAB_PRE_TYPE_QWEN2          = 11,\n#     LLAMA_VOCAB_PRE_TYPE_OLMO           = 12,\n#     LLAMA_VOCAB_PRE_TYPE_DBRX           = 13,\n#     LLAMA_VOCAB_PRE_TYPE_SMAUG          = 14,\n#     LLAMA_VOCAB_PRE_TYPE_PORO           = 15,\n#     LLAMA_VOCAB_PRE_TYPE_CHATGLM3       = 16,\n#     LLAMA_VOCAB_PRE_TYPE_CHATGLM4       = 17,\n#     LLAMA_VOCAB_PRE_TYPE_VIKING         = 18,\n#     LLAMA_VOCAB_PRE_TYPE_JAIS           = 19,\n#     LLAMA_VOCAB_PRE_TYPE_TEKKEN         = 20,\n#     LLAMA_VOCAB_PRE_TYPE_SMOLLM         = 21,\n#     LLAMA_VOCAB_PRE_TYPE_CODESHELL      = 22,\n#     LLAMA_VOCAB_PRE_TYPE_BLOOM          = 23,\n#     LLAMA_VOCAB_PRE_TYPE_GPT3_FINNISH   = 24,\n#     LLAMA_VOCAB_PRE_TYPE_EXAONE         = 25,\n#     LLAMA_VOCAB_PRE_TYPE_CHAMELEON      = 26,\n#     LLAMA_VOCAB_PRE_TYPE_MINERVA        = 27,\n#     LLAMA_VOCAB_PRE_TYPE_DEEPSEEK3_LLM  = 28,\n#     LLAMA_VOCAB_PRE_TYPE_GPT4O          = 29,\n#     LLAMA_VOCAB_PRE_TYPE_SUPERBPE       = 30,\n#     LLAMA_VOCAB_PRE_TYPE_TRILLION       = 31,\n#     LLAMA_VOCAB_PRE_TYPE_BAILINGMOE     = 32,\n#     
LLAMA_VOCAB_PRE_TYPE_LLAMA4         = 33,\n#     LLAMA_VOCAB_PRE_TYPE_PIXTRAL        = 34,\n#     LLAMA_VOCAB_PRE_TYPE_SEED_CODER     = 35,\n# };\nLLAMA_VOCAB_PRE_TYPE_DEFAULT = 0\nLLAMA_VOCAB_PRE_TYPE_LLAMA3 = 1\nLLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM = 2\nLLAMA_VOCAB_PRE_TYPE_DEEPSEEK_CODER = 3\nLLAMA_VOCAB_PRE_TYPE_FALCON = 4\nLLAMA_VOCAB_PRE_TYPE_MPT = 5\nLLAMA_VOCAB_PRE_TYPE_STARCODER = 6\nLLAMA_VOCAB_PRE_TYPE_GPT2 = 7\nLLAMA_VOCAB_PRE_TYPE_REFACT = 8\nLLAMA_VOCAB_PRE_TYPE_COMMAND_R = 9\nLLAMA_VOCAB_PRE_TYPE_STABLELM2 = 10\nLLAMA_VOCAB_PRE_TYPE_QWEN2 = 11\nLLAMA_VOCAB_PRE_TYPE_OLMO = 12\nLLAMA_VOCAB_PRE_TYPE_DBRX = 13\nLLAMA_VOCAB_PRE_TYPE_SMAUG = 14\nLLAMA_VOCAB_PRE_TYPE_PORO = 15\nLLAMA_VOCAB_PRE_TYPE_CHATGLM3 = 16\nLLAMA_VOCAB_PRE_TYPE_CHATGLM4 = 17\nLLAMA_VOCAB_PRE_TYPE_VIKING = 18\nLLAMA_VOCAB_PRE_TYPE_JAIS = 19\nLLAMA_VOCAB_PRE_TYPE_TEKKEN = 20\nLLAMA_VOCAB_PRE_TYPE_SMOLLM = 21\nLLAMA_VOCAB_PRE_TYPE_CODESHELL = 22\nLLAMA_VOCAB_PRE_TYPE_BLOOM = 23\nLLAMA_VOCAB_PRE_TYPE_GPT3_FINNISH = 24\nLLAMA_VOCAB_PRE_TYPE_EXAONE = 25\nLLAMA_VOCAB_PRE_TYPE_CHAMELEON = 26\nLLAMA_VOCAB_PRE_TYPE_MINERVA = 27\nLLAMA_VOCAB_PRE_TYPE_DEEPSEEK3_LLM = 28\nLLAMA_VOCAB_PRE_TYPE_GPT4O = 29\nLLAMA_VOCAB_PRE_TYPE_SUPERBPE = 30\nLLAMA_VOCAB_PRE_TYPE_TRILLION = 31\nLLAMA_VOCAB_PRE_TYPE_BAILINGMOE = 32\nLLAMA_VOCAB_PRE_TYPE_LLAMA4 = 33\nLLAMA_VOCAB_PRE_TYPE_PIXTRAL = 34\nLLAMA_VOCAB_PRE_TYPE_SEED_CODER = 35\n\n\n# // note: these values should be synchronized with ggml_rope\n# // TODO: maybe move this enum to ggml.h (ggml_rope_type)\n# enum llama_rope_type {\n#     LLAMA_ROPE_TYPE_NONE   = -1,\n#     LLAMA_ROPE_TYPE_NORM   = 0,\n#     LLAMA_ROPE_TYPE_NEOX   = GGML_ROPE_TYPE_NEOX,\n#     LLAMA_ROPE_TYPE_MROPE  = GGML_ROPE_TYPE_MROPE,\n#     LLAMA_ROPE_TYPE_VISION = GGML_ROPE_TYPE_VISION,\n# };\nLLAMA_ROPE_TYPE_NONE = -1\nLLAMA_ROPE_TYPE_NORM = 0\nLLAMA_ROPE_TYPE_NEOX = GGML_ROPE_TYPE_NEOX = 2\nLLAMA_ROPE_TYPE_MROPE = GGML_ROPE_TYPE_MROPE = 8\nLLAMA_ROPE_TYPE_VISION = GGML_ROPE_TYPE_VISION = 
24\n\n\n# enum llama_token_type { //TODO: remove, required until per token attributes are available from GGUF file\n#     LLAMA_TOKEN_TYPE_UNDEFINED    = 0,\n#     LLAMA_TOKEN_TYPE_NORMAL       = 1,\n#     LLAMA_TOKEN_TYPE_UNKNOWN      = 2,\n#     LLAMA_TOKEN_TYPE_CONTROL      = 3,\n#     LLAMA_TOKEN_TYPE_USER_DEFINED = 4,\n#     LLAMA_TOKEN_TYPE_UNUSED       = 5,\n#     LLAMA_TOKEN_TYPE_BYTE         = 6,\n# };\nLLAMA_TOKEN_TYPE_UNDEFINED = 0\nLLAMA_TOKEN_TYPE_NORMAL = 1\nLLAMA_TOKEN_TYPE_UNKNOWN = 2\nLLAMA_TOKEN_TYPE_CONTROL = 3\nLLAMA_TOKEN_TYPE_USER_DEFINED = 4\nLLAMA_TOKEN_TYPE_UNUSED = 5\nLLAMA_TOKEN_TYPE_BYTE = 6\n\n\n# enum llama_token_attr {\n#     LLAMA_TOKEN_ATTR_UNDEFINED    = 0,\n#     LLAMA_TOKEN_ATTR_UNKNOWN      = 1 << 0,\n#     LLAMA_TOKEN_ATTR_UNUSED       = 1 << 1,\n#     LLAMA_TOKEN_ATTR_NORMAL       = 1 << 2,\n#     LLAMA_TOKEN_ATTR_CONTROL      = 1 << 3,  // SPECIAL?\n#     LLAMA_TOKEN_ATTR_USER_DEFINED = 1 << 4,\n#     LLAMA_TOKEN_ATTR_BYTE         = 1 << 5,\n#     LLAMA_TOKEN_ATTR_NORMALIZED   = 1 << 6,\n#     LLAMA_TOKEN_ATTR_LSTRIP       = 1 << 7,\n#     LLAMA_TOKEN_ATTR_RSTRIP       = 1 << 8,\n#     LLAMA_TOKEN_ATTR_SINGLE_WORD  = 1 << 9,\n# };\nLLAMA_TOKEN_ATTR_UNDEFINED = 0\nLLAMA_TOKEN_ATTR_UNKNOWN = 1 << 0\nLLAMA_TOKEN_ATTR_UNUSED = 1 << 1\nLLAMA_TOKEN_ATTR_NORMAL = 1 << 2\nLLAMA_TOKEN_ATTR_CONTROL = 1 << 3\nLLAMA_TOKEN_ATTR_USER_DEFINED = 1 << 4\nLLAMA_TOKEN_ATTR_BYTE = 1 << 5\nLLAMA_TOKEN_ATTR_NORMALIZED = 1 << 6\nLLAMA_TOKEN_ATTR_LSTRIP = 1 << 7\nLLAMA_TOKEN_ATTR_RSTRIP = 1 << 8\nLLAMA_TOKEN_ATTR_SINGLE_WORD = 1 << 9\n\n\n# // model file types\n# enum llama_ftype {\n#     LLAMA_FTYPE_ALL_F32              = 0,\n#     LLAMA_FTYPE_MOSTLY_F16           = 1,  // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q4_0          = 2,  // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q4_1          = 3,  // except 1d tensors\n#     // LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16 = 4,  // tok_embeddings.weight and output.weight are F16\n#     // 
LLAMA_FTYPE_MOSTLY_Q4_2       = 5,  // support has been removed\n#     // LLAMA_FTYPE_MOSTLY_Q4_3       = 6,  // support has been removed\n#     LLAMA_FTYPE_MOSTLY_Q8_0          = 7,  // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q5_0          = 8,  // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q5_1          = 9,  // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q2_K          = 10, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q3_K_S        = 11, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q3_K_M        = 12, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q3_K_L        = 13, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q4_K_S        = 14, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q4_K_M        = 15, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q5_K_S        = 16, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q5_K_M        = 17, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q6_K          = 18, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_IQ2_XXS       = 19, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_IQ2_XS        = 20, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_Q2_K_S        = 21, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_IQ3_XS        = 22, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_IQ3_XXS       = 23, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_IQ1_S         = 24, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_IQ4_NL        = 25, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_IQ3_S         = 26, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_IQ3_M         = 27, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_IQ2_S         = 28, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_IQ2_M         = 29, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_IQ4_XS        = 30, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_IQ1_M         = 31, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_BF16          = 32, // except 1d tensors\n#     //LLAMA_FTYPE_MOSTLY_Q4_0_4_4      = 33, // removed from gguf files, use Q4_0 and runtime repack\n#     
//LLAMA_FTYPE_MOSTLY_Q4_0_4_8      = 34, // removed from gguf files, use Q4_0 and runtime repack\n#     //LLAMA_FTYPE_MOSTLY_Q4_0_8_8      = 35, // removed from gguf files, use Q4_0 and runtime repack\n#     LLAMA_FTYPE_MOSTLY_TQ1_0         = 36, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_TQ2_0         = 37, // except 1d tensors\n#     LLAMA_FTYPE_MOSTLY_MXFP4_MOE     = 38, // except 1d tensors\n#\n#     LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file\n# };\nLLAMA_FTYPE_ALL_F32 = 0\nLLAMA_FTYPE_MOSTLY_F16 = 1\nLLAMA_FTYPE_MOSTLY_Q4_0 = 2\nLLAMA_FTYPE_MOSTLY_Q4_1 = 3\nLLAMA_FTYPE_MOSTLY_Q8_0 = 7\nLLAMA_FTYPE_MOSTLY_Q5_0 = 8\nLLAMA_FTYPE_MOSTLY_Q5_1 = 9\nLLAMA_FTYPE_MOSTLY_Q2_K = 10\nLLAMA_FTYPE_MOSTLY_Q3_K_S = 11\nLLAMA_FTYPE_MOSTLY_Q3_K_M = 12\nLLAMA_FTYPE_MOSTLY_Q3_K_L = 13\nLLAMA_FTYPE_MOSTLY_Q4_K_S = 14\nLLAMA_FTYPE_MOSTLY_Q4_K_M = 15\nLLAMA_FTYPE_MOSTLY_Q5_K_S = 16\nLLAMA_FTYPE_MOSTLY_Q5_K_M = 17\nLLAMA_FTYPE_MOSTLY_Q6_K = 18\nLLAMA_FTYPE_MOSTLY_IQ2_XXS = 19\nLLAMA_FTYPE_MOSTLY_IQ2_XS = 20\nLLAMA_FTYPE_MOSTLY_Q2_K_S = 21\nLLAMA_FTYPE_MOSTLY_IQ3_XS = 22\nLLAMA_FTYPE_MOSTLY_IQ3_XXS = 23\nLLAMA_FTYPE_MOSTLY_IQ1_S = 24\nLLAMA_FTYPE_MOSTLY_IQ4_NL = 25\nLLAMA_FTYPE_MOSTLY_IQ3_S = 26\nLLAMA_FTYPE_MOSTLY_IQ3_M = 27\nLLAMA_FTYPE_MOSTLY_IQ2_S = 28\nLLAMA_FTYPE_MOSTLY_IQ2_M = 29\nLLAMA_FTYPE_MOSTLY_IQ4_XS = 30\nLLAMA_FTYPE_MOSTLY_IQ1_M = 31\nLLAMA_FTYPE_MOSTLY_BF16 = 32\n# LLAMA_FTYPE_MOSTLY_Q4_0_4_4 = 33\n# LLAMA_FTYPE_MOSTLY_Q4_0_4_8 = 34\n# LLAMA_FTYPE_MOSTLY_Q4_0_8_8 = 35\nLLAMA_FTYPE_MOSTLY_TQ1_0 = 36\nLLAMA_FTYPE_MOSTLY_TQ2_0 = 37\nLLAMA_FTYPE_MOSTLY_MXFP4_MOE = 38\nLLAMA_FTYPE_GUESSED = 1024\n\n# enum llama_rope_scaling_type {\n#     LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED = -1,\n#     LLAMA_ROPE_SCALING_TYPE_NONE        = 0,\n#     LLAMA_ROPE_SCALING_TYPE_LINEAR      = 1,\n#     LLAMA_ROPE_SCALING_TYPE_YARN        = 2,\n#     LLAMA_ROPE_SCALING_TYPE_LONGROPE    = 3,\n#     LLAMA_ROPE_SCALING_TYPE_MAX_VALUE   = LLAMA_ROPE_SCALING_TYPE_LONGROPE,\n# 
};\nLLAMA_ROPE_SCALING_TYPE_UNSPECIFIED = -1\nLLAMA_ROPE_SCALING_TYPE_NONE = 0\nLLAMA_ROPE_SCALING_TYPE_LINEAR = 1\nLLAMA_ROPE_SCALING_TYPE_YARN = 2\nLLAMA_ROPE_SCALING_TYPE_LONGROPE = 3\nLLAMA_ROPE_SCALING_TYPE_MAX_VALUE = LLAMA_ROPE_SCALING_TYPE_LONGROPE\n\n# enum llama_pooling_type {\n#     LLAMA_POOLING_TYPE_UNSPECIFIED = -1,\n#     LLAMA_POOLING_TYPE_NONE = 0,\n#     LLAMA_POOLING_TYPE_MEAN = 1,\n#     LLAMA_POOLING_TYPE_CLS  = 2,\n#     LLAMA_POOLING_TYPE_LAST = 3,\n#     LLAMA_POOLING_TYPE_RANK = 4, // used by reranking models to attach the classification head to the graph\n# };\nLLAMA_POOLING_TYPE_UNSPECIFIED = -1\nLLAMA_POOLING_TYPE_NONE = 0\nLLAMA_POOLING_TYPE_MEAN = 1\nLLAMA_POOLING_TYPE_CLS = 2\nLLAMA_POOLING_TYPE_LAST = 3\nLLAMA_POOLING_TYPE_RANK = 4\n\n# enum llama_attention_type {\n#     LLAMA_ATTENTION_TYPE_UNSPECIFIED = -1,\n#     LLAMA_ATTENTION_TYPE_CAUSAL      = 0,\n#     LLAMA_ATTENTION_TYPE_NON_CAUSAL  = 1,\n# };\nLLAMA_ATTENTION_TYPE_UNSPECIFIED = -1\nLLAMA_ATTENTION_TYPE_CAUSAL = 0\nLLAMA_ATTENTION_TYPE_NON_CAUSAL = 1\n\n\n# enum llama_split_mode {\n#     LLAMA_SPLIT_MODE_NONE  = 0, // single GPU\n#     LLAMA_SPLIT_MODE_LAYER = 1, // split layers and KV across GPUs\n#     LLAMA_SPLIT_MODE_ROW   = 2, // split layers and KV across GPUs, use tensor parallelism if supported\n# };\nLLAMA_SPLIT_MODE_NONE = 0\nLLAMA_SPLIT_MODE_LAYER = 1\nLLAMA_SPLIT_MODE_ROW = 2\n\n\n# typedef struct llama_token_data {\n#     llama_token id; // token id\n#     float logit;    // log-odds of the token\n#     float p;        // probability of the token\n# } llama_token_data;\nclass llama_token_data(ctypes.Structure):\n    \"\"\"Used to store token data\n\n    Attributes:\n        id (llama_token): token id\n        logit (float): log-odds of the token\n        p (float): probability of the token\"\"\"\n\n    if TYPE_CHECKING:\n        id: llama_token\n        logit: float\n        p: float\n\n    _fields_ = [\n        (\"id\", llama_token),\n        (\"logit\", 
ctypes.c_float),\n        (\"p\", ctypes.c_float),\n    ]\n\n\nllama_token_data_p = ctypes.POINTER(llama_token_data)\n\n\n# typedef struct llama_token_data_array {\n#     // TODO: consider SoA\n#     // NOTE: this pointer can be modified by the samplers\n#     llama_token_data * data;\n#     size_t size;\n#     int64_t selected; // this is the index in the data array (i.e. not the token id)\n#     bool sorted;\n# } llama_token_data_array;\nclass llama_token_data_array(ctypes.Structure):\n    \"\"\"Used to sample tokens given logits\n\n    Attributes:\n        data (ctypes.Array[llama_token_data]): token data\n        size (int): size of the array\n        selected (int): index in the data array (i.e. not the token id)\n        sorted (bool): whether the array is sorted\"\"\"\n\n    if TYPE_CHECKING:\n        data: CtypesArray[llama_token_data]\n        size: int\n        selected: int\n        sorted: bool\n\n    _fields_ = [\n        (\"data\", llama_token_data_p),\n        (\"size\", ctypes.c_size_t),\n        (\"selected\", ctypes.c_int64),\n        (\"sorted\", ctypes.c_bool),\n    ]\n\n\nllama_token_data_array_p = ctypes.POINTER(llama_token_data_array)\n\n# typedef bool (*llama_progress_callback)(float progress, void * user_data);\nllama_progress_callback = ctypes.CFUNCTYPE(\n    ctypes.c_bool, ctypes.c_float, ctypes.c_void_p\n)\n\n\n# // Input data for llama_encode/llama_decode\n# // A llama_batch object can contain input about one or many sequences\n# // The provided arrays (i.e. token, embd, pos, etc.) must have size of n_tokens\n# //\n# // - token  : the token ids of the input (used when embd is NULL)\n# // - embd   : token embeddings (i.e. 
float vector of size n_embd) (used when token is NULL)\n# // - pos    : the positions of the respective token in the sequence\n# //            (if set to NULL, the token position will be tracked automatically by llama_encode/llama_decode)\n# // - seq_id : the sequence to which the respective token belongs\n# //            (if set to NULL, the sequence ID will be assumed to be 0)\n# // - logits : if zero, the logits (and/or the embeddings) for the respective token will not be output\n# //            (if set to NULL:\n# //               - if embeddings: all tokens are output\n# //               - if not:        only the last token is output\n# //            )\n# //\n# typedef struct llama_batch {\n#     int32_t n_tokens;\n\n#     llama_token  *  token;\n#     float        *  embd;\n#     llama_pos    *  pos;\n#     int32_t      *  n_seq_id;\n#     llama_seq_id ** seq_id;\n#     int8_t       *  logits;   // TODO: rename this to \"output\"\n# } llama_batch;\nclass llama_batch(ctypes.Structure):\n    \"\"\"Input data for llama_encode/llama_decode\n\n    A llama_batch object can contain input about one or many sequences\n\n    The provided arrays (i.e. token, embd, pos, etc.) must have size of n_tokens\n\n    Attributes:\n        n_tokens (int): number of tokens\n        token (ctypes.Array[llama_token]): the token ids of the input (used when embd is NULL)\n        embd (ctypes.Array[ctypes.c_float]): token embeddings (i.e. 
float vector of size n_embd) (used when token is NULL)\n        pos (ctypes.Array[llama_pos]): the positions of the respective token in the sequence\n        n_seq_id (ctypes.Array[ctypes.c_int32]): the number of sequence ids for the respective token\n        seq_id (ctypes.Array[ctypes.Array[llama_seq_id]]): the sequence to which the respective token belongs\n        logits (ctypes.Array[ctypes.c_int8]): if zero, the logits for the respective token will not be output\n    \"\"\"\n\n    if TYPE_CHECKING:\n        n_tokens: int\n        token: CtypesArray[llama_token]\n        embd: CtypesArray[ctypes.c_float]\n        pos: CtypesArray[llama_pos]\n        n_seq_id: CtypesArray[ctypes.c_int]\n        seq_id: CtypesArray[CtypesArray[llama_seq_id]]\n        logits: CtypesArray[ctypes.c_int8]\n\n    _fields_ = [\n        (\"n_tokens\", ctypes.c_int32),\n        (\"token\", ctypes.POINTER(llama_token)),\n        (\"embd\", ctypes.POINTER(ctypes.c_float)),\n        (\"pos\", ctypes.POINTER(llama_pos)),\n        (\"n_seq_id\", ctypes.POINTER(ctypes.c_int32)),\n        (\"seq_id\", ctypes.POINTER(ctypes.POINTER(llama_seq_id))),\n        (\"logits\", ctypes.POINTER(ctypes.c_int8)),\n    ]\n\n\n# enum llama_model_kv_override_type {\n#     LLAMA_KV_OVERRIDE_TYPE_INT,\n#     LLAMA_KV_OVERRIDE_TYPE_FLOAT,\n#     LLAMA_KV_OVERRIDE_TYPE_BOOL,\n#     LLAMA_KV_OVERRIDE_TYPE_STR,\n# };\nLLAMA_KV_OVERRIDE_TYPE_INT = 0\nLLAMA_KV_OVERRIDE_TYPE_FLOAT = 1\nLLAMA_KV_OVERRIDE_TYPE_BOOL = 2\nLLAMA_KV_OVERRIDE_TYPE_STR = 3\n\n\n# struct llama_model_kv_override {\n#     enum llama_model_kv_override_type tag;\n\n#     char key[128];\n\n\n#     union {\n#         int64_t val_i64;\n#         double  val_f64;\n#         bool    val_bool;\n#         char    val_str[128];\n#     };\n# };\nclass llama_model_kv_override_value(ctypes.Union):\n    _fields_ = [\n        (\"val_i64\", ctypes.c_int64),\n        (\"val_f64\", ctypes.c_double),\n        (\"val_bool\", ctypes.c_bool),\n        (\"val_str\", ctypes.c_char * 128),\n    ]\n\n    if TYPE_CHECKING:\n        
val_i64: int\n        val_f64: float\n        val_bool: bool\n        val_str: bytes\n\n\nclass llama_model_kv_override(ctypes.Structure):\n    _fields_ = [\n        (\"tag\", ctypes.c_int),\n        (\"key\", ctypes.c_char * 128),\n        (\"value\", llama_model_kv_override_value),\n    ]\n\n    if TYPE_CHECKING:\n        tag: int\n        key: bytes\n        value: Union[int, float, bool, bytes]\n\n\n# struct llama_model_tensor_buft_override {\n#     const char * pattern;\n#     ggml_backend_buffer_type_t buft;\n# };\n\n\n# struct llama_model_params {\n#     // NULL-terminated list of devices to use for offloading (if NULL, all available devices are used)\n#     ggml_backend_dev_t * devices;\n\n#     // NULL-terminated list of buffer types to use for tensors that match a pattern\n#     const struct llama_model_tensor_buft_override * tensor_buft_overrides;\n\n#     int32_t n_gpu_layers; // number of layers to store in VRAM\n#     enum llama_split_mode split_mode; // how to split the model across multiple GPUs\n\n#     // the GPU that is used for the entire model when split_mode is LLAMA_SPLIT_MODE_NONE\n#     int32_t main_gpu;\n\n#     // proportion of the model (layers or rows) to offload to each GPU, size: llama_max_devices()\n#     const float * tensor_split;\n\n#     // Called with a progress value between 0.0 and 1.0. 
Pass NULL to disable.\n#     // If the provided progress_callback returns true, model loading continues.\n#     // If it returns false, model loading is immediately aborted.\n#     llama_progress_callback progress_callback;\n\n#     // context pointer passed to the progress callback\n#     void * progress_callback_user_data;\n\n#     // override key-value pairs of the model meta data\n#     const struct llama_model_kv_override * kv_overrides;\n\n#     // Keep the booleans together to avoid misalignment during copy-by-value.\n#     bool vocab_only;    // only load the vocabulary, no weights\n#     bool use_mmap;      // use mmap if possible\n#     bool use_mlock;     // force system to keep model in RAM\n#     bool check_tensors; // validate model tensor data\n#     bool use_extra_bufts; // use extra buffer types (used for weight repacking)\n# };\nclass llama_model_params(ctypes.Structure):\n    \"\"\"Parameters for llama_model\n\n    Attributes:\n        devices (ctypes.Array[ggml_backend_dev_t]): NULL-terminated list of devices to use for offloading (if NULL, all available devices are used)\n        tensor_buft_overrides (ctypes.Array[llama_model_tensor_buft_override]): NULL-terminated list of buffer types to use for tensors that match a pattern\n        n_gpu_layers (int): number of layers to store in VRAM\n        split_mode (int): how to split the model across multiple GPUs\n        main_gpu (int): the GPU that is used for the entire model when split_mode is LLAMA_SPLIT_MODE_NONE\n        tensor_split (ctypes.Array[ctypes.c_float]): proportion of the model (layers or rows) to offload to each GPU, size: llama_max_devices()\n        progress_callback (llama_progress_callback): called with a progress value between 0.0 and 1.0. Pass NULL to disable. If the provided progress_callback returns true, model loading continues. 
If it returns false, model loading is immediately aborted.\n        progress_callback_user_data (ctypes.c_void_p): context pointer passed to the progress callback\n        kv_overrides (ctypes.Array[llama_model_kv_override]): override key-value pairs of the model meta data\n        vocab_only (bool): only load the vocabulary, no weights\n        use_mmap (bool): use mmap if possible\n        use_mlock (bool): force system to keep model in RAM\n        check_tensors (bool): validate model tensor data\n        use_extra_bufts (bool): use extra buffer types (used for weight repacking)\"\"\"\n\n    if TYPE_CHECKING:\n        devices: CtypesArray[ctypes.c_void_p]  # NOTE: unused\n        tensor_buft_overrides: CtypesArray[llama_model_tensor_buft_override] # NOTE: unused\n        n_gpu_layers: int\n        split_mode: int\n        main_gpu: int\n        tensor_split: CtypesArray[ctypes.c_float]\n        progress_callback: Callable[[float, ctypes.c_void_p], bool]\n        progress_callback_user_data: ctypes.c_void_p\n        kv_overrides: CtypesArray[llama_model_kv_override]\n        vocab_only: bool\n        use_mmap: bool\n        use_mlock: bool\n        check_tensors: bool\n        use_extra_bufts: bool\n\n    _fields_ = [\n        (\"devices\", ctypes.c_void_p), # NOTE: unused\n        (\"tensor_buft_overrides\", ctypes.c_void_p), # NOTE: unused\n        (\"n_gpu_layers\", ctypes.c_int32),\n        (\"split_mode\", ctypes.c_int),\n        (\"main_gpu\", ctypes.c_int32),\n        (\"tensor_split\", ctypes.POINTER(ctypes.c_float)),\n        (\"progress_callback\", llama_progress_callback),\n        (\"progress_callback_user_data\", ctypes.c_void_p),\n        (\"kv_overrides\", ctypes.POINTER(llama_model_kv_override)),\n        (\"vocab_only\", ctypes.c_bool),\n        (\"use_mmap\", ctypes.c_bool),\n        (\"use_mlock\", ctypes.c_bool),\n        (\"check_tensors\", ctypes.c_bool),\n        (\"use_extra_bufts\", ctypes.c_bool),\n    ]\n\n\n# // NOTE: changing 
the default values of parameters marked as [EXPERIMENTAL] may cause crashes or incorrect results in certain configurations\n# //       https://github.com/ggml-org/llama.cpp/pull/7544\n# struct llama_context_params {\n#     uint32_t n_ctx;             // text context, 0 = from model\n#     uint32_t n_batch;           // logical maximum batch size that can be submitted to llama_decode\n#     uint32_t n_ubatch;          // physical maximum batch size\n#     uint32_t n_seq_max;         // max number of sequences (i.e. distinct states for recurrent models)\n#     int32_t  n_threads;         // number of threads to use for generation\n#     int32_t  n_threads_batch;   // number of threads to use for batch processing\n\n#     enum llama_rope_scaling_type rope_scaling_type; // RoPE scaling type, from `enum llama_rope_scaling_type`\n#     enum llama_pooling_type      pooling_type;      // whether to pool (sum) embedding results by sequence id\n#     enum llama_attention_type    attention_type;    // attention type to use for embeddings\n\n#     // ref: https://github.com/ggml-org/llama.cpp/pull/2054\n#     float    rope_freq_base;   // RoPE base frequency, 0 = from model\n#     float    rope_freq_scale;  // RoPE frequency scaling factor, 0 = from model\n#     float    yarn_ext_factor;  // YaRN extrapolation mix factor, negative = from model\n#     float    yarn_attn_factor; // YaRN magnitude scaling factor\n#     float    yarn_beta_fast;   // YaRN low correction dim\n#     float    yarn_beta_slow;   // YaRN high correction dim\n#     uint32_t yarn_orig_ctx;    // YaRN original context size\n#     float    defrag_thold;     // defragment the KV cache if holes/size > thold, <= 0 disabled (default)\n\n#     ggml_backend_sched_eval_callback cb_eval;\n#     void * cb_eval_user_data;\n\n#     enum ggml_type type_k; // data type for K cache [EXPERIMENTAL]\n#     enum ggml_type type_v; // data type for V cache [EXPERIMENTAL]\n\n#     // Abort callback\n#     // if it returns true, 
execution of llama_decode() will be aborted\n#     // currently works only with CPU execution\n#     ggml_abort_callback abort_callback;\n#     void *              abort_callback_data;\n\n#     // Keep the booleans together and at the end of the struct to avoid misalignment during copy-by-value.\n#     bool embeddings;  // if true, extract embeddings (together with logits)\n#     bool offload_kqv; // offload the KQV ops (including the KV cache) to GPU\n#     bool flash_attn;  // use flash attention [EXPERIMENTAL]\n#     bool no_perf;     // measure performance timings\n#     bool op_offload;  // offload host tensor operations to device\n#     bool swa_full;    // use full-size SWA cache (https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)\n#                       // NOTE: setting to false when n_seq_max > 1 can cause bad performance in some cases\n#                       //       ref: https://github.com/ggml-org/llama.cpp/pull/13845#issuecomment-2924800573\n#     bool kv_unified;  // use a unified buffer across the input sequences when computing the attention\n#                       // try to disable when n_seq_max > 1 for improved performance when the sequences do not share a large prefix\n#                       // ref: https://github.com/ggml-org/llama.cpp/pull/14363\n# };\nclass llama_context_params(ctypes.Structure):\n    \"\"\"Parameters for llama_context\n\n    Attributes:\n        n_ctx (int): text context, 0 = from model\n        n_batch (int): logical maximum batch size that can be submitted to llama_decode\n        n_ubatch (int): physical maximum batch size\n        n_seq_max (int): max number of sequences (i.e. 
distinct states for recurrent models)\n        n_threads (int): number of threads to use for generation\n        n_threads_batch (int): number of threads to use for batch processing\n        rope_scaling_type (int): RoPE scaling type, from `enum llama_rope_scaling_type`\n        pooling_type (int): whether to pool (sum) embedding results by sequence id (ignored if no pooling layer)\n        attention_type (int): attention type to use for embeddings\n        rope_freq_base (float): RoPE base frequency, 0 = from model\n        rope_freq_scale (float): RoPE frequency scaling factor, 0 = from model\n        yarn_ext_factor (float): YaRN extrapolation mix factor, negative = from model\n        yarn_attn_factor (float): YaRN magnitude scaling factor\n        yarn_beta_fast (float): YaRN low correction dim\n        yarn_beta_slow (float): YaRN high correction dim\n        yarn_orig_ctx (int): YaRN original context size\n        defrag_thold (float): defragment the KV cache if holes/size > thold, <= 0 disabled (default)\n        cb_eval (ggml_backend_sched_eval_callback): callback for scheduling eval\n        cb_eval_user_data (ctypes.c_void_p): user data for cb_eval\n        type_k (int): data type for K cache\n        type_v (int): data type for V cache\n        abort_callback (ggml_abort_callback): abort callback; if it returns true, execution of llama_decode() will be aborted\n        abort_callback_data (ctypes.c_void_p): data for abort_callback\n        embeddings (bool): if true, extract embeddings (together with logits)\n        offload_kqv (bool): whether to offload the KQV ops (including the KV cache) to GPU\n        flash_attn (bool): whether to use flash attention\n        no_perf (bool): whether to measure performance timings\n        op_offload (bool): offload host tensor operations to device\n        swa_full (bool): use full-size SWA cache\n        kv_unified (bool): use a unified buffer across the input sequences when computing the attention\n 
   \"\"\"\n\n    if TYPE_CHECKING:\n        n_ctx: int\n        n_batch: int\n        n_ubatch: int\n        n_seq_max: int\n        n_threads: int\n        n_threads_batch: int\n        rope_scaling_type: int\n        pooling_type: int\n        attention_type: int\n        rope_freq_base: float\n        rope_freq_scale: float\n        yarn_ext_factor: float\n        yarn_attn_factor: float\n        yarn_beta_fast: float\n        yarn_beta_slow: float\n        yarn_orig_ctx: int\n        defrag_thold: float\n        cb_eval: Callable[[ctypes.c_void_p, bool], bool]\n        cb_eval_user_data: ctypes.c_void_p\n        type_k: int\n        type_v: int\n        abort_callback: Callable[[ctypes.c_void_p], bool]\n        abort_callback_data: ctypes.c_void_p\n        embeddings: bool\n        offload_kqv: bool\n        flash_attn: bool\n        no_perf: bool\n        op_offload: bool\n        swa_full: bool\n        kv_unified: bool\n\n    _fields_ = [\n        (\"n_ctx\", ctypes.c_uint32),\n        (\"n_batch\", ctypes.c_uint32),\n        (\"n_ubatch\", ctypes.c_uint32),\n        (\"n_seq_max\", ctypes.c_uint32),\n        (\"n_threads\", ctypes.c_int32),\n        (\"n_threads_batch\", ctypes.c_int32),\n        (\"rope_scaling_type\", ctypes.c_int),\n        (\"pooling_type\", ctypes.c_int),\n        (\"attention_type\", ctypes.c_int),\n        (\"rope_freq_base\", ctypes.c_float),\n        (\"rope_freq_scale\", ctypes.c_float),\n        (\"yarn_ext_factor\", ctypes.c_float),\n        (\"yarn_attn_factor\", ctypes.c_float),\n        (\"yarn_beta_fast\", ctypes.c_float),\n        (\"yarn_beta_slow\", ctypes.c_float),\n        (\"yarn_orig_ctx\", ctypes.c_uint32),\n        (\"defrag_thold\", ctypes.c_float),\n        (\"cb_eval\", ggml_backend_sched_eval_callback),\n        (\"cb_eval_user_data\", ctypes.c_void_p),\n        (\"type_k\", ctypes.c_int),\n        (\"type_v\", ctypes.c_int),\n        (\"abort_callback\", ggml_abort_callback),\n        (\"abort_callback_data\", 
ctypes.c_void_p),\n        (\"embeddings\", ctypes.c_bool),\n        (\"offload_kqv\", ctypes.c_bool),\n        (\"flash_attn\", ctypes.c_bool),\n        (\"no_perf\", ctypes.c_bool),\n        (\"op_offload\", ctypes.c_bool),\n        (\"swa_full\", ctypes.c_bool),\n        (\"kv_unified\", ctypes.c_bool),\n    ]\n\n\n# // Signature for logging events\n# // Note that text includes the new line character at the end for most events.\n# // If your logging mechanism cannot handle that, check if the last character is '\\n' and strip it\n# // if it exists.\n# // It might not exist for progress report where '.' is output repeatedly.\n# typedef void (*llama_log_callback)(enum llama_log_level level, const char * text, void * user_data);\nllama_log_callback = ctypes.CFUNCTYPE(\n    None, ctypes.c_int, ctypes.c_char_p, ctypes.c_void_p\n)\n\"\"\"Signature for logging events\nNote that text includes the new line character at the end for most events.\nIf your logging mechanism cannot handle that, check if the last character is '\\n' and strip it\nif it exists.\nIt might not exist for progress report where '.' 
is output repeatedly.\"\"\"\n\n\n# // model quantization parameters\n# typedef struct llama_model_quantize_params {\n#     int32_t nthread;                      // number of threads to use for quantizing, if <=0 will use std::thread::hardware_concurrency()\n#     enum llama_ftype ftype;               // quantize to this llama_ftype\n#     enum ggml_type output_tensor_type;    // output tensor type\n#     enum ggml_type token_embedding_type;  // token embeddings tensor type\n#     bool allow_requantize;                // allow quantizing non-f32/f16 tensors\n#     bool quantize_output_tensor;          // quantize output.weight\n#     bool only_copy;                       // only copy tensors - ftype, allow_requantize and quantize_output_tensor are ignored\n#     bool pure;                            // quantize all tensors to the default type\n#     bool keep_split;                      // quantize to the same number of shards\n#     void * imatrix;                       // pointer to importance matrix data\n#     void * kv_overrides;                  // pointer to vector containing overrides\n#     void * tensor_types;                  // pointer to vector containing tensor types\n#     void * prune_layers;                  // pointer to vector containing layer indices to prune\n# } llama_model_quantize_params;\nclass llama_model_quantize_params(ctypes.Structure):\n    \"\"\"Parameters for llama_model_quantize\n\n    Attributes:\n        nthread (int): number of threads to use for quantizing, if <=0 will use std::thread::hardware_concurrency()\n        ftype (int): quantize to this llama_ftype\n        output_tensor_type (int): output tensor type\n        token_embedding_type (int): token embeddings tensor type\n        allow_requantize (bool): allow quantizing non-f32/f16 tensors\n        quantize_output_tensor (bool): quantize output.weight\n        only_copy (bool): only copy tensors - ftype, allow_requantize and quantize_output_tensor are ignored\n        pure 
(bool): quantize all tensors to the default type\n        keep_split (bool): quantize to the same number of shards\n        imatrix (ctypes.c_void_p): pointer to importance matrix data\n        kv_overrides (ctypes.c_void_p): pointer to vector containing overrides\n        tensor_types (ctypes.c_void_p): pointer to vector containing tensor types\n        prune_layers (ctypes.c_void_p): pointer to vector containing layer indices to prune\n    \"\"\"\n\n    if TYPE_CHECKING:\n        nthread: int\n        ftype: int\n        output_tensor_type: int\n        token_embedding_type: int\n        allow_requantize: bool\n        quantize_output_tensor: bool\n        only_copy: bool\n        pure: bool\n        keep_split: bool\n        imatrix: ctypes.c_void_p\n        kv_overrides: ctypes.c_void_p\n        tensor_types: ctypes.c_void_p\n        prune_layers: ctypes.c_void_p\n\n    _fields_ = [\n        (\"nthread\", ctypes.c_int32),\n        (\"ftype\", ctypes.c_int),\n        (\"output_tensor_type\", ctypes.c_int),\n        (\"token_embedding_type\", ctypes.c_int),\n        (\"allow_requantize\", ctypes.c_bool),\n        (\"quantize_output_tensor\", ctypes.c_bool),\n        (\"only_copy\", ctypes.c_bool),\n        (\"pure\", ctypes.c_bool),\n        (\"keep_split\", ctypes.c_bool),\n        (\"imatrix\", ctypes.c_void_p),\n        (\"kv_overrides\", ctypes.c_void_p),\n        (\"tensor_types\", ctypes.c_void_p),\n        (\"prune_layers\", ctypes.c_void_p),\n    ]\n\n\n# typedef struct llama_logit_bias {\n#     llama_token token;\n#     float bias;\n# } llama_logit_bias;\nclass llama_logit_bias(ctypes.Structure):\n    \"\"\"Used to store logit bias\n\n    Attributes:\n        token (llama_token): token id\n        bias (float): bias\"\"\"\n\n    if TYPE_CHECKING:\n        token: llama_token\n        bias: float\n\n    _fields_ = [\n        (\"token\", llama_token),\n        (\"bias\", ctypes.c_float),\n    ]\n\n\nllama_logit_bias_p = 
ctypes.POINTER(llama_logit_bias)\n\n\n# typedef struct llama_sampler_chain_params {\n#     bool no_perf; // whether to measure performance timings\n# } llama_sampler_chain_params;\nclass llama_sampler_chain_params(ctypes.Structure):\n    \"\"\"Parameters for llama_sampler_chain\n\n    Attributes:\n        no_perf (bool): whether to measure performance timings\"\"\"\n\n    if TYPE_CHECKING:\n        no_perf: bool\n\n    _fields_ = [\n        (\"no_perf\", ctypes.c_bool),\n    ]\n\n\n# // used in chat template\n# typedef struct llama_chat_message {\n#     const char * role;\n#     const char * content;\n# } llama_chat_message;\nclass llama_chat_message(ctypes.Structure):\n    _fields_ = [\n        (\"role\", ctypes.c_char_p),\n        (\"content\", ctypes.c_char_p),\n    ]\n\n\n# // lora adapter\n# struct llama_adapter_lora;\nllama_adapter_lora_p = ctypes.c_void_p\nllama_adapter_lora_p_ctypes = ctypes.POINTER(ctypes.c_void_p)\n\n\n# // Helpers for getting default parameters\n# LLAMA_API struct llama_model_params          llama_model_default_params(void);\n@ctypes_function(\n    \"llama_model_default_params\",\n    [],\n    llama_model_params,\n)\ndef llama_model_default_params() -> llama_model_params:\n    \"\"\"Get default parameters for llama_model\"\"\"\n    ...\n\n\n# LLAMA_API struct llama_context_params        llama_context_default_params(void);\n@ctypes_function(\n    \"llama_context_default_params\",\n    [],\n    llama_context_params,\n)\ndef llama_context_default_params() -> llama_context_params:\n    \"\"\"Get default parameters for llama_context\"\"\"\n    ...\n\n\n# LLAMA_API struct llama_sampler_chain_params  llama_sampler_chain_default_params(void);\n@ctypes_function(\n    \"llama_sampler_chain_default_params\",\n    [],\n    llama_sampler_chain_params,\n)\ndef llama_sampler_chain_default_params() -> llama_sampler_chain_params:\n    \"\"\"Get default parameters for llama_sampler_chain\"\"\"\n    ...\n\n\n# LLAMA_API struct llama_model_quantize_params 
llama_model_quantize_default_params(void);\n@ctypes_function(\n    \"llama_model_quantize_default_params\",\n    [],\n    llama_model_quantize_params,\n)\ndef llama_model_quantize_default_params() -> llama_model_quantize_params:\n    \"\"\"Get default parameters for llama_model_quantize\"\"\"\n    ...\n\n\n# // Initialize the llama + ggml backend\n# // If numa is true, use NUMA optimizations\n# // Call once at the start of the program\n# LLAMA_API void llama_backend_init(void);\n@ctypes_function(\n    \"llama_backend_init\",\n    [],\n    None,\n)\ndef llama_backend_init():\n    \"\"\"Initialize the llama + ggml backend\n    Call once at the start of the program\"\"\"\n    ...\n\n\n# // numa strategies\n# enum ggml_numa_strategy {\n#     GGML_NUMA_STRATEGY_DISABLED   = 0,\n#     GGML_NUMA_STRATEGY_DISTRIBUTE = 1,\n#     GGML_NUMA_STRATEGY_ISOLATE    = 2,\n#     GGML_NUMA_STRATEGY_NUMACTL    = 3,\n#     GGML_NUMA_STRATEGY_MIRROR     = 4,\n#     GGML_NUMA_STRATEGY_COUNT\n# };\nGGML_NUMA_STRATEGY_DISABLED = 0\nGGML_NUMA_STRATEGY_DISTRIBUTE = 1\nGGML_NUMA_STRATEGY_ISOLATE = 2\nGGML_NUMA_STRATEGY_NUMACTL = 3\nGGML_NUMA_STRATEGY_MIRROR = 4\nGGML_NUMA_STRATEGY_COUNT = 5\n\n\n# // Call once at the end of the program - currently only used for MPI\n# LLAMA_API void llama_backend_free(void);\n@ctypes_function(\n    \"llama_backend_free\",\n    [],\n    None,\n)\ndef llama_backend_free():\n    \"\"\"Call once at the end of the program - currently only used for MPI\"\"\"\n    ...\n\n\n# //optional:\n# LLAMA_API void llama_numa_init(enum ggml_numa_strategy numa);\n@ctypes_function(\n    \"llama_numa_init\",\n    [ctypes.c_int],\n    None,\n)\ndef llama_numa_init(numa: int, /):\n    ...\n\n\n# // Optional: an auto threadpool gets created in ggml if not passed explicitly\n# LLAMA_API void llama_attach_threadpool(\n#         struct llama_context * ctx,\n#            ggml_threadpool_t   threadpool,\n#            ggml_threadpool_t   threadpool_batch);\n# TODO: Add 
llama_attach_threadpool\n\n\n# LLAMA_API void llama_detach_threadpool(struct llama_context * ctx);\n# TODO: Add llama_detach_threadpool\n\n\n# DEPRECATED(LLAMA_API struct llama_model * llama_load_model_from_file(\n#                          const char * path_model,\n#           struct llama_model_params   params),\n#         \"use llama_model_load_from_file instead\");\n@ctypes_function(\n    \"llama_load_model_from_file\",\n    [ctypes.c_char_p, llama_model_params],\n    llama_model_p_ctypes,\n)\ndef llama_load_model_from_file(\n    path_model: bytes, params: llama_model_params, /\n) -> Optional[llama_model_p]:\n    ...\n\n\n# // Load the model from a file\n# // If the file is split into multiple parts, the file name must follow this pattern: <name>-%05d-of-%05d.gguf\n# // If the split file name does not follow this pattern, use llama_model_load_from_splits\n# LLAMA_API struct llama_model * llama_model_load_from_file(\n#                          const char * path_model,\n#           struct llama_model_params   params);\n@ctypes_function(\n    \"llama_model_load_from_file\",\n    [ctypes.c_char_p, llama_model_params],\n    llama_model_p_ctypes,\n)\ndef llama_model_load_from_file(\n    path_model: bytes, params: llama_model_params, /\n) -> Optional[llama_model_p]:\n    \"\"\"Load the model from a file\n\n    If the file is split into multiple parts, the file name must follow this pattern: <name>-%05d-of-%05d.gguf\n\n    If the split file name does not follow this pattern, use llama_model_load_from_splits\"\"\"\n    ...\n\n\n# // Load the model from multiple splits (support custom naming scheme)\n# // The paths must be in the correct order\n# LLAMA_API struct llama_model * llama_model_load_from_splits(\n#                          const char ** paths,\n#                              size_t    n_paths,\n#           struct llama_model_params    params);\n@ctypes_function(\n    \"llama_model_load_from_splits\",\n    [ctypes.POINTER(ctypes.c_char_p), ctypes.c_size_t, 
llama_model_params],\n    llama_model_p_ctypes,\n)\ndef llama_model_load_from_splits(\n    paths: List[bytes], n_paths: int, params: llama_model_params, /\n) -> Optional[llama_model_p]:\n    \"\"\"Load the model from multiple splits (support custom naming scheme)\n\n    The paths must be in the correct order\"\"\"\n    ...\n\n\n# LLAMA_API void llama_model_save_to_file(\n#         const struct llama_model * model,\n#                     const char * path_model);\n@ctypes_function(\n    \"llama_model_save_to_file\",\n    [llama_model_p_ctypes, ctypes.c_char_p],\n    None,\n)\ndef llama_model_save_to_file(model: llama_model_p, path_model: bytes, /):\n    \"\"\"Save the model to a file\"\"\"\n    ...\n\n\n# DEPRECATED(LLAMA_API void llama_free_model(struct llama_model * model),\n#         \"use llama_model_free instead\");\n@ctypes_function(\n    \"llama_free_model\",\n    [llama_model_p_ctypes],\n    None,\n)\ndef llama_free_model(model: llama_model_p, /):\n    ...\n\n\n# LLAMA_API void llama_model_free(struct llama_model * model);\n@ctypes_function(\n    \"llama_model_free\",\n    [llama_model_p_ctypes],\n    None,\n)\ndef llama_model_free(model: llama_model_p, /):\n    ...\n\n\n# LLAMA_API struct llama_context * llama_init_from_model(\n#                  struct llama_model * model,\n#         struct llama_context_params   params);\n@ctypes_function(\n    \"llama_init_from_model\",\n    [llama_model_p_ctypes, llama_context_params],\n    llama_context_p_ctypes,\n)\ndef llama_init_from_model(\n    model: llama_model_p, params: llama_context_params, /\n) -> Optional[llama_context_p]:\n    ...\n\n\n# DEPRECATED(LLAMA_API struct llama_context * llama_new_context_with_model(\n#                  struct llama_model * model,\n#         struct llama_context_params   params),\n#         \"use llama_init_from_model instead\");\n@ctypes_function(\n    \"llama_new_context_with_model\",\n    [llama_model_p_ctypes, llama_context_params],\n    llama_context_p_ctypes,\n)\ndef 
llama_new_context_with_model(\n    model: llama_model_p, params: llama_context_params, /\n) -> Optional[llama_context_p]:\n    ...\n\n\n# // Frees all allocated memory\n# LLAMA_API void llama_free(struct llama_context * ctx);\n@ctypes_function(\n    \"llama_free\",\n    [llama_context_p_ctypes],\n    None,\n)\ndef llama_free(ctx: llama_context_p, /):\n    \"\"\"Frees all allocated memory\"\"\"\n    ...\n\n\n# LLAMA_API int64_t llama_time_us(void);\n@ctypes_function(\n    \"llama_time_us\",\n    [],\n    ctypes.c_int64,\n)\ndef llama_time_us() -> int:\n    ...\n\n\n# LLAMA_API size_t llama_max_devices(void);\n@ctypes_function(\"llama_max_devices\", [], ctypes.c_size_t)\ndef llama_max_devices() -> int:\n    ...\n\n\n# LLAMA_API size_t llama_max_parallel_sequences(void);\n@ctypes_function(\"llama_max_parallel_sequences\", [], ctypes.c_size_t)\ndef llama_max_parallel_sequences() -> int:\n    ...\n\n\n# LLAMA_API bool llama_supports_mmap       (void);\n@ctypes_function(\"llama_supports_mmap\", [], ctypes.c_bool)\ndef llama_supports_mmap() -> bool:\n    ...\n\n\n# LLAMA_API bool llama_supports_mlock      (void);\n@ctypes_function(\"llama_supports_mlock\", [], ctypes.c_bool)\ndef llama_supports_mlock() -> bool:\n    ...\n\n\n# LLAMA_API bool llama_supports_gpu_offload(void);\n@ctypes_function(\"llama_supports_gpu_offload\", [], ctypes.c_bool)\ndef llama_supports_gpu_offload() -> bool:\n    ...\n\n\n# LLAMA_API bool llama_supports_rpc        (void);\n@ctypes_function(\"llama_supports_rpc\", [], ctypes.c_bool)\ndef llama_supports_rpc() -> bool:\n    ...\n\n\n# LLAMA_API uint32_t llama_n_ctx      (const struct llama_context * ctx);\n@ctypes_function(\"llama_n_ctx\", [llama_context_p_ctypes], ctypes.c_uint32)\ndef llama_n_ctx(ctx: llama_context_p, /) -> int:\n    ...\n\n\n# LLAMA_API uint32_t llama_n_batch    (const struct llama_context * ctx);\n@ctypes_function(\"llama_n_batch\", [llama_context_p_ctypes], ctypes.c_uint32)\ndef llama_n_batch(ctx: llama_context_p, /) -> int:\n 
   ...\n\n\n# LLAMA_API uint32_t llama_n_ubatch   (const struct llama_context * ctx);\n@ctypes_function(\"llama_n_ubatch\", [llama_context_p_ctypes], ctypes.c_uint32)\ndef llama_n_ubatch(ctx: llama_context_p, /) -> int:\n    ...\n\n\n# LLAMA_API uint32_t llama_n_seq_max  (const struct llama_context * ctx);\n@ctypes_function(\"llama_n_seq_max\", [llama_context_p_ctypes], ctypes.c_uint32)\ndef llama_n_seq_max(ctx: llama_context_p, /) -> int:\n    ...\n\n\n# DEPRECATED(LLAMA_API int32_t llama_n_ctx_train(const struct llama_model * model), \"use llama_model_n_ctx_train instead\");\n@ctypes_function(\"llama_n_ctx_train\", [llama_model_p_ctypes], ctypes.c_int32)\ndef llama_n_ctx_train(model: llama_model_p, /) -> int:\n    ...\n\n\n# DEPRECATED(LLAMA_API int32_t llama_n_embd     (const struct llama_model * model), \"use llama_model_n_embd instead\");\n@ctypes_function(\"llama_n_embd\", [llama_model_p_ctypes], ctypes.c_int32)\ndef llama_n_embd(model: llama_model_p, /) -> int:\n    ...\n\n\n# DEPRECATED(LLAMA_API int32_t llama_n_layer    (const struct llama_model * model), \"use llama_model_n_layer instead\");\n@ctypes_function(\"llama_n_layer\", [llama_model_p_ctypes], ctypes.c_int32)\ndef llama_n_layer(model: llama_model_p, /) -> int:\n    ...\n\n\n# DEPRECATED(LLAMA_API int32_t llama_n_head     (const struct llama_model * model), \"use llama_model_n_head instead\");\n@ctypes_function(\"llama_n_head\", [llama_model_p_ctypes], ctypes.c_int32)\ndef llama_n_head(model: llama_model_p, /) -> int:\n    ...\n\n\n# DEPRECATED(LLAMA_API int32_t llama_n_vocab    (const struct llama_vocab * vocab), \"use llama_vocab_n_tokens instead\");\n@ctypes_function(\"llama_n_vocab\", [llama_vocab_p_ctypes], ctypes.c_int32)\ndef llama_n_vocab(vocab: llama_vocab_p, /) -> int:\n    ...\n\n\n# LLAMA_API const struct llama_model * llama_get_model   (const struct llama_context * ctx);\n@ctypes_function(\"llama_get_model\", [llama_context_p_ctypes], llama_model_p_ctypes)\ndef llama_get_model(ctx: 
llama_context_p, /) -> Optional[llama_model_p]:\n    ...\n\n\n# LLAMA_API           llama_memory_t   llama_get_memory  (const struct llama_context * ctx);\n@ctypes_function(\"llama_get_memory\", [llama_context_p_ctypes], llama_memory_t_ctypes)\ndef llama_get_memory(ctx: llama_context_p, /) -> Optional[llama_memory_t]:\n    \"\"\"Get the memory for the context\"\"\"\n    ...\n\n\n# LLAMA_API  enum llama_pooling_type    llama_pooling_type(const struct llama_context * ctx);\n@ctypes_function(\"llama_pooling_type\", [llama_context_p_ctypes], ctypes.c_int)\ndef llama_pooling_type(ctx: llama_context_p, /) -> int:\n    ...\n\n\n# DEPRECATED(LLAMA_API struct llama_kv_cache * llama_get_kv_self(struct llama_context * ctx), \"use llama_get_memory instead\");\n@ctypes_function(\n    \"llama_get_kv_self\",\n    [llama_context_p_ctypes],\n    llama_kv_cache_p_ctypes,\n)\ndef llama_get_kv_self(ctx: llama_context_p, /) -> Optional[llama_kv_cache_p]:\n    \"\"\"Get the KV cache for self-attention (DEPRECATED)\"\"\"\n    ...\n\n\n# LLAMA_API const struct llama_vocab * llama_model_get_vocab(const struct llama_model * model);\n@ctypes_function(\"llama_model_get_vocab\", [llama_model_p_ctypes], llama_vocab_p_ctypes)\ndef llama_model_get_vocab(model: llama_model_p, /) -> Optional[llama_vocab_p]:\n    ...\n\n\n# LLAMA_API enum llama_rope_type       llama_model_rope_type(const struct llama_model * model);\n@ctypes_function(\"llama_model_rope_type\", [llama_model_p_ctypes], ctypes.c_int)\ndef llama_model_rope_type(model: llama_model_p, /) -> int:\n    ...\n\n\n# LLAMA_API int32_t llama_model_n_ctx_train(const struct llama_model * model);\n@ctypes_function(\"llama_model_n_ctx_train\", [llama_model_p_ctypes], ctypes.c_int32)\ndef llama_model_n_ctx_train(model: llama_model_p, /) -> int:\n    ...\n\n\n# LLAMA_API int32_t llama_model_n_embd     (const struct llama_model * model);\n@ctypes_function(\"llama_model_n_embd\", [llama_model_p_ctypes], ctypes.c_int32)\ndef llama_model_n_embd(model: 
llama_model_p, /) -> int:\n    ...\n\n\n# LLAMA_API int32_t llama_model_n_layer    (const struct llama_model * model);\n@ctypes_function(\"llama_model_n_layer\", [llama_model_p_ctypes], ctypes.c_int32)\ndef llama_model_n_layer(model: llama_model_p, /) -> int:\n    ...\n\n\n# LLAMA_API int32_t llama_model_n_head     (const struct llama_model * model);\n@ctypes_function(\"llama_model_n_head\", [llama_model_p_ctypes], ctypes.c_int32)\ndef llama_model_n_head(model: llama_model_p, /) -> int:\n    ...\n\n\n# LLAMA_API int32_t llama_model_n_head_kv  (const struct llama_model * model);\n@ctypes_function(\"llama_model_n_head_kv\", [llama_model_p_ctypes], ctypes.c_int32)\ndef llama_model_n_head_kv(model: llama_model_p, /) -> int:\n    ...\n\n\n# LLAMA_API int32_t llama_model_n_swa      (const struct llama_model * model);\n@ctypes_function(\"llama_model_n_swa\", [llama_model_p_ctypes], ctypes.c_int32)\ndef llama_model_n_swa(model: llama_model_p, /) -> int:\n    ...\n\n\n# // Get the model's RoPE frequency scaling factor\n# LLAMA_API float llama_model_rope_freq_scale_train(const struct llama_model * model);\n@ctypes_function(\"llama_model_rope_freq_scale_train\", [llama_model_p_ctypes], ctypes.c_float)\ndef llama_model_rope_freq_scale_train(model: llama_model_p, /) -> float:\n    \"\"\"Get the model's RoPE frequency scaling factor\"\"\"\n    ...\n\n\n# // Returns the number of classifier outputs (only valid for classifier models)\n# // Undefined behavior for non-classifier models\n# LLAMA_API uint32_t llama_model_n_cls_out(const struct llama_model * model);\n@ctypes_function(\"llama_model_n_cls_out\", [llama_model_p_ctypes], ctypes.c_uint32)\ndef llama_model_n_cls_out(model: llama_model_p, /) -> int:\n    \"\"\"Returns the number of classifier outputs (only valid for classifier models)\n    Undefined behavior for non-classifier models\"\"\"\n    ...\n\n\n# // Returns label of classifier output by index (<n_cls_out). 
Returns nullptr if no label provided\n# LLAMA_API const char * llama_model_cls_label(const struct llama_model * model, uint32_t i);\n@ctypes_function(\"llama_model_cls_label\", [llama_model_p_ctypes, ctypes.c_uint32], ctypes.c_char_p)\ndef llama_model_cls_label(model: llama_model_p, i: int, /) -> Optional[bytes]:\n    \"\"\"Returns label of classifier output by index (< n_cls_out). Returns None if no label provided\"\"\"\n    ...\n\n\n# LLAMA_API enum llama_vocab_type   llama_vocab_type  (const struct llama_vocab * vocab);\n@ctypes_function(\"llama_vocab_type\", [llama_vocab_p_ctypes], ctypes.c_int)\ndef llama_vocab_type(vocab: llama_vocab_p, /) -> int:\n    ...\n\n\n# LLAMA_API int32_t llama_vocab_n_tokens(const struct llama_vocab * vocab);\n@ctypes_function(\"llama_vocab_n_tokens\", [llama_vocab_p_ctypes], ctypes.c_int32)\ndef llama_vocab_n_tokens(vocab: llama_vocab_p, /) -> int:\n    ...\n\n\n# // Functions to access the model's GGUF metadata scalar values\n# // - The functions return the length of the string on success, or -1 on failure\n# // - The output string is always null-terminated and cleared on failure\n# // - When retrieving a string, an extra byte must be allocated to account for the null terminator\n# // - GGUF array values are not supported by these functions\n\n\n# // Get metadata value as a string by key name\n# LLAMA_API int32_t llama_model_meta_val_str(const struct llama_model * model, const char * key, char * buf, size_t buf_size);\n@ctypes_function(\n    \"llama_model_meta_val_str\",\n    [\n        llama_model_p_ctypes,\n        ctypes.c_char_p,\n        ctypes.c_char_p,\n        ctypes.c_size_t,\n    ],\n    ctypes.c_int32,\n)\ndef llama_model_meta_val_str(\n    model: llama_model_p,\n    key: Union[ctypes.c_char_p, bytes],\n    buf: Union[bytes, CtypesArray[ctypes.c_char]],\n    buf_size: int,\n    /,\n) -> int:\n    \"\"\"Get metadata value as a string by key name\"\"\"\n    ...\n\n\n# // Get the number of metadata key/value pairs\n# LLAMA_API int32_t llama_model_meta_count(const struct 
llama_model * model);\n@ctypes_function(\"llama_model_meta_count\", [llama_model_p_ctypes], ctypes.c_int32)\ndef llama_model_meta_count(model: llama_model_p, /) -> int:\n    \"\"\"Get the number of metadata key/value pairs\"\"\"\n    ...\n\n\n# // Get metadata key name by index\n# LLAMA_API int32_t llama_model_meta_key_by_index(const struct llama_model * model, int32_t i, char * buf, size_t buf_size);\n@ctypes_function(\n    \"llama_model_meta_key_by_index\",\n    [\n        llama_model_p_ctypes,\n        ctypes.c_int32,\n        ctypes.c_char_p,\n        ctypes.c_size_t,\n    ],\n    ctypes.c_int32,\n)\ndef llama_model_meta_key_by_index(\n    model: llama_model_p,\n    i: Union[ctypes.c_int, int],\n    buf: Union[bytes, CtypesArray[ctypes.c_char]],\n    buf_size: int,\n    /,\n) -> int:\n    \"\"\"Get metadata key name by index\"\"\"\n    ...\n\n\n# // Get metadata value as a string by index\n# LLAMA_API int32_t llama_model_meta_val_str_by_index(const struct llama_model * model, int32_t i, char * buf, size_t buf_size);\n@ctypes_function(\n    \"llama_model_meta_val_str_by_index\",\n    [\n        llama_model_p_ctypes,\n        ctypes.c_int32,\n        ctypes.c_char_p,\n        ctypes.c_size_t,\n    ],\n    ctypes.c_int32,\n)\ndef llama_model_meta_val_str_by_index(\n    model: llama_model_p,\n    i: Union[ctypes.c_int, int],\n    buf: Union[bytes, CtypesArray[ctypes.c_char]],\n    buf_size: int,\n    /,\n) -> int:\n    \"\"\"Get metadata value as a string by index\"\"\"\n    ...\n\n\n# // Get a string describing the model type\n# LLAMA_API int32_t llama_model_desc(const struct llama_model * model, char * buf, size_t buf_size);\n@ctypes_function(\n    \"llama_model_desc\",\n    [llama_model_p_ctypes, ctypes.c_char_p, ctypes.c_size_t],\n    ctypes.c_int32,\n)\ndef llama_model_desc(\n    model: llama_model_p,\n    buf: Union[bytes, CtypesArray[ctypes.c_char]],\n    buf_size: Union[ctypes.c_size_t, int],\n    /,\n) -> int:\n    \"\"\"Get a string describing the model 
type\"\"\"\n    ...\n\n\n# // Returns the total size of all the tensors in the model in bytes\n# LLAMA_API uint64_t llama_model_size(const struct llama_model * model);\n@ctypes_function(\"llama_model_size\", [llama_model_p_ctypes], ctypes.c_uint64)\ndef llama_model_size(model: llama_model_p, /) -> int:\n    \"\"\"Returns the total size of all the tensors in the model in bytes\"\"\"\n    ...\n\n\n# // Get the default chat template. Returns nullptr if not available\n# // If name is NULL, returns the default chat template\n# LLAMA_API const char * llama_model_chat_template(const struct llama_model * model, const char * name);\n@ctypes_function(\"llama_model_chat_template\", [llama_model_p_ctypes, ctypes.c_char_p], ctypes.c_char_p)\ndef llama_model_chat_template(model: llama_model_p, name: Optional[bytes], /) -> Optional[bytes]:\n    \"\"\"Get the default chat template. Returns None if not available\n    If name is None, returns the default chat template\"\"\"\n    ...\n\n\n# // Returns the total number of parameters in the model\n# LLAMA_API uint64_t llama_model_n_params(const struct llama_model * model);\n@ctypes_function(\"llama_model_n_params\", [llama_model_p_ctypes], ctypes.c_uint64)\ndef llama_model_n_params(model: llama_model_p, /) -> int:\n    \"\"\"Returns the total number of parameters in the model\"\"\"\n    ...\n\n\n# // Returns true if the model contains an encoder that requires llama_encode() call\n# LLAMA_API bool llama_model_has_encoder(const struct llama_model * model);\n@ctypes_function(\"llama_model_has_encoder\", [llama_model_p_ctypes], ctypes.c_bool)\ndef llama_model_has_encoder(model: llama_model_p, /) -> bool:\n    \"\"\"Returns true if the model contains an encoder that requires llama_encode() call\"\"\"\n    ...\n\n\n# // Returns true if the model contains a decoder that requires llama_decode() call\n# LLAMA_API bool llama_model_has_decoder(const struct llama_model * model);\n@ctypes_function(\"llama_model_has_decoder\", 
[llama_model_p_ctypes], ctypes.c_bool)\ndef llama_model_has_decoder(model: llama_model_p, /) -> bool:\n    \"\"\"Returns true if the model contains a decoder that requires llama_decode() call\"\"\"\n    ...\n\n\n# // For encoder-decoder models, this function returns id of the token that must be provided\n# // to the decoder to start generating output sequence. For other models, it returns -1.\n# LLAMA_API llama_token llama_model_decoder_start_token(const struct llama_model * model);\n@ctypes_function(\n    \"llama_model_decoder_start_token\", [llama_model_p_ctypes], ctypes.c_int32\n)\ndef llama_model_decoder_start_token(model: llama_model_p, /) -> int:\n    \"\"\"For encoder-decoder models, this function returns id of the token that must be provided\n    to the decoder to start generating output sequence. For other models, it returns -1.\n    \"\"\"\n    ...\n\n\n# // Returns true if the model is recurrent (like Mamba, RWKV, etc.)\n# LLAMA_API bool llama_model_is_recurrent(const struct llama_model * model);\n@ctypes_function(\"llama_model_is_recurrent\", [llama_model_p_ctypes], ctypes.c_bool)\ndef llama_model_is_recurrent(model: llama_model_p, /) -> bool:\n    \"\"\"Returns true if the model is recurrent (like Mamba, RWKV, etc.)\"\"\"\n    ...\n\n\n# // Returns true if the model is diffusion-based (like LLaDA, Dream, etc.)\n# LLAMA_API bool llama_model_is_diffusion(const struct llama_model * model);\n@ctypes_function(\"llama_model_is_diffusion\", [llama_model_p_ctypes], ctypes.c_bool)\ndef llama_model_is_diffusion(model: llama_model_p, /) -> bool:\n    \"\"\"Returns true if the model is diffusion-based (like LLaDA, Dream, etc.)\"\"\"\n    ...\n\n\n# // Returns 0 on success\n# LLAMA_API uint32_t llama_model_quantize(\n#         const char * fname_inp,\n#         const char * fname_out,\n#         const llama_model_quantize_params * params);\n@ctypes_function(\n    \"llama_model_quantize\",\n    [\n        ctypes.c_char_p,\n        ctypes.c_char_p,\n        
ctypes.POINTER(llama_model_quantize_params),\n    ],\n    ctypes.c_uint32,\n)\ndef llama_model_quantize(\n    fname_inp: bytes,\n    fname_out: bytes,\n    params: CtypesPointerOrRef[llama_model_quantize_params],\n    /,\n) -> int:\n    \"\"\"Returns 0 on success\"\"\"\n    ...\n\n\n# //\n# // Adapters\n# //\n\n# // Load a LoRA adapter from file\n# LLAMA_API struct llama_adapter_lora * llama_adapter_lora_init(\n#         struct llama_model * model,\n#         const char * path_lora);\n@ctypes_function(\n    \"llama_adapter_lora_init\",\n    [llama_model_p_ctypes, ctypes.c_char_p],\n    llama_adapter_lora_p_ctypes,\n)\ndef llama_adapter_lora_init(\n    model: llama_model_p, path_lora: bytes, /\n) -> Optional[llama_adapter_lora_p]:\n    \"\"\"Load a LoRA adapter from file\"\"\"\n    ...\n\n\n# // Manually free a LoRA adapter\n# // Note: loaded adapters will be freed when the associated model is deleted\n# LLAMA_API void llama_adapter_lora_free(struct llama_adapter_lora * adapter);\n@ctypes_function(\n    \"llama_adapter_lora_free\",\n    [llama_adapter_lora_p_ctypes],\n    None,\n)\ndef llama_adapter_lora_free(adapter: llama_adapter_lora_p, /):\n    \"\"\"Manually free a LoRA adapter\n    Note: loaded adapters will be freed when the associated model is deleted\"\"\"\n    ...\n\n\n# // The following functions operate on a llama_context, hence the naming: llama_verb_...\n\n\n# // Add a loaded LoRA adapter to given context\n# // This will not modify model's weight\n# LLAMA_API int32_t llama_set_adapter_lora(\n#         struct llama_context * ctx,\n#         struct llama_adapter_lora * adapter,\n#         float scale);\n@ctypes_function(\n    \"llama_set_adapter_lora\",\n    [llama_context_p_ctypes, llama_adapter_lora_p_ctypes, ctypes.c_float],\n    ctypes.c_int32,\n)\ndef llama_set_adapter_lora(\n    ctx: llama_context_p, adapter: llama_adapter_lora_p, scale: float, /\n) -> int:\n    \"\"\"Add a loaded LoRA adapter to given context\n    This will not modify model's weight\"\"\"\n    ...\n\n\n# // Remove a specific LoRA adapter from given context\n# // Return -1 if the adapter is not present in the context\n# LLAMA_API int32_t 
llama_rm_adapter_lora(\n#         struct llama_context * ctx,\n#         struct llama_adapter_lora * adapter);\n@ctypes_function(\n    \"llama_rm_adapter_lora\",\n    [llama_context_p_ctypes, llama_adapter_lora_p_ctypes],\n    ctypes.c_int32,\n)\ndef llama_rm_adapter_lora(\n    ctx: llama_context_p, adapter: llama_adapter_lora_p, /\n) -> int:\n    \"\"\"Remove a specific LoRA adapter from given context\n    Return -1 if the adapter is not present in the context\"\"\"\n    ...\n\n\n# // Remove all LoRA adapters from given context\n# LLAMA_API void llama_clear_adapter_lora(struct llama_context * ctx);\n@ctypes_function(\n    \"llama_clear_adapter_lora\",\n    [llama_context_p_ctypes],\n    None,\n)\ndef llama_clear_adapter_lora(ctx: llama_context_p, /):\n    \"\"\"Remove all LoRA adapters from given context\"\"\"\n    ...\n\n\n# // Apply a loaded control vector to a llama_context, or if data is NULL, clear\n# // the currently loaded vector.\n# // n_embd should be the size of a single layer's control, and data should point\n# // to an n_embd x n_layers buffer starting from layer 1.\n# // il_start and il_end are the layer range the vector should apply to (both inclusive)\n# // See llama_control_vector_load in common to load a control vector.\n# LLAMA_API int32_t llama_apply_adapter_cvec(\n#         struct llama_context * ctx,\n#                  const float * data,\n#                       size_t   len,\n#                      int32_t   n_embd,\n#                      int32_t   il_start,\n#                      int32_t   il_end);\n@ctypes_function(\n    \"llama_apply_adapter_cvec\",\n    [\n        llama_context_p_ctypes,\n        ctypes.POINTER(ctypes.c_float),\n        ctypes.c_size_t,\n        ctypes.c_int32,\n        ctypes.c_int32,\n        ctypes.c_int32,\n    ],\n    ctypes.c_int32,\n)\ndef llama_apply_adapter_cvec(\n    ctx: llama_context_p,\n    data: CtypesPointerOrRef[ctypes.c_float],\n    len: int,\n    n_embd: int,\n    il_start: int,\n    il_end: int,\n   
 /,\n) -> int:\n    \"\"\"Apply a loaded control vector to a llama_context, or if data is NULL, clear\n    the currently loaded vector.\n    n_embd should be the size of a single layer's control, and data should point\n    to an n_embd x n_layers buffer starting from layer 1.\n    il_start and il_end are the layer range the vector should apply to (both inclusive)\n    See llama_control_vector_load in common to load a control vector.\"\"\"\n    ...\n\n\n# //\n# // Memory\n# //\n\n# // Clear the memory contents\n# // If data == true, the data buffers will also be cleared together with the metadata\n# LLAMA_API void llama_memory_clear(\n#         llama_memory_t mem,\n#                   bool data);\n@ctypes_function(\n    \"llama_memory_clear\",\n    [llama_memory_t_ctypes, ctypes.c_bool],\n    None,\n)\ndef llama_memory_clear(mem: llama_memory_t, data: bool, /):\n    \"\"\"Clear the memory contents\n    If data == true, the data buffers will also be cleared together with the metadata\"\"\"\n    ...\n\n\n# // Removes all tokens that belong to the specified sequence and have positions in [p0, p1)\n# // Returns false if a partial sequence cannot be removed. Removing a whole sequence never fails\n# // seq_id < 0 : match any sequence\n# // p0 < 0     : [0,  p1]\n# // p1 < 0     : [p0, inf)\n# LLAMA_API bool llama_memory_seq_rm(\n#         llama_memory_t mem,\n#           llama_seq_id seq_id,\n#              llama_pos p0,\n#              llama_pos p1);\n@ctypes_function(\n    \"llama_memory_seq_rm\",\n    [\n        llama_memory_t_ctypes,\n        llama_seq_id,\n        llama_pos,\n        llama_pos,\n    ],\n    ctypes.c_bool,\n)\ndef llama_memory_seq_rm(\n    mem: llama_memory_t,\n    seq_id: Union[llama_seq_id, int],\n    p0: Union[llama_pos, int],\n    p1: Union[llama_pos, int],\n    /,\n) -> bool:\n    \"\"\"Removes all tokens that belong to the specified sequence and have positions in [p0, p1)\n\n    Returns false if a partial sequence cannot be removed. 
Removing a whole sequence never fails\n\n    seq_id < 0 : match any sequence\n    p0 < 0     : [0,  p1]\n    p1 < 0     : [p0, inf)\"\"\"\n    ...\n\n\n# // Copy all tokens that belong to the specified sequence to another sequence\n# // p0 < 0 : [0,  p1]\n# // p1 < 0 : [p0, inf)\n# LLAMA_API void llama_memory_seq_cp(\n#         llama_memory_t mem,\n#           llama_seq_id seq_id_src,\n#           llama_seq_id seq_id_dst,\n#              llama_pos p0,\n#              llama_pos p1);\n@ctypes_function(\n    \"llama_memory_seq_cp\",\n    [\n        llama_memory_t_ctypes,\n        llama_seq_id,\n        llama_seq_id,\n        llama_pos,\n        llama_pos,\n    ],\n    None,\n)\ndef llama_memory_seq_cp(\n    mem: llama_memory_t,\n    seq_id_src: Union[llama_seq_id, int],\n    seq_id_dst: Union[llama_seq_id, int],\n    p0: Union[llama_pos, int],\n    p1: Union[llama_pos, int],\n    /,\n):\n    \"\"\"Copy all tokens that belong to the specified sequence to another sequence\n    p0 < 0 : [0,  p1]\n    p1 < 0 : [p0, inf)\"\"\"\n    ...\n\n\n# // Removes all tokens that do not belong to the specified sequence\n# LLAMA_API void llama_memory_seq_keep(\n#         llama_memory_t mem,\n#           llama_seq_id seq_id);\n@ctypes_function(\n    \"llama_memory_seq_keep\", [llama_memory_t_ctypes, llama_seq_id], None\n)\ndef llama_memory_seq_keep(mem: llama_memory_t, seq_id: Union[llama_seq_id, int], /):\n    \"\"\"Removes all tokens that do not belong to the specified sequence\"\"\"\n    ...\n\n\n# // Adds relative position \"delta\" to all tokens that belong to the specified sequence and have positions in [p0, p1)\n# // p0 < 0 : [0,  p1]\n# // p1 < 0 : [p0, inf)\n# LLAMA_API void llama_memory_seq_add(\n#         llama_memory_t mem,\n#           llama_seq_id seq_id,\n#              llama_pos p0,\n#              llama_pos p1,\n#              llama_pos delta);\n@ctypes_function(\n    \"llama_memory_seq_add\",\n    [\n        llama_memory_t_ctypes,\n        llama_seq_id,\n        
llama_pos,\n        llama_pos,\n        llama_pos,\n    ],\n    None,\n)\ndef llama_memory_seq_add(\n    mem: llama_memory_t,\n    seq_id: Union[llama_seq_id, int],\n    p0: Union[llama_pos, int],\n    p1: Union[llama_pos, int],\n    delta: Union[llama_pos, int],\n    /,\n):\n    \"\"\"Adds relative position \"delta\" to all tokens that belong to the specified sequence and have positions in [p0, p1)\n    p0 < 0 : [0,  p1]\n    p1 < 0 : [p0, inf)\"\"\"\n    ...\n\n\n# // Integer division of the positions by factor of `d > 1`\n# // p0 < 0 : [0,  p1]\n# // p1 < 0 : [p0, inf)\n# LLAMA_API void llama_memory_seq_div(\n#         llama_memory_t mem,\n#           llama_seq_id seq_id,\n#              llama_pos p0,\n#              llama_pos p1,\n#                    int d);\n@ctypes_function(\n    \"llama_memory_seq_div\",\n    [\n        llama_memory_t_ctypes,\n        llama_seq_id,\n        llama_pos,\n        llama_pos,\n        ctypes.c_int,\n    ],\n    None,\n)\ndef llama_memory_seq_div(\n    mem: llama_memory_t,\n    seq_id: Union[llama_seq_id, int],\n    p0: Union[llama_pos, int],\n    p1: Union[llama_pos, int],\n    d: Union[ctypes.c_int, int],\n    /,\n):\n    \"\"\"Integer division of the positions by factor of `d > 1`\n    p0 < 0 : [0,  p1]\n    p1 < 0 : [p0, inf)\"\"\"\n    ...\n\n\n# // Returns the smallest position present in the memory for the specified sequence\n# // This is typically non-zero only for SWA caches\n# // Note that all positions in the range [pos_min, pos_max] are guaranteed to be present in the memory\n# // Return -1 if the sequence is empty\n# LLAMA_API llama_pos llama_memory_seq_pos_min(\n#         llama_memory_t mem,\n#           llama_seq_id seq_id);\n@ctypes_function(\n    \"llama_memory_seq_pos_min\", [llama_memory_t_ctypes, llama_seq_id], llama_pos\n)\ndef llama_memory_seq_pos_min(\n    mem: llama_memory_t, seq_id: Union[llama_seq_id, int], /\n) -> int:\n    \"\"\"Returns the smallest position present in the memory for the specified 
sequence\n    This is typically non-zero only for SWA caches\n    Return -1 if the sequence is empty\"\"\"\n    ...\n\n\n# // Returns the largest position present in the memory for the specified sequence\n# // Note that all positions in the range [pos_min, pos_max] are guaranteed to be present in the memory\n# // Return -1 if the sequence is empty\n# LLAMA_API llama_pos llama_memory_seq_pos_max(\n#         llama_memory_t mem,\n#           llama_seq_id seq_id);\n@ctypes_function(\n    \"llama_memory_seq_pos_max\", [llama_memory_t_ctypes, llama_seq_id], llama_pos\n)\ndef llama_memory_seq_pos_max(\n    mem: llama_memory_t, seq_id: Union[llama_seq_id, int], /\n) -> int:\n    \"\"\"Returns the largest position present in the memory for the specified sequence\n    Return -1 if the sequence is empty\"\"\"\n    ...\n\n\n# // Check if the memory supports shifting\n# LLAMA_API bool llama_memory_can_shift(llama_memory_t mem);\n@ctypes_function(\"llama_memory_can_shift\", [llama_memory_t_ctypes], ctypes.c_bool)\ndef llama_memory_can_shift(mem: llama_memory_t, /) -> bool:\n    \"\"\"Check if the memory supports shifting\"\"\"\n    ...\n\n\n# //\n# // KV cache for self-attention (TODO: deprecate in favor of llama_memory)\n# //\n\n# // Returns the number of tokens in the KV cache (slow, use only for debug)\n# // If a KV cell has multiple sequences assigned to it, it will be counted multiple times\n# DEPRECATED(LLAMA_API int32_t llama_kv_self_n_tokens(const struct llama_context * ctx),\n#            \"Use llama_kv_self_seq_pos_max() and llama_kv_self_seq_pos_min() instead (https://github.com/ggml-org/llama.cpp/issues/13793)\");\n@ctypes_function(\n    \"llama_kv_self_n_tokens\", [llama_context_p_ctypes], ctypes.c_int32\n)\ndef llama_kv_self_n_tokens(ctx: llama_context_p, /) -> int:\n    \"\"\"Returns the number of tokens in the KV cache (slow, use only for debug) (DEPRECATED)\"\"\"\n    ...\n\n\n# // Returns the number of used KV cells (i.e. 
have at least one sequence assigned to them)\n# DEPRECATED(LLAMA_API int32_t llama_kv_self_used_cells(const struct llama_context * ctx),\n#            \"Use llama_kv_self_seq_pos_max() and llama_kv_self_seq_pos_min() instead (https://github.com/ggml-org/llama.cpp/issues/13793)\");\n@ctypes_function(\n    \"llama_kv_self_used_cells\", [llama_context_p_ctypes], ctypes.c_int32\n)\ndef llama_kv_self_used_cells(ctx: llama_context_p, /) -> int:\n    \"\"\"Returns the number of used KV cells (DEPRECATED)\"\"\"\n    ...\n\n\n# // Clear the KV cache - both cell info is erased and KV data is zeroed\n# DEPRECATED(LLAMA_API void llama_kv_self_clear(\n#             struct llama_context * ctx),\n#         \"Use llama_memory_clear() instead\");\n@ctypes_function(\n    \"llama_kv_self_clear\", [llama_context_p_ctypes], None\n)\ndef llama_kv_self_clear(ctx: llama_context_p, /):\n    \"\"\"Clear the KV cache (DEPRECATED)\"\"\"\n    ...\n\n\n# // Removes all tokens that belong to the specified sequence and have positions in [p0, p1)\n# // Returns false if a partial sequence cannot be removed. 
Removing a whole sequence never fails\n# // seq_id < 0 : match any sequence\n# // p0 < 0     : [0,  p1]\n# // p1 < 0     : [p0, inf)\n# DEPRECATED(LLAMA_API bool llama_kv_self_seq_rm(\n#         struct llama_context * ctx,\n#                 llama_seq_id   seq_id,\n#                    llama_pos   p0,\n#                    llama_pos   p1),\n#         \"Use llama_memory_seq_rm() instead\");\n@ctypes_function(\n    \"llama_kv_self_seq_rm\",\n    [\n        llama_context_p_ctypes,\n        llama_seq_id,\n        llama_pos,\n        llama_pos,\n    ],\n    ctypes.c_bool,\n)\ndef llama_kv_self_seq_rm(\n    ctx: llama_context_p,\n    seq_id: Union[llama_seq_id, int],\n    p0: Union[llama_pos, int],\n    p1: Union[llama_pos, int],\n    /,\n) -> bool:\n    \"\"\"Remove tokens from KV cache (DEPRECATED)\"\"\"\n    ...\n\n\n# // Copy all tokens that belong to the specified sequence to another sequence\n# // Note that this does not allocate extra KV cache memory - it simply assigns the tokens to the new sequence\n# // p0 < 0 : [0,  p1]\n# // p1 < 0 : [p0, inf)\n# DEPRECATED(LLAMA_API void llama_kv_self_seq_cp(\n#         struct llama_context * ctx,\n#                 llama_seq_id   seq_id_src,\n#                 llama_seq_id   seq_id_dst,\n#                    llama_pos   p0,\n#                    llama_pos   p1),\n#         \"Use llama_memory_seq_cp() instead\");\n@ctypes_function(\n    \"llama_kv_self_seq_cp\",\n    [\n        llama_context_p_ctypes,\n        llama_seq_id,\n        llama_seq_id,\n        llama_pos,\n        llama_pos,\n    ],\n    None,\n)\ndef llama_kv_self_seq_cp(\n    ctx: llama_context_p,\n    seq_id_src: Union[llama_seq_id, int],\n    seq_id_dst: Union[llama_seq_id, int],\n    p0: Union[llama_pos, int],\n    p1: Union[llama_pos, int],\n    /,\n):\n    \"\"\"Copy tokens in KV cache (DEPRECATED)\"\"\"\n    ...\n\n\n# // Removes all tokens that do not belong to the specified sequence\n# DEPRECATED(LLAMA_API void llama_kv_self_seq_keep(\n#         struct 
llama_context * ctx,\n#                 llama_seq_id   seq_id),\n#         \"Use llama_memory_seq_keep() instead\");\n@ctypes_function(\n    \"llama_kv_self_seq_keep\", [llama_context_p_ctypes, llama_seq_id], None\n)\ndef llama_kv_self_seq_keep(ctx: llama_context_p, seq_id: Union[llama_seq_id, int], /):\n    \"\"\"Keep only specified sequence in KV cache (DEPRECATED)\"\"\"\n    ...\n\n\n# // Adds relative position \"delta\" to all tokens that belong to the specified sequence and have positions in [p0, p1)\n# // If the KV cache is RoPEd, the KV data is updated accordingly:\n# //   - lazily on next llama_decode()\n# // p0 < 0 : [0,  p1]\n# // p1 < 0 : [p0, inf)\n# DEPRECATED(LLAMA_API void llama_kv_self_seq_add(\n#         struct llama_context * ctx,\n#                 llama_seq_id   seq_id,\n#                    llama_pos   p0,\n#                    llama_pos   p1,\n#                    llama_pos   delta),\n#         \"Use llama_memory_seq_add() instead\");\n@ctypes_function(\n    \"llama_kv_self_seq_add\",\n    [\n        llama_context_p_ctypes,\n        llama_seq_id,\n        llama_pos,\n        llama_pos,\n        llama_pos,\n    ],\n    None,\n)\ndef llama_kv_self_seq_add(\n    ctx: llama_context_p,\n    seq_id: Union[llama_seq_id, int],\n    p0: Union[llama_pos, int],\n    p1: Union[llama_pos, int],\n    delta: Union[llama_pos, int],\n    /,\n):\n    \"\"\"Add delta to sequence positions in KV cache (DEPRECATED)\"\"\"\n    ...\n\n\n# // Integer division of the positions by factor of `d > 1`\n# // If the KV cache is RoPEd, the KV data is updated accordingly:\n# //   - lazily on next llama_decode()\n# // p0 < 0 : [0,  p1]\n# // p1 < 0 : [p0, inf)\n# DEPRECATED(LLAMA_API void llama_kv_self_seq_div(\n#         struct llama_context * ctx,\n#                 llama_seq_id   seq_id,\n#                    llama_pos   p0,\n#                    llama_pos   p1,\n#                          int   d),\n#         \"Use llama_memory_seq_div() instead\");\n@ctypes_function(\n    
\"llama_kv_self_seq_div\",\n    [\n        llama_context_p_ctypes,\n        llama_seq_id,\n        llama_pos,\n        llama_pos,\n        ctypes.c_int,\n    ],\n    None,\n)\ndef llama_kv_self_seq_div(\n    ctx: llama_context_p,\n    seq_id: Union[llama_seq_id, int],\n    p0: Union[llama_pos, int],\n    p1: Union[llama_pos, int],\n    d: Union[ctypes.c_int, int],\n    /,\n):\n    \"\"\"Divide sequence positions in KV cache (DEPRECATED)\"\"\"\n    ...\n\n\n# // Returns the smallest position present in the KV cache for the specified sequence\n# // This is typically non-zero only for SWA caches\n# // Note that all positions in the range [pos_min, pos_max] are guaranteed to be present in the KV cache\n# // Return -1 if the sequence is empty\n# DEPRECATED(LLAMA_API llama_pos llama_kv_self_seq_pos_min(\n#         struct llama_context * ctx,\n#                 llama_seq_id   seq_id),\n#         \"Use llama_memory_seq_pos_min() instead\");\n@ctypes_function(\n    \"llama_kv_self_seq_pos_min\", [llama_context_p_ctypes, llama_seq_id], llama_pos\n)\ndef llama_kv_self_seq_pos_min(\n    ctx: llama_context_p, seq_id: Union[llama_seq_id, int], /\n) -> int:\n    \"\"\"Returns the smallest position in KV cache for sequence (DEPRECATED)\"\"\"\n    ...\n\n\n# // Returns the largest position present in the KV cache for the specified sequence\n# // Note that all positions in the range [pos_min, pos_max] are guaranteed to be present in the KV cache\n# // Return -1 if the sequence is empty\n# DEPRECATED(LLAMA_API llama_pos llama_kv_self_seq_pos_max(\n#         struct llama_context * ctx,\n#                 llama_seq_id   seq_id),\n#         \"Use llama_memory_seq_pos_max() instead\");\n@ctypes_function(\n    \"llama_kv_self_seq_pos_max\", [llama_context_p_ctypes, llama_seq_id], llama_pos\n)\ndef llama_kv_self_seq_pos_max(\n    ctx: llama_context_p, seq_id: Union[llama_seq_id, int], /\n) -> int:\n    \"\"\"Returns the largest position in KV cache for sequence (DEPRECATED)\"\"\"\n    
...\n\n\n# // Defragment the KV cache\n# // This will be applied:\n# //   - lazily on next llama_decode()\n# DEPRECATED(LLAMA_API void llama_kv_self_defrag(struct llama_context * ctx),\n#         \"simply remove this call, the context will automatically decide when to do a defragmentation based on 'defrag_thold'\");\n@ctypes_function(\"llama_kv_self_defrag\", [llama_context_p_ctypes], None)\ndef llama_kv_self_defrag(ctx: llama_context_p, /):\n    \"\"\"Defragment the KV cache (DEPRECATED)\"\"\"\n    ...\n\n\n# // Check if the context supports KV cache shifting\n# DEPRECATED(LLAMA_API bool llama_kv_self_can_shift(const struct llama_context * ctx),\n#         \"use llama_memory_can_shift() instead\");\n@ctypes_function(\"llama_kv_self_can_shift\", [llama_context_p_ctypes], ctypes.c_bool)\ndef llama_kv_self_can_shift(ctx: llama_context_p, /) -> bool:\n    \"\"\"Check if the context supports KV cache shifting (DEPRECATED)\"\"\"\n    ...\n\n\n# // Apply the KV cache updates (such as K-shifts, defragmentation, etc.)\n# DEPRECATED(LLAMA_API void llama_kv_self_update(struct llama_context * ctx),\n#         \"simply remove this call, updates are applied lazily on the next llama_decode()\");\n@ctypes_function(\"llama_kv_self_update\", [llama_context_p_ctypes], None)\ndef llama_kv_self_update(ctx: llama_context_p, /):\n    \"\"\"Apply the KV cache updates (DEPRECATED)\"\"\"\n    ...\n\n\n# //\n# // State / sessions\n# //\n\n# // Returns the *actual* size in bytes of the state\n# // (logits, embedding and memory)\n# // Only use when saving the state, not when restoring it, otherwise the size may be too small.\n# LLAMA_API size_t llama_state_get_size(struct llama_context * ctx);\n@ctypes_function(\"llama_state_get_size\", [llama_context_p_ctypes], ctypes.c_size_t)\ndef llama_state_get_size(ctx: llama_context_p, /) -> int:\n    \"\"\"Returns the *actual* size in bytes of the state (logits, embedding and memory)\"\"\"\n    ...\n\n\n# LLAMA_API DEPRECATED(size_t 
llama_get_state_size(struct llama_context * ctx),\n#     \"use llama_state_get_size instead\");\n@ctypes_function(\"llama_get_state_size\", [llama_context_p_ctypes], ctypes.c_size_t)\ndef llama_get_state_size(ctx: llama_context_p, /) -> int:\n    \"\"\"Returns the size in bytes of the state (DEPRECATED)\"\"\"\n    ...\n\n\n# // Copies the state to the specified destination address.\n# // Destination needs to have allocated enough memory.\n# // Returns the number of bytes copied\n# LLAMA_API size_t llama_state_get_data(\n#         struct llama_context * ctx,\n#                      uint8_t * dst,\n#                       size_t   size);\n@ctypes_function(\n    \"llama_state_get_data\",\n    [\n        llama_context_p_ctypes,\n        ctypes.POINTER(ctypes.c_uint8),\n        ctypes.c_size_t,\n    ],\n    ctypes.c_size_t,\n)\ndef llama_state_get_data(\n    ctx: llama_context_p,\n    dst: CtypesArray[ctypes.c_uint8],\n    size: Union[ctypes.c_size_t, int],\n    /,\n) -> int:\n    \"\"\"Copies the state to the specified destination address.\n    Destination needs to have allocated enough memory.\n    Returns the number of bytes copied\"\"\"\n    ...\n\n\n# LLAMA_API DEPRECATED(size_t llama_copy_state_data(\n#         struct llama_context * ctx,\n#                      uint8_t * dst),\n#     \"use llama_state_get_data instead\");\n@ctypes_function(\n    \"llama_copy_state_data\",\n    [\n        llama_context_p_ctypes,\n        ctypes.POINTER(ctypes.c_uint8),\n    ],\n    ctypes.c_size_t,\n)\ndef llama_copy_state_data(\n    ctx: llama_context_p, dst: CtypesArray[ctypes.c_uint8], /\n) -> int:\n    \"\"\"Copies the state to the specified destination address (DEPRECATED)\"\"\"\n    ...\n\n\n# // Set the state reading from the specified address\n# // Returns the number of bytes read\n# LLAMA_API size_t llama_state_set_data(\n#         struct llama_context * ctx,\n#                const uint8_t * src,\n#                       size_t   size);\n@ctypes_function(\n    
\"llama_state_set_data\",\n    [llama_context_p_ctypes, ctypes.POINTER(ctypes.c_uint8), ctypes.c_size_t],\n    ctypes.c_size_t,\n)\ndef llama_state_set_data(\n    ctx: llama_context_p,\n    src: CtypesArray[ctypes.c_uint8],\n    size: Union[ctypes.c_size_t, int],\n    /,\n) -> int:\n    \"\"\"Set the state reading from the specified address\n    Returns the number of bytes read\"\"\"\n    ...\n\n\n# LLAMA_API DEPRECATED(size_t llama_set_state_data(\n#         struct llama_context * ctx,\n#                const uint8_t * src),\n#     \"use llama_state_set_data instead\");\n@ctypes_function(\n    \"llama_set_state_data\",\n    [llama_context_p_ctypes, ctypes.POINTER(ctypes.c_uint8)],\n    ctypes.c_size_t,\n)\ndef llama_set_state_data(\n    ctx: llama_context_p, src: CtypesArray[ctypes.c_uint8], /\n) -> int:\n    \"\"\"Set the state reading from the specified address (DEPRECATED)\"\"\"\n    ...\n\n\n# Save/load session file\n# LLAMA_API bool llama_state_load_file(\n#         struct llama_context * ctx,\n#                   const char * path_session,\n#                  llama_token * tokens_out,\n#                       size_t   n_token_capacity,\n#                       size_t * n_token_count_out);\n@ctypes_function(\n    \"llama_state_load_file\",\n    [\n        llama_context_p_ctypes,\n        ctypes.c_char_p,\n        llama_token_p,\n        ctypes.c_size_t,\n        ctypes.POINTER(ctypes.c_size_t),\n    ],\n    ctypes.c_bool,\n)\ndef llama_state_load_file(\n    ctx: llama_context_p,\n    path_session: bytes,\n    tokens_out: CtypesArray[llama_token],\n    n_token_capacity: Union[ctypes.c_size_t, int],\n    n_token_count_out: CtypesPointerOrRef[ctypes.c_size_t],\n    /,\n) -> bool:\n    ...\n\n\n# LLAMA_API DEPRECATED(bool llama_load_session_file(\n#         struct llama_context * ctx,\n#                   const char * path_session,\n#                  llama_token * tokens_out,\n#                       size_t   n_token_capacity,\n#                       size_t * 
n_token_count_out),\n#     \"use llama_state_load_file instead\");\n@ctypes_function(\n    \"llama_load_session_file\",\n    [\n        llama_context_p_ctypes,\n        ctypes.c_char_p,\n        llama_token_p,\n        ctypes.c_size_t,\n        ctypes.POINTER(ctypes.c_size_t),\n    ],\n    ctypes.c_bool,\n)\ndef llama_load_session_file(\n    ctx: llama_context_p,\n    path_session: bytes,\n    tokens_out: CtypesArray[llama_token],\n    n_token_capacity: Union[ctypes.c_size_t, int],\n    n_token_count_out: CtypesPointerOrRef[ctypes.c_size_t],\n    /,\n) -> bool:\n    ...\n\n\n# LLAMA_API bool llama_state_save_file(\n#         struct llama_context * ctx,\n#                   const char * path_session,\n#            const llama_token * tokens,\n#                       size_t   n_token_count);\n@ctypes_function(\n    \"llama_state_save_file\",\n    [\n        llama_context_p_ctypes,\n        ctypes.c_char_p,\n        llama_token_p,\n        ctypes.c_size_t,\n    ],\n    ctypes.c_bool,\n)\ndef llama_state_save_file(\n    ctx: llama_context_p,\n    path_session: bytes,\n    tokens: CtypesArray[llama_token],\n    n_token_count: Union[ctypes.c_size_t, int],\n    /,\n) -> bool:\n    ...\n\n\n# LLAMA_API DEPRECATED(bool llama_save_session_file(\n#         struct llama_context * ctx,\n#                   const char * path_session,\n#            const llama_token * tokens,\n#                       size_t   n_token_count),\n#     \"use llama_state_save_file instead\");\n@ctypes_function(\n    \"llama_save_session_file\",\n    [\n        llama_context_p_ctypes,\n        ctypes.c_char_p,\n        llama_token_p,\n        ctypes.c_size_t,\n    ],\n    ctypes.c_bool,\n)\ndef llama_save_session_file(\n    ctx: llama_context_p,\n    path_session: bytes,\n    tokens: CtypesArray[llama_token],\n    n_token_count: Union[ctypes.c_size_t, int],\n    /,\n) -> bool:\n    ...\n\n\n# // Get the exact size needed to copy the state of a single sequence\n# LLAMA_API size_t 
llama_state_seq_get_size(\n#         struct llama_context * ctx,\n#                 llama_seq_id   seq_id);\n@ctypes_function(\n    \"llama_state_seq_get_size\",\n    [llama_context_p_ctypes, llama_seq_id],\n    ctypes.c_size_t,\n)\ndef llama_state_seq_get_size(ctx: llama_context_p, seq_id: llama_seq_id, /) -> int:\n    \"\"\"Get the exact size needed to copy the state of a single sequence\"\"\"\n    ...\n\n\n# // Copy the state of a single sequence into the specified buffer\n# LLAMA_API size_t llama_state_seq_get_data(\n#         struct llama_context * ctx,\n#                      uint8_t * dst,\n#                       size_t   size,\n#                 llama_seq_id   seq_id);\n@ctypes_function(\n    \"llama_state_seq_get_data\",\n    [\n        llama_context_p_ctypes,\n        ctypes.POINTER(ctypes.c_uint8),\n        ctypes.c_size_t,\n        llama_seq_id,\n    ],\n    ctypes.c_size_t,\n)\ndef llama_state_seq_get_data(\n    ctx: llama_context_p,\n    dst: CtypesArray[ctypes.c_uint8],\n    size: Union[ctypes.c_size_t, int],\n    seq_id: llama_seq_id,\n    /,\n) -> int:\n    \"\"\"Copy the state of a single sequence into the specified buffer\"\"\"\n    ...\n\n\n# // Copy the sequence data (originally copied with `llama_state_seq_get_data`) into the specified sequence\n# // Returns:\n# //  - Positive: Ok\n# //  - Zero: Failed to load\n# LLAMA_API size_t llama_state_seq_set_data(\n#         struct llama_context * ctx,\n#                const uint8_t * src,\n#                       size_t   size,\n#                 llama_seq_id   dest_seq_id);\n@ctypes_function(\n    \"llama_state_seq_set_data\",\n    [\n        llama_context_p_ctypes,\n        ctypes.POINTER(ctypes.c_uint8),\n        ctypes.c_size_t,\n        llama_seq_id,\n    ],\n    ctypes.c_size_t,\n)\ndef llama_state_seq_set_data(\n    ctx: llama_context_p,\n    src: CtypesArray[ctypes.c_uint8],\n    size: Union[ctypes.c_size_t, int],\n    dest_seq_id: llama_seq_id,\n    /,\n) -> int:\n    \"\"\"Copy the 
sequence data into the specified sequence\"\"\"\n    ...\n\n\n# LLAMA_API size_t llama_state_seq_save_file(\n#         struct llama_context * ctx,\n#                   const char * filepath,\n#                 llama_seq_id   seq_id,\n#            const llama_token * tokens,\n#                       size_t   n_token_count);\n@ctypes_function(\n    \"llama_state_seq_save_file\",\n    [\n        llama_context_p_ctypes,\n        ctypes.c_char_p,\n        llama_seq_id,\n        llama_token_p,\n        ctypes.c_size_t,\n    ],\n    ctypes.c_size_t,\n)\ndef llama_state_seq_save_file(\n    ctx: llama_context_p,\n    filepath: bytes,\n    seq_id: llama_seq_id,\n    tokens: CtypesArray[llama_token],\n    n_token_count: Union[ctypes.c_size_t, int],\n    /,\n) -> int:\n    ...\n\n\n# LLAMA_API size_t llama_state_seq_load_file(\n#         struct llama_context * ctx,\n#                   const char * filepath,\n#                 llama_seq_id   dest_seq_id,\n#                  llama_token * tokens_out,\n#                       size_t   n_token_capacity,\n#                       size_t * n_token_count_out);\n@ctypes_function(\n    \"llama_state_seq_load_file\",\n    [\n        llama_context_p_ctypes,\n        ctypes.c_char_p,\n        llama_seq_id,\n        llama_token_p,\n        ctypes.c_size_t,\n        ctypes.POINTER(ctypes.c_size_t),\n    ],\n    ctypes.c_size_t,\n)\ndef llama_state_seq_load_file(\n    ctx: llama_context_p,\n    filepath: bytes,\n    dest_seq_id: llama_seq_id,\n    tokens_out: CtypesArray[llama_token],\n    n_token_capacity: Union[ctypes.c_size_t, int],\n    n_token_count_out: CtypesPointerOrRef[ctypes.c_size_t],\n    /,\n) -> int:\n    ...\n\n\n# //\n# // Decoding\n# //\n\n# // Return batch for single sequence of tokens\n# // The sequence ID will be fixed to 0\n# // The position of the tokens will be tracked automatically by llama_decode\n# //\n# // NOTE: this is a helper function to facilitate transition to the new batch API - avoid using it\n# //\n# 
LLAMA_API struct llama_batch llama_batch_get_one(\n#               llama_token * tokens,\n#                   int32_t   n_tokens);\n@ctypes_function(\n    \"llama_batch_get_one\",\n    [\n        llama_token_p,\n        ctypes.c_int32,\n    ],\n    llama_batch,\n)\ndef llama_batch_get_one(\n    tokens: CtypesArray[llama_token],\n    n_tokens: Union[ctypes.c_int, int],\n    /,\n) -> llama_batch:\n    \"\"\"Return batch for single sequence of tokens\n\n    NOTE: this is a helper function to facilitate transition to the new batch API - avoid using it\n    \"\"\"\n    ...\n\n\n# // Allocates a batch of tokens on the heap that can hold a maximum of n_tokens\n# // Each token can be assigned up to n_seq_max sequence ids\n# // The batch has to be freed with llama_batch_free()\n# // If embd != 0, llama_batch.embd will be allocated with size of n_tokens * embd * sizeof(float)\n# // Otherwise, llama_batch.token will be allocated to store n_tokens llama_token\n# // The rest of the llama_batch members are allocated with size n_tokens\n# // All members are left uninitialized\n# LLAMA_API struct llama_batch llama_batch_init(\n#         int32_t n_tokens,\n#         int32_t embd,\n#         int32_t n_seq_max);\n@ctypes_function(\n    \"llama_batch_init\", [ctypes.c_int32, ctypes.c_int32, ctypes.c_int32], llama_batch\n)\ndef llama_batch_init(\n    n_tokens: Union[ctypes.c_int32, int],\n    embd: Union[ctypes.c_int32, int],\n    n_seq_max: Union[ctypes.c_int32, int],\n    /,\n) -> llama_batch:\n    \"\"\"Allocates a batch of tokens on the heap that can hold a maximum of n_tokens\n    Each token can be assigned up to n_seq_max sequence ids\n    The batch has to be freed with llama_batch_free()\n    If embd != 0, llama_batch.embd will be allocated with size of n_tokens * embd * sizeof(float)\n    Otherwise, llama_batch.token will be allocated to store n_tokens llama_token\n    The rest of the llama_batch members are allocated with size n_tokens\n    All members are left 
uninitialized\"\"\"\n    ...\n\n\n# // Frees a batch of tokens allocated with llama_batch_init()\n# LLAMA_API void llama_batch_free(struct llama_batch batch);\n@ctypes_function(\"llama_batch_free\", [llama_batch], None)\ndef llama_batch_free(batch: llama_batch, /):\n    \"\"\"Frees a batch of tokens allocated with llama_batch_init()\"\"\"\n    ...\n\n\n# // Process a batch of tokens.\n# // In contrast to llama_decode() - this call does not use KV cache.\n# // For encoder-decoder contexts, processes the batch using the encoder.\n# // Can store the encoder output internally for later use by the decoder's cross-attention layers.\n# //   0 - success\n# // < 0 - error. the memory state is restored to the state before this call\n# LLAMA_API int32_t llama_encode(\n#         struct llama_context * ctx,\n#           struct llama_batch   batch);\n@ctypes_function(\"llama_encode\", [llama_context_p_ctypes, llama_batch], ctypes.c_int32)\ndef llama_encode(ctx: llama_context_p, batch: llama_batch, /) -> int:\n    \"\"\"Process a batch of tokens using the encoder.\n    0 - success\n    < 0 - error\"\"\"\n    ...\n\n\n# // Process a batch of tokens.\n# // Requires the context to have a memory.\n# // For encoder-decoder contexts, processes the batch using the decoder.\n# // A positive return value does not mean a fatal error, but rather a warning.\n# // Upon fatal-error or abort, the ubatches that were already processed will remain in the memory state of the context\n# //   To handle this correctly, query the memory state using llama_memory_seq_pos_min() and llama_memory_seq_pos_max()\n# // Upon other return values, the memory state is restored to the state before this call\n# //    0 - success\n# //    1 - could not find a KV slot for the batch (try reducing the size of the batch or increase the context)\n# //    2 - aborted     (processed ubatches will remain in the context's memory)\n# //   -1 - invalid input batch\n# // < -1 - fatal error (processed ubatches will remain in 
the context's memory)\n# LLAMA_API int32_t llama_decode(\n#         struct llama_context * ctx,\n#           struct llama_batch   batch);\n@ctypes_function(\"llama_decode\", [llama_context_p_ctypes, llama_batch], ctypes.c_int32)\ndef llama_decode(ctx: llama_context_p, batch: llama_batch, /) -> int:\n    \"\"\"Process a batch of tokens.\n    0 - success\n    1 - could not find a KV slot for the batch (try reducing the size of the batch or increase the context)\n    2 - aborted (processed ubatches will remain in the context's memory)\n    -1 - invalid input batch\n    < -1 - fatal error (processed ubatches will remain in the context's memory)\"\"\"\n    ...\n\n\n# // Set the number of threads used for decoding\n# // n_threads is the number of threads used for generation (single token)\n# // n_threads_batch is the number of threads used for prompt and batch processing (multiple tokens)\n# LLAMA_API void llama_set_n_threads(struct llama_context * ctx, int32_t n_threads, int32_t n_threads_batch);\n@ctypes_function(\n    \"llama_set_n_threads\",\n    [\n        llama_context_p_ctypes,\n        ctypes.c_int32,\n        ctypes.c_int32,\n    ],\n    None,\n)\ndef llama_set_n_threads(\n    ctx: llama_context_p,\n    n_threads: Union[ctypes.c_int32, int],\n    n_threads_batch: Union[ctypes.c_int32, int],\n    /,\n):\n    \"\"\"Set the number of threads used for decoding\n    n_threads is the number of threads used for generation (single token)\n    n_threads_batch is the number of threads used for prompt and batch processing (multiple tokens)\n    \"\"\"\n    ...\n\n\n# // Get the number of threads used for generation of a single token.\n# LLAMA_API int32_t llama_n_threads(struct llama_context * ctx);\n@ctypes_function(\"llama_n_threads\", [llama_context_p_ctypes], ctypes.c_int32)\ndef llama_n_threads(ctx: llama_context_p, /) -> int:\n    \"\"\"Get the number of threads used for generation of a single token\"\"\"\n    ...\n\n\n# // Get the number of threads used for prompt 
and batch processing (multiple tokens).\n# LLAMA_API int32_t llama_n_threads_batch(struct llama_context * ctx);\n@ctypes_function(\"llama_n_threads_batch\", [llama_context_p_ctypes], ctypes.c_int32)\ndef llama_n_threads_batch(ctx: llama_context_p, /) -> int:\n    \"\"\"Get the number of threads used for prompt and batch processing (multiple tokens)\"\"\"\n    ...\n\n\n# // Set whether the context outputs embeddings or not\n# // TODO: rename to avoid confusion with llama_get_embeddings()\n# LLAMA_API void llama_set_embeddings(struct llama_context * ctx, bool embeddings);\n@ctypes_function(\"llama_set_embeddings\", [llama_context_p_ctypes, ctypes.c_bool], None)\ndef llama_set_embeddings(ctx: llama_context_p, embeddings: bool, /):\n    \"\"\"Set whether the context outputs embeddings or not\"\"\"\n    ...\n\n\n# // Set whether to use causal attention or not\n# // If set to true, the model will only attend to the past tokens\n# LLAMA_API void llama_set_causal_attn(struct llama_context * ctx, bool causal_attn);\n@ctypes_function(\"llama_set_causal_attn\", [llama_context_p_ctypes, ctypes.c_bool], None)\ndef llama_set_causal_attn(ctx: llama_context_p, causal_attn: bool, /):\n    \"\"\"Set whether to use causal attention or not\n    If set to true, the model will only attend to the past tokens\"\"\"\n    ...\n\n\n# // Set whether the model is in warmup mode or not\n# // If true, all model tensors are activated during llama_decode() to load and cache their weights.\n# LLAMA_API void llama_set_warmup(struct llama_context * ctx, bool warmup);\n@ctypes_function(\"llama_set_warmup\", [llama_context_p_ctypes, ctypes.c_bool], None)\ndef llama_set_warmup(ctx: llama_context_p, warmup: bool, /):\n    \"\"\"Set whether the model is in warmup mode or not\n    If true, all model tensors are activated during llama_decode() to load and cache their weights.\"\"\"\n    ...\n\n\n# // Set abort callback\n# LLAMA_API void llama_set_abort_callback(struct llama_context * ctx, ggml_abort_callback 
abort_callback, void * abort_callback_data);\n@ctypes_function(\n    \"llama_set_abort_callback\",\n    [llama_context_p_ctypes, ggml_abort_callback, ctypes.c_void_p],\n    None,\n)\ndef llama_set_abort_callback(\n    ctx: llama_context_p,\n    abort_callback: Callable[[ctypes.c_void_p], None],\n    abort_callback_data: ctypes.c_void_p,\n    /,\n):\n    \"\"\"Set abort callback\"\"\"\n    ...\n\n\n# // Wait until all computations are finished\n# // This is automatically done when using one of the functions below to obtain the computation results\n# // and is not necessary to call it explicitly in most cases\n# LLAMA_API void llama_synchronize(struct llama_context * ctx);\n@ctypes_function(\"llama_synchronize\", [llama_context_p_ctypes], None)\ndef llama_synchronize(ctx: llama_context_p, /):\n    \"\"\"Wait until all computations are finished\n    This is automatically done when using one of the functions below to obtain the computation results\n    and is not necessary to call it explicitly in most cases\"\"\"\n    ...\n\n\n# // Token logits obtained from the last call to llama_decode()\n# // The logits for which llama_batch.logits[i] != 0 are stored contiguously\n# // in the order they have appeared in the batch.\n# // Rows: number of tokens for which llama_batch.logits[i] != 0\n# // Cols: n_vocab\n# // TODO: deprecate in favor of llama_get_logits_ith() (ref: https://github.com/ggml-org/llama.cpp/pull/14853#issuecomment-3113143522)\n# LLAMA_API float * llama_get_logits(struct llama_context * ctx);\n@ctypes_function(\n    \"llama_get_logits\", [llama_context_p_ctypes], ctypes.POINTER(ctypes.c_float)\n)\ndef llama_get_logits(ctx: llama_context_p, /) -> CtypesArray[ctypes.c_float]:\n    \"\"\"Token logits obtained from the last call to llama_decode()\n    The logits for which llama_batch.logits[i] != 0 are stored contiguously\n    in the order they have appeared in the batch.\n    Rows: number of tokens for which llama_batch.logits[i] != 0\n    Cols: n_vocab\n\n    
Returns:\n        Pointer to the logits buffer of shape (n_tokens, n_vocab)\"\"\"\n    ...\n\n\n# // Logits for the ith token. For positive indices, equivalent to:\n# // llama_get_logits(ctx) + ctx->output_ids[i]*n_vocab\n# // Negative indices can be used to access logits in reverse order, -1 is the last logit.\n# // returns NULL for invalid ids.\n# LLAMA_API float * llama_get_logits_ith(struct llama_context * ctx, int32_t i);\n@ctypes_function(\n    \"llama_get_logits_ith\",\n    [llama_context_p_ctypes, ctypes.c_int32],\n    ctypes.POINTER(ctypes.c_float),\n)\ndef llama_get_logits_ith(\n    ctx: llama_context_p, i: Union[ctypes.c_int32, int], /\n) -> CtypesArray[ctypes.c_float]:\n    \"\"\"Logits for the ith token. For positive indices, equivalent to:\n    llama_get_logits(ctx) + ctx->output_ids[i]*n_vocab\n    Negative indices can be used to access logits in reverse order, -1 is the last logit.\n    Returns NULL for invalid ids.\"\"\"\n    ...\n\n\n# // Get all output token embeddings.\n# // when pooling_type == LLAMA_POOLING_TYPE_NONE or when using a generative model,\n# // the embeddings for which llama_batch.logits[i] != 0 are stored contiguously\n# // in the order they have appeared in the batch.\n# // shape: [n_outputs*n_embd]\n# // Otherwise, returns NULL.\n# // TODO: deprecate in favor of llama_get_embeddings_ith() (ref: https://github.com/ggml-org/llama.cpp/pull/14853#issuecomment-3113143522)\n# LLAMA_API float * llama_get_embeddings(struct llama_context * ctx);\n@ctypes_function(\n    \"llama_get_embeddings\", [llama_context_p_ctypes], ctypes.POINTER(ctypes.c_float)\n)\ndef llama_get_embeddings(ctx: llama_context_p, /) -> CtypesArray[ctypes.c_float]:\n    \"\"\"Get all output token embeddings.\n    When pooling_type == LLAMA_POOLING_TYPE_NONE or when using a generative model,\n    the embeddings for which llama_batch.logits[i] != 0 are stored contiguously\n    in the order they have appeared in the batch.\n    shape: [n_outputs*n_embd]\n    Otherwise, returns NULL.\"\"\"\n    ...\n\n\n# // Get the embeddings for the ith token. 
For positive indices, equivalent to:\n# // llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd\n# // Negative indices can be used to access embeddings in reverse order, -1 is the last embedding.\n# // shape: [n_embd] (1-dimensional)\n# // returns NULL for invalid ids.\n# LLAMA_API float * llama_get_embeddings_ith(struct llama_context * ctx, int32_t i);\n@ctypes_function(\n    \"llama_get_embeddings_ith\",\n    [llama_context_p_ctypes, ctypes.c_int32],\n    ctypes.POINTER(ctypes.c_float),\n)\ndef llama_get_embeddings_ith(\n    ctx: llama_context_p, i: Union[ctypes.c_int32, int], /\n) -> CtypesArray[ctypes.c_float]:\n    \"\"\"Get the embeddings for the ith token. For positive indices, equivalent to:\n    llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd\n    Negative indices can be used to access embeddings in reverse order, -1 is the last embedding.\n    shape: [n_embd] (1-dimensional)\n    Returns NULL for invalid ids.\"\"\"\n    ...\n\n\n# // Get the embeddings for a sequence id\n# // Returns NULL if pooling_type is LLAMA_POOLING_TYPE_NONE\n# // when pooling_type == LLAMA_POOLING_TYPE_RANK, returns float[n_cls_out] with the rank(s) of the sequence\n# // otherwise: float[n_embd] (1-dimensional)\n# LLAMA_API float * llama_get_embeddings_seq(struct llama_context * ctx, llama_seq_id seq_id);\n@ctypes_function(\n    \"llama_get_embeddings_seq\",\n    [llama_context_p_ctypes, llama_seq_id],\n    ctypes.POINTER(ctypes.c_float),\n)\ndef llama_get_embeddings_seq(\n    ctx: llama_context_p, seq_id: Union[llama_seq_id, int], /\n) -> CtypesArray[ctypes.c_float]:\n    \"\"\"Get the embeddings for a sequence id\n    Returns NULL if pooling_type is LLAMA_POOLING_TYPE_NONE\n    shape: [n_embd] (1-dimensional)\"\"\"\n    ...\n\n\n# //\n# // Vocab\n# //\n\n# LLAMA_API const char * llama_vocab_get_text(const struct llama_vocab * vocab, llama_token token);\n@ctypes_function(\n    \"llama_vocab_get_text\", [llama_vocab_p_ctypes, llama_token], ctypes.c_char_p\n)\ndef llama_vocab_get_text(\n    vocab: llama_vocab_p, token: Union[llama_token, int], /\n) -> bytes:\n    ...\n\n\n# LLAMA_API float llama_vocab_get_score(const struct llama_vocab * vocab, llama_token token);\n@ctypes_function(\n    
\"llama_vocab_get_score\", [llama_vocab_p_ctypes, llama_token], ctypes.c_float\n)\ndef llama_vocab_get_score(\n    vocab: llama_vocab_p, token: Union[llama_token, int], /\n) -> float:\n    ...\n\n\n# LLAMA_API enum llama_token_attr llama_vocab_get_attr(const struct llama_vocab * vocab, llama_token token);\n@ctypes_function(\n    \"llama_vocab_get_attr\", [llama_vocab_p_ctypes, llama_token], ctypes.c_int\n)\ndef llama_vocab_get_attr(\n    vocab: llama_vocab_p, token: Union[llama_token, int], /\n) -> int:\n    ...\n\n\n# // Check if the token is supposed to end generation (end-of-generation, eg. EOS, EOT, etc.)\n# LLAMA_API bool llama_vocab_is_eog(const struct llama_vocab * vocab, llama_token token);\n@ctypes_function(\n    \"llama_vocab_is_eog\", [llama_vocab_p_ctypes, llama_token], ctypes.c_bool\n)\ndef llama_vocab_is_eog(vocab: llama_vocab_p, token: Union[llama_token, int], /) -> bool:\n    \"\"\"Check if the token is supposed to end generation (end-of-generation, eg. EOS, EOT, etc.)\"\"\"\n    ...\n\n\n# // Identify if Token Id is a control token or a render-able token\n# LLAMA_API bool llama_vocab_is_control(const struct llama_vocab * vocab, llama_token token);\n@ctypes_function(\n    \"llama_vocab_is_control\", [llama_vocab_p_ctypes, llama_token], ctypes.c_bool\n)\ndef llama_vocab_is_control(\n    vocab: llama_vocab_p, token: Union[llama_token, int], /\n) -> bool:\n    \"\"\"Identify if Token Id is a control token or a render-able token\"\"\"\n    ...\n\n\n# // Special tokens\n# LLAMA_API llama_token llama_vocab_bos(const struct llama_vocab * vocab); // beginning-of-sentence\n@ctypes_function(\"llama_vocab_bos\", [llama_vocab_p_ctypes], llama_token)\ndef llama_vocab_bos(vocab: llama_vocab_p, /) -> llama_token:\n    \"\"\"beginning-of-sentence\"\"\"\n    ...\n\n\n# LLAMA_API llama_token llama_vocab_eos(const struct llama_vocab * vocab); // end-of-sentence\n@ctypes_function(\"llama_vocab_eos\", [llama_vocab_p_ctypes], llama_token)\ndef llama_vocab_eos(vocab: 
llama_vocab_p, /) -> llama_token:\n    \"\"\"end-of-sentence\"\"\"\n    ...\n\n\n# LLAMA_API llama_token llama_vocab_eot(const struct llama_vocab * vocab); // end-of-turn\n@ctypes_function(\"llama_vocab_eot\", [llama_vocab_p_ctypes], llama_token)\ndef llama_vocab_eot(vocab: llama_vocab_p, /) -> llama_token:\n    \"\"\"end-of-turn\"\"\"\n    ...\n\n\n# LLAMA_API llama_token llama_vocab_sep(const struct llama_vocab * vocab); // sentence separator\n@ctypes_function(\"llama_vocab_sep\", [llama_vocab_p_ctypes], llama_token)\ndef llama_vocab_sep(vocab: llama_vocab_p, /) -> llama_token:\n    \"\"\"sentence separator\"\"\"\n    ...\n\n\n# LLAMA_API llama_token llama_vocab_nl (const struct llama_vocab * vocab); // next-line\n@ctypes_function(\"llama_vocab_nl\", [llama_vocab_p_ctypes], llama_token)\ndef llama_vocab_nl(vocab: llama_vocab_p, /) -> llama_token:\n    \"\"\"next-line\"\"\"\n    ...\n\n\n# LLAMA_API llama_token llama_vocab_pad(const struct llama_vocab * vocab); // padding\n@ctypes_function(\"llama_vocab_pad\", [llama_vocab_p_ctypes], llama_token)\ndef llama_vocab_pad(vocab: llama_vocab_p, /) -> llama_token:\n    \"\"\"padding\"\"\"\n    ...\n\n\n# LLAMA_API llama_token llama_vocab_mask(const struct llama_vocab * vocab); // mask\n@ctypes_function(\"llama_vocab_mask\", [llama_vocab_p_ctypes], llama_token)\ndef llama_vocab_mask(vocab: llama_vocab_p, /) -> llama_token:\n    \"\"\"mask\"\"\"\n    ...\n\n\n# LLAMA_API bool llama_vocab_get_add_bos(const struct llama_vocab * vocab);\n@ctypes_function(\n    \"llama_vocab_get_add_bos\",\n    [llama_vocab_p_ctypes],\n    ctypes.c_bool,\n)\ndef llama_vocab_get_add_bos(vocab: llama_vocab_p, /) -> bool:\n    ...\n\n\n# LLAMA_API bool llama_vocab_get_add_eos(const struct llama_vocab * vocab);\n@ctypes_function(\n    \"llama_vocab_get_add_eos\",\n    [llama_vocab_p_ctypes],\n    ctypes.c_bool,\n)\ndef llama_vocab_get_add_eos(vocab: llama_vocab_p, /) -> bool:\n    ...\n\n\n# LLAMA_API bool llama_vocab_get_add_sep(const struct 
llama_vocab * vocab);\n@ctypes_function(\n    \"llama_vocab_get_add_sep\",\n    [llama_vocab_p_ctypes],\n    ctypes.c_bool,\n)\ndef llama_vocab_get_add_sep(vocab: llama_vocab_p, /) -> bool:\n    ...\n\n\n# LLAMA_API llama_token llama_vocab_fim_pre(const struct llama_vocab * vocab);\n@ctypes_function(\n    \"llama_vocab_fim_pre\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_vocab_fim_pre(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n\n# LLAMA_API llama_token llama_vocab_fim_suf(const struct llama_vocab * vocab);\n@ctypes_function(\n    \"llama_vocab_fim_suf\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_vocab_fim_suf(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n\n# LLAMA_API llama_token llama_vocab_fim_mid(const struct llama_vocab * vocab);\n@ctypes_function(\n    \"llama_vocab_fim_mid\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_vocab_fim_mid(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n\n# LLAMA_API llama_token llama_vocab_fim_pad(const struct llama_vocab * vocab);\n@ctypes_function(\n    \"llama_vocab_fim_pad\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_vocab_fim_pad(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n\n# LLAMA_API llama_token llama_vocab_fim_rep(const struct llama_vocab * vocab);\n@ctypes_function(\n    \"llama_vocab_fim_rep\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_vocab_fim_rep(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n\n# LLAMA_API llama_token llama_vocab_fim_sep(const struct llama_vocab * vocab);\n@ctypes_function(\n    \"llama_vocab_fim_sep\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_vocab_fim_sep(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n\n# DEPRECATED functions\n# DEPRECATED(LLAMA_API const char * llama_token_get_text(const struct llama_vocab * vocab, llama_token token), \"use llama_vocab_get_text instead\");\n@ctypes_function(\n    \"llama_token_get_text\",\n    
[llama_vocab_p_ctypes, llama_token],\n    ctypes.c_char_p,\n)\ndef llama_token_get_text(\n    vocab: llama_vocab_p, token: Union[llama_token, int], /\n) -> bytes:\n    ...\n\n\n# DEPRECATED(LLAMA_API float llama_token_get_score(const struct llama_vocab * vocab, llama_token token), \"use llama_vocab_get_score instead\");\n@ctypes_function(\n    \"llama_token_get_score\",\n    [llama_vocab_p_ctypes, llama_token],\n    ctypes.c_float,\n)\ndef llama_token_get_score(\n    vocab: llama_vocab_p, token: Union[llama_token, int], /\n) -> float:\n    ...\n\n# DEPRECATED(LLAMA_API enum llama_token_attr llama_token_get_attr(const struct llama_vocab * vocab, llama_token token), \"use llama_vocab_get_attr instead\");\n@ctypes_function(\n    \"llama_token_get_attr\",\n    [llama_vocab_p_ctypes, llama_token],\n    ctypes.c_int,\n)\ndef llama_token_get_attr(\n    vocab: llama_vocab_p, token: Union[llama_token, int], /\n) -> int:\n    ...\n\n# DEPRECATED(LLAMA_API bool llama_token_is_eog(const struct llama_vocab * vocab, llama_token token), \"use llama_vocab_is_eog instead\");\n@ctypes_function(\n    \"llama_token_is_eog\",\n    [llama_vocab_p_ctypes, llama_token],\n    ctypes.c_bool,\n)\ndef llama_token_is_eog(\n    vocab: llama_vocab_p, token: Union[llama_token, int], /\n) -> bool:\n    ...\n\n# DEPRECATED(LLAMA_API bool llama_token_is_control(const struct llama_vocab * vocab, llama_token token), \"use llama_vocab_is_control instead\");\n@ctypes_function(\n    \"llama_token_is_control\",\n    [llama_vocab_p_ctypes, llama_token],\n    ctypes.c_bool,\n)\ndef llama_token_is_control(\n    vocab: llama_vocab_p, token: Union[llama_token, int], /\n) -> bool:\n    ...\n\n# DEPRECATED(LLAMA_API llama_token llama_token_bos(const struct llama_vocab * vocab), \"use llama_vocab_bos instead\");\n@ctypes_function(\n    \"llama_token_bos\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_bos(vocab: llama_vocab_p, /) -> int:\n    ...\n\n# DEPRECATED(LLAMA_API llama_token 
llama_token_eos(const struct llama_vocab * vocab), \"use llama_vocab_eos instead\");\n@ctypes_function(\n    \"llama_token_eos\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_eos(vocab: llama_vocab_p, /) -> int:\n    ...\n\n# DEPRECATED(LLAMA_API llama_token llama_token_eot(const struct llama_vocab * vocab), \"use llama_vocab_eot instead\");\n@ctypes_function(\n    \"llama_token_eot\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_eot(vocab: llama_vocab_p, /) -> int:\n    ...\n\n# DEPRECATED(LLAMA_API llama_token llama_token_cls(const struct llama_vocab * vocab), \"use llama_vocab_cls instead\");\n@ctypes_function(\n    \"llama_token_cls\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_cls(vocab: llama_vocab_p, /) -> int:\n    ...\n\n# DEPRECATED(LLAMA_API llama_token llama_token_sep(const struct llama_vocab * vocab), \"use llama_vocab_sep instead\");\n@ctypes_function(\n    \"llama_token_sep\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_sep(vocab: llama_vocab_p, /) -> int:\n    ...\n\n\n# DEPRECATED(LLAMA_API llama_token llama_token_nl (const struct llama_vocab * vocab), \"use llama_vocab_nl instead\");\n@ctypes_function(\n    \"llama_token_nl\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_nl(vocab: llama_vocab_p, /) -> int:\n    ...\n\n\n# DEPRECATED(LLAMA_API llama_token llama_token_pad(const struct llama_vocab * vocab), \"use llama_vocab_pad instead\");\n@ctypes_function(\n    \"llama_token_pad\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_pad(vocab: llama_vocab_p, /) -> int:\n    ...\n\n\n# DEPRECATED(LLAMA_API bool llama_add_bos_token(const struct llama_vocab * vocab), \"use llama_vocab_get_add_bos instead\");\n@ctypes_function(\n    \"llama_add_bos_token\",\n    [llama_vocab_p_ctypes],\n    ctypes.c_bool,\n)\ndef llama_add_bos_token(vocab: llama_vocab_p, /) -> bool:\n    ...\n\n# DEPRECATED(LLAMA_API bool 
llama_add_eos_token(const struct llama_vocab * vocab), \"use llama_vocab_get_add_eos instead\");\n@ctypes_function(\n    \"llama_add_eos_token\",\n    [llama_vocab_p_ctypes],\n    ctypes.c_bool,\n)\ndef llama_add_eos_token(vocab: llama_vocab_p, /) -> bool:\n    ...\n\n\n# DEPRECATED(LLAMA_API llama_token llama_token_fim_pre(const struct llama_vocab * vocab), \"use llama_vocab_fim_pre instead\");\n@ctypes_function(\n    \"llama_token_fim_pre\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_fim_pre(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n# DEPRECATED(LLAMA_API llama_token llama_token_fim_suf(const struct llama_vocab * vocab), \"use llama_vocab_fim_suf instead\");\n@ctypes_function(\n    \"llama_token_fim_suf\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_fim_suf(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n# DEPRECATED(LLAMA_API llama_token llama_token_fim_mid(const struct llama_vocab * vocab), \"use llama_vocab_fim_mid instead\");\n@ctypes_function(\n    \"llama_token_fim_mid\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_fim_mid(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n# DEPRECATED(LLAMA_API llama_token llama_token_fim_pad(const struct llama_vocab * vocab), \"use llama_vocab_fim_pad instead\");\n@ctypes_function(\n    \"llama_token_fim_pad\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_fim_pad(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n# DEPRECATED(LLAMA_API llama_token llama_token_fim_rep(const struct llama_vocab * vocab), \"use llama_vocab_fim_rep instead\");\n@ctypes_function(\n    \"llama_token_fim_rep\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_fim_rep(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n# DEPRECATED(LLAMA_API llama_token llama_token_fim_sep(const struct llama_vocab * vocab), \"use llama_vocab_fim_sep instead\");\n@ctypes_function(\n    \"llama_token_fim_sep\",\n    
[llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_token_fim_sep(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n# // CLS is equivalent to BOS\n# DEPRECATED(LLAMA_API llama_token llama_vocab_cls(const struct llama_vocab * vocab), // classification\n#         \"use llama_vocab_bos instead\");\n@ctypes_function(\n    \"llama_vocab_cls\",\n    [llama_vocab_p_ctypes],\n    llama_token,\n)\ndef llama_vocab_cls(vocab: llama_vocab_p, /) -> llama_token:\n    ...\n\n\n# //\n# // Tokenization\n# //\n# // The API is thread-safe.\n# //\n\n# /// @details Convert the provided text into tokens.\n# /// @param tokens The tokens pointer must be large enough to hold the resulting tokens.\n# /// @return Returns the number of tokens on success, no more than n_tokens_max\n# /// @return Returns a negative number on failure - the number of tokens that would have been returned\n# /// @return Returns INT32_MIN on overflow (e.g., tokenization result size exceeds int32_t limit)\n# /// @param add_special Allow to add BOS and EOS tokens if model is configured to do so.\n# /// @param parse_special Allow tokenizing special and/or control tokens which otherwise are not exposed and treated\n# ///                      as plaintext. 
Does not insert a leading space.\n# LLAMA_API int32_t llama_tokenize(\n#     const struct llama_vocab * vocab,\n#                   const char * text,\n#                      int32_t   text_len,\n#                  llama_token * tokens,\n#                      int32_t   n_tokens_max,\n#                         bool   add_special,\n#                         bool   parse_special);\n@ctypes_function(\n    \"llama_tokenize\",\n    [\n        llama_vocab_p_ctypes,\n        ctypes.c_char_p,\n        ctypes.c_int32,\n        llama_token_p,\n        ctypes.c_int32,\n        ctypes.c_bool,\n        ctypes.c_bool,\n    ],\n    ctypes.c_int32,\n)\ndef llama_tokenize(\n    vocab: llama_vocab_p,\n    text: bytes,\n    text_len: Union[ctypes.c_int, int],\n    tokens: CtypesArray[llama_token],\n    n_tokens_max: Union[ctypes.c_int, int],\n    add_special: Union[ctypes.c_bool, bool],\n    parse_special: Union[ctypes.c_bool, bool],\n    /,\n) -> int:\n    \"\"\"Convert the provided text into tokens.\n\n    Args:\n        vocab: The vocabulary to use for tokenization.\n        text: The text to tokenize.\n        text_len: The length of the text.\n        tokens: The tokens pointer must be large enough to hold the resulting tokens.\n        n_tokens_max: The maximum number of tokens to return.\n        add_special: Allow adding special tokens if the model is configured to do so.\n        parse_special: Allow parsing special tokens.\n\n    Returns:\n        Returns the number of tokens on success, no more than n_tokens_max\n        Returns a negative number on failure - the number of tokens that would have been returned\n        Returns INT32_MIN on overflow (e.g., tokenization result size exceeds int32_t limit)\n    \"\"\"\n    ...\n\n\n# // Token Id -> Piece.\n# // Uses the vocabulary in the provided context.\n# // Does not write null terminator to the buffer.\n# // User can skip up to 'lstrip' leading spaces before copying (useful when encoding/decoding multiple tokens with 'add_space_prefix')\n# // @param special If true, special tokens are rendered in the output.\n# 
LLAMA_API int32_t llama_token_to_piece(\n#           const struct llama_vocab * vocab,\n#                        llama_token   token,\n#                               char * buf,\n#                            int32_t   length,\n#                            int32_t   lstrip,\n#                               bool   special);\n@ctypes_function(\n    \"llama_token_to_piece\",\n    [\n        llama_vocab_p_ctypes,\n        llama_token,\n        ctypes.c_char_p,\n        ctypes.c_int32,\n        ctypes.c_int32,\n        ctypes.c_bool,\n    ],\n    ctypes.c_int32,\n)\ndef llama_token_to_piece(\n    vocab: llama_vocab_p,\n    token: Union[llama_token, int],\n    buf: Union[ctypes.c_char_p, bytes, CtypesArray[ctypes.c_char]],\n    length: Union[ctypes.c_int, int],\n    lstrip: Union[ctypes.c_int, int],\n    special: Union[ctypes.c_bool, bool],\n    /,\n) -> int:\n    \"\"\"Token Id -> Piece.\n    Uses the vocabulary in the provided context.\n    Does not write null terminator to the buffer.\n    User code is responsible to remove the leading whitespace of the first non-BOS token when decoding multiple tokens.\n\n    Args:\n        vocab: The vocabulary to use for tokenization.\n        token: The token to convert.\n        buf: The buffer to write the token to.\n        length: The length of the buffer.\n        lstrip: The number of leading spaces to skip.\n        special: If true, special tokens are rendered in the output.\"\"\"\n    ...\n\n\n# /// @details Convert the provided tokens into text (inverse of llama_tokenize()).\n# /// @param text The char pointer must be large enough to hold the resulting text.\n# /// @return Returns the number of chars/bytes on success, no more than text_len_max.\n# /// @return Returns a negative number on failure - the number of chars/bytes that would have been returned.\n# /// @param remove_special Allow to remove BOS and EOS tokens if model is configured to do so.\n# /// @param unparse_special If true, special tokens are rendered in the 
output.\n# LLAMA_API int32_t llama_detokenize(\n#     const struct llama_vocab * vocab,\n#            const llama_token * tokens,\n#                      int32_t   n_tokens,\n#                         char * text,\n#                      int32_t   text_len_max,\n#                         bool   remove_special,\n#                         bool   unparse_special);\n@ctypes_function(\n    \"llama_detokenize\",\n    [\n        llama_vocab_p_ctypes,\n        ctypes.POINTER(llama_token),\n        ctypes.c_int32,\n        ctypes.c_char_p,\n        ctypes.c_int32,\n        ctypes.c_bool,\n        ctypes.c_bool,\n    ],\n    ctypes.c_int32,\n)\ndef llama_detokenize(\n    vocab: llama_vocab_p,\n    tokens: CtypesArray[llama_token],\n    n_tokens: Union[ctypes.c_int, int],\n    text: bytes,\n    text_len_max: Union[ctypes.c_int, int],\n    remove_special: Union[ctypes.c_bool, bool],\n    unparse_special: Union[ctypes.c_bool, bool],\n    /,\n) -> int:\n    \"\"\"Convert the provided tokens into text (inverse of llama_tokenize()).\n\n    Args:\n        vocab: The vocabulary to use for tokenization.\n        tokens: The tokens to convert.\n        n_tokens: The number of tokens.\n        text: The buffer to write the text to.\n        text_len_max: The length of the buffer.\n        remove_special: Allow to remove BOS and EOS tokens if model is configured to do so.\n        unparse_special: If true, special tokens are rendered in the output.\"\"\"\n    ...\n\n\n# //\n# // Chat templates\n# //\n\n# /// Apply chat template. Inspired by hf apply_chat_template() on python.\n# /// Both \"model\" and \"custom_template\" are optional, but at least one is required. \"custom_template\" has higher precedence than \"model\"\n# /// NOTE: This function does not use a jinja parser. It only support a pre-defined list of template. See more: https://github.com/ggml-org/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template\n# /// @param tmpl A Jinja template to use for this chat. 
If this is nullptr, the model's default chat template will be used instead.\n# /// @param chat Pointer to a list of multiple llama_chat_message\n# /// @param n_msg Number of llama_chat_message in this chat\n# /// @param add_ass Whether to end the prompt with the token(s) that indicate the start of an assistant message.\n# /// @param buf A buffer to hold the output formatted prompt. The recommended alloc size is 2 * (total number of characters of all messages)\n# /// @param length The size of the allocated buffer\n# /// @return The total number of bytes of the formatted prompt. If it is larger than the size of buffer, you may need to re-alloc it and then re-apply the template.\n# LLAMA_API int32_t llama_chat_apply_template(\n#                         const char * tmpl,\n#    const struct llama_chat_message * chat,\n#                             size_t   n_msg,\n#                               bool   add_ass,\n#                               char * buf,\n#                            int32_t   length);\n@ctypes_function(\n    \"llama_chat_apply_template\",\n    [\n        ctypes.c_char_p,  # tmpl\n        ctypes.POINTER(llama_chat_message),  # chat\n        ctypes.c_size_t,  # n_msg\n        ctypes.c_bool,    # add_ass\n        ctypes.c_char_p,  # buf\n        ctypes.c_int32,   # length\n    ],\n    ctypes.c_int32,\n)\ndef llama_chat_apply_template(\n    tmpl: bytes,\n    chat: CtypesArray[llama_chat_message],\n    n_msg: int,\n    add_ass: bool,\n    buf: bytes,\n    length: int,\n    /,\n) -> int:\n    \"\"\"Apply chat template.\n\n    Args:\n        tmpl: Template to use. 
If None, uses model's default\n        chat: Array of chat messages\n        n_msg: Number of messages\n        add_ass: Whether to end prompt with assistant token\n        buf: Output buffer\n        length: Buffer length\n\n    Returns:\n        Number of bytes written, or needed if buffer too small\n    \"\"\"\n    ...\n\n\n# // Get list of built-in chat templates\n# LLAMA_API int32_t llama_chat_builtin_templates(const char ** output, size_t len);\n@ctypes_function(\n    \"llama_chat_builtin_templates\",\n    [\n        ctypes.POINTER(ctypes.c_char_p),\n        ctypes.c_size_t,\n    ],\n    ctypes.c_int32,\n)\ndef llama_chat_builtin_templates(\n    output: CtypesArray[bytes],\n    len: Union[ctypes.c_size_t, int],\n    /,\n) -> int:\n    \"\"\"Get list of built-in chat templates.\n\n    Args:\n        output: Output buffer to store template names.\n        len: Length of the output buffer.\n\n    Returns:\n        Number of templates available.\n        Returns a negative number on error.\n    \"\"\"\n    ...\n\n\n# //\n# // Sampling API\n# //\n\n# typedef void * llama_sampler_context_t;\nllama_sampler_context_t = ctypes.c_void_p\n\n\n# // user code can implement the interface below in order to create custom llama_sampler\n# struct llama_sampler_i {\n#     const char *           (*name)  (const struct llama_sampler * smpl);                                 // can be NULL\n#     void                   (*accept)(      struct llama_sampler * smpl, llama_token token);              // can be NULL\n#     void                   (*apply) (      struct llama_sampler * smpl, llama_token_data_array * cur_p); // required\n#     void                   (*reset) (      struct llama_sampler * smpl);                                 // can be NULL\n#     struct llama_sampler * (*clone) (const struct llama_sampler * smpl);                                 // can be NULL if ctx is NULL\n#     void                   (*free)  (      struct llama_sampler * smpl);                         
        // can be NULL if ctx is NULL\n\n#     // TODO: API for internal libllama usage for appending the sampling to an existing ggml_cgraph\n#     //void (*apply_ggml) (struct llama_sampler * smpl, ...);\n# };\nclass llama_sampler_i(ctypes.Structure):\n    ...\n\n\n# struct llama_sampler {\n#     const struct llama_sampler_i * iface;\n#     llama_sampler_context_t        ctx;\n# };\nclass llama_sampler(ctypes.Structure):\n    _fields_ = [\n        (\"iface\", ctypes.POINTER(llama_sampler_i)),\n        (\"ctx\", llama_sampler_context_t),\n    ]\n\n\nif TYPE_CHECKING:\n    llama_sampler_p = CtypesPointer[llama_sampler]\n\nllama_sampler_p_ctypes = ctypes.POINTER(llama_sampler)\n\nllama_sampler_i_name = ctypes.CFUNCTYPE(ctypes.c_char_p, llama_sampler_p_ctypes)\nllama_sampler_i_accept = ctypes.CFUNCTYPE(None, llama_sampler_p_ctypes, llama_token)\nllama_sampler_i_apply = ctypes.CFUNCTYPE(\n    None, llama_sampler_p_ctypes, llama_token_data_array_p\n)\nllama_sampler_i_reset = ctypes.CFUNCTYPE(None, llama_sampler_p_ctypes)\nllama_sampler_i_clone = ctypes.CFUNCTYPE(llama_sampler_p_ctypes, llama_sampler_p_ctypes)\nllama_sampler_i_free = ctypes.CFUNCTYPE(None, llama_sampler_p_ctypes)\n\nllama_sampler_i._fields_ = [\n    (\"name\", llama_sampler_i_name),\n    (\"accept\", llama_sampler_i_accept),\n    (\"apply\", llama_sampler_i_apply),\n    (\"reset\", llama_sampler_i_reset),\n    (\"clone\", llama_sampler_i_clone),\n    (\"free\", llama_sampler_i_free),\n]\n\n\n# // mirror of llama_sampler_i:\n# LLAMA_API struct llama_sampler * llama_sampler_init  (const struct llama_sampler_i * iface, llama_sampler_context_t ctx);\n@ctypes_function(\n    \"llama_sampler_init\",\n    [ctypes.POINTER(llama_sampler_i), llama_sampler_context_t],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init(\n    iface: ctypes.POINTER(llama_sampler_i), ctx: llama_sampler_context_t, /\n) -> llama_sampler_p:\n    ...\n\n\n# LLAMA_API const char *           llama_sampler_name  (const struct 
llama_sampler * smpl);\n@ctypes_function(\n    \"llama_sampler_name\",\n    [llama_sampler_p_ctypes],\n    ctypes.c_char_p,\n)\ndef llama_sampler_name(smpl: llama_sampler_p, /) -> bytes:\n    ...\n\n\n# LLAMA_API void                   llama_sampler_accept(      struct llama_sampler * smpl, llama_token token);\n@ctypes_function(\n    \"llama_sampler_accept\",\n    [llama_sampler_p_ctypes, llama_token],\n    None,\n)\ndef llama_sampler_accept(smpl: llama_sampler_p, token: Union[llama_token, int], /):\n    ...\n\n\n# LLAMA_API void                   llama_sampler_apply (      struct llama_sampler * smpl, llama_token_data_array * cur_p);\n@ctypes_function(\n    \"llama_sampler_apply\",\n    [llama_sampler_p_ctypes, llama_token_data_array_p],\n    None,\n)\ndef llama_sampler_apply(\n    smpl: llama_sampler_p, cur_p: CtypesArray[llama_token_data_array], /\n):\n    ...\n\n\n# LLAMA_API void                   llama_sampler_reset (      struct llama_sampler * smpl);\n@ctypes_function(\n    \"llama_sampler_reset\",\n    [llama_sampler_p_ctypes],\n    None,\n)\ndef llama_sampler_reset(smpl: llama_sampler_p, /):\n    ...\n\n\n# LLAMA_API struct llama_sampler * llama_sampler_clone (const struct llama_sampler * smpl);\n@ctypes_function(\n    \"llama_sampler_clone\",\n    [llama_sampler_p_ctypes],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_clone(smpl: llama_sampler_p, /) -> llama_sampler_p:\n    ...\n\n\n# // important: do not free if the sampler has been added to a llama_sampler_chain (via llama_sampler_chain_add)\n# LLAMA_API void                   llama_sampler_free  (      struct llama_sampler * smpl);\n@ctypes_function(\n    \"llama_sampler_free\",\n    [llama_sampler_p_ctypes],\n    None,\n)\ndef llama_sampler_free(smpl: llama_sampler_p, /):\n    ...\n\n\n# // llama_sampler_chain\n# // a type of llama_sampler that can chain multiple samplers one after another\n\n# LLAMA_API struct llama_sampler * llama_sampler_chain_init(struct llama_sampler_chain_params 
params);\n@ctypes_function(\n    \"llama_sampler_chain_init\",\n    [llama_sampler_chain_params],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_chain_init(params: llama_sampler_chain_params, /) -> llama_sampler_p:\n    ...\n\n\n# // important: takes ownership of the sampler object and will free it when llama_sampler_free is called\n# LLAMA_API void                   llama_sampler_chain_add(      struct llama_sampler * chain, struct llama_sampler * smpl);\n@ctypes_function(\n    \"llama_sampler_chain_add\",\n    [llama_sampler_p_ctypes, llama_sampler_p_ctypes],\n    None,\n)\ndef llama_sampler_chain_add(chain: llama_sampler_p, smpl: llama_sampler_p, /):\n    ...\n\n\n# LLAMA_API struct llama_sampler * llama_sampler_chain_get(const struct llama_sampler * chain, int32_t i);\n@ctypes_function(\n    \"llama_sampler_chain_get\",\n    [llama_sampler_p_ctypes, ctypes.c_int32],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_chain_get(\n    chain: llama_sampler_p, i: Union[ctypes.c_int32, int], /\n) -> llama_sampler_p:\n    ...\n\n\n# LLAMA_API int                    llama_sampler_chain_n  (const struct llama_sampler * chain);\n@ctypes_function(\n    \"llama_sampler_chain_n\",\n    [llama_sampler_p_ctypes],\n    ctypes.c_int,\n)\ndef llama_sampler_chain_n(chain: llama_sampler_p, /) -> int:\n    ...\n\n\n# // after removing a sampler, the chain will no longer own it, and it will not be freed when the chain is freed\n# LLAMA_API struct llama_sampler * llama_sampler_chain_remove(   struct llama_sampler * chain, int32_t i);\n@ctypes_function(\n    \"llama_sampler_chain_remove\",\n    [llama_sampler_p_ctypes, ctypes.c_int32],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_chain_remove(\n    chain: llama_sampler_p, i: Union[ctypes.c_int32, int], /\n) -> llama_sampler_p:\n    ...\n\n\n# // available samplers:\n\n# LLAMA_API struct llama_sampler * llama_sampler_init_greedy(void);\n@ctypes_function(\"llama_sampler_init_greedy\", [], llama_sampler_p_ctypes)\ndef 
llama_sampler_init_greedy() -> llama_sampler_p:\n    ...\n\n\n# LLAMA_API struct llama_sampler * llama_sampler_init_dist  (uint32_t seed);\n@ctypes_function(\"llama_sampler_init_dist\", [ctypes.c_uint32], llama_sampler_p_ctypes)\ndef llama_sampler_init_dist(seed: int) -> llama_sampler_p:\n    ...\n\n\n# /// @details Sorts candidate tokens by their logits in descending order and calculate probabilities based on logits.\n# /// NOTE: Avoid using on the full vocabulary as the sorting can become slow. For example, apply top-k or top-p sampling first.\n# DEPRECATED(LLAMA_API struct llama_sampler * llama_sampler_init_softmax    (void),\n#     \"will be removed in the future (see https://github.com/ggml-org/llama.cpp/pull/9896#discussion_r1800920915)\");\n@ctypes_function(\"llama_sampler_init_softmax\", [], llama_sampler_p_ctypes)\ndef llama_sampler_init_softmax() -> llama_sampler_p:\n    ...\n\n\n# /// @details Top-K sampling described in academic paper \"The Curious Case of Neural Text Degeneration\" https://arxiv.org/abs/1904.09751\n# /// Setting k <= 0 makes this a noop\n# LLAMA_API struct llama_sampler * llama_sampler_init_top_k      (int32_t k);\n@ctypes_function(\"llama_sampler_init_top_k\", [ctypes.c_int32], llama_sampler_p_ctypes)\ndef llama_sampler_init_top_k(k: int) -> llama_sampler_p:\n    ...\n\n\n# /// @details Nucleus sampling described in academic paper \"The Curious Case of Neural Text Degeneration\" https://arxiv.org/abs/1904.09751\n# LLAMA_API struct llama_sampler * llama_sampler_init_top_p      (float   p, size_t min_keep);\n@ctypes_function(\n    \"llama_sampler_init_top_p\",\n    [ctypes.c_float, ctypes.c_size_t],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_top_p(p: float, min_keep: int) -> llama_sampler_p:\n    ...\n\n\n# /// @details Minimum P sampling as described in https://github.com/ggml-org/llama.cpp/pull/3841\n# LLAMA_API struct llama_sampler * llama_sampler_init_min_p      (float   p, size_t min_keep);\n@ctypes_function(\n    
\"llama_sampler_init_min_p\",\n    [ctypes.c_float, ctypes.c_size_t],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_min_p(p: float, min_keep: int) -> llama_sampler_p:\n    ...\n\n\n# /// @details Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.\n# LLAMA_API struct llama_sampler * llama_sampler_init_typical    (float   p, size_t min_keep);\n@ctypes_function(\n    \"llama_sampler_init_typical\",\n    [ctypes.c_float, ctypes.c_size_t],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_typical(p: float, min_keep: int) -> llama_sampler_p:\n    ...\n\n\n# /// #details Updates the logits l_i` = l_i/t. When t <= 0.0f, the maximum logit is kept at it's original value, the rest are set to -inf\n# LLAMA_API struct llama_sampler * llama_sampler_init_temp       (float   t);\n@ctypes_function(\"llama_sampler_init_temp\", [ctypes.c_float], llama_sampler_p_ctypes)\ndef llama_sampler_init_temp(t: float) -> llama_sampler_p:\n    ...\n\n\n# /// @details Dynamic temperature implementation (a.k.a. 
entropy) described in the paper https://arxiv.org/abs/2309.02772.\n# LLAMA_API struct llama_sampler * llama_sampler_init_temp_ext   (float   t, float   delta, float exponent);\n@ctypes_function(\n    \"llama_sampler_init_temp_ext\",\n    [ctypes.c_float, ctypes.c_float, ctypes.c_float],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_temp_ext(\n    t: float, delta: float, exponent: float\n) -> llama_sampler_p:\n    ...\n\n\n# /// @details XTC sampler as described in https://github.com/oobabooga/text-generation-webui/pull/6335\n# LLAMA_API struct llama_sampler * llama_sampler_init_xtc        (float   p, float   t,     size_t min_keep, uint32_t seed);\n@ctypes_function(\n    \"llama_sampler_init_xtc\",\n    [ctypes.c_float, ctypes.c_float, ctypes.c_size_t, ctypes.c_uint32],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_xtc(\n    p: float, t: float, min_keep: int, seed: int, /\n) -> llama_sampler_p:\n    ...\n\n\n# /// @details Top n sigma sampling as described in academic paper \"Top-nσ: Not All Logits Are You Need\" https://arxiv.org/pdf/2411.07641\n# LLAMA_API struct llama_sampler * llama_sampler_init_top_n_sigma(float   n);\n@ctypes_function(\n    \"llama_sampler_init_top_n_sigma\",\n    [ctypes.c_float],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_top_n_sigma(n: float, /) -> llama_sampler_p:\n    ...\n\n\n# /// @details Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. 
Uses tokens instead of words.\n# LLAMA_API struct llama_sampler * llama_sampler_init_mirostat(\n#                          int32_t   n_vocab,\n#                         uint32_t   seed,\n#                            float   tau,\n#                            float   eta,\n#                          int32_t   m);\n@ctypes_function(\n    \"llama_sampler_init_mirostat\",\n    [ctypes.c_int32, ctypes.c_uint32, ctypes.c_float, ctypes.c_float, ctypes.c_int32],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_mirostat(\n    n_vocab: int, seed: int, tau: float, eta: float, m: int, /\n) -> llama_sampler_p:\n    ...\n\n\n# /// @details Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.\n# LLAMA_API struct llama_sampler * llama_sampler_init_mirostat_v2(\n#                         uint32_t   seed,\n#                            float   tau,\n#                            float   eta);\n@ctypes_function(\n    \"llama_sampler_init_mirostat_v2\",\n    [ctypes.c_uint32, ctypes.c_float, ctypes.c_float],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_mirostat_v2(\n    seed: int, tau: float, eta: float, /\n) -> llama_sampler_p:\n    ...\n\n\n# /// @details Initializes a GBNF grammar, see grammars/README.md for details.\n# LLAMA_API struct llama_sampler * llama_sampler_init_grammar(\n#         const struct llama_vocab * vocab,\n#                       const char * grammar_str,\n#                       const char * grammar_root);\n@ctypes_function(\n    \"llama_sampler_init_grammar\",\n    [llama_vocab_p_ctypes, ctypes.c_char_p, ctypes.c_char_p],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_grammar(\n    vocab: llama_vocab_p, grammar_str: bytes, grammar_root: bytes, /\n) -> llama_sampler_p:\n    ...\n\n\n# DEPRECATED(LLAMA_API struct llama_sampler * llama_sampler_init_grammar_lazy(\n#         const struct llama_vocab * vocab,\n#                       const char * grammar_str,\n#                       
const char * grammar_root,\n#                      const char ** trigger_words,\n#                             size_t num_trigger_words,\n#                const llama_token * trigger_tokens,\n#                             size_t num_trigger_tokens),\n#     \"use llama_sampler_init_grammar_lazy_patterns instead\");\n@ctypes_function(\n    \"llama_sampler_init_grammar_lazy\",\n    [\n        llama_vocab_p_ctypes,\n        ctypes.c_char_p,\n        ctypes.c_char_p,\n        ctypes.POINTER(ctypes.c_char_p),\n        ctypes.c_size_t,\n        ctypes.POINTER(llama_token),\n        ctypes.c_size_t,\n    ],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_grammar_lazy(\n    vocab: llama_vocab_p,\n    grammar_str: bytes,\n    grammar_root: bytes,\n    trigger_words: CtypesArray[bytes],\n    num_trigger_words: int,\n    trigger_tokens: CtypesArray[llama_token],\n    num_trigger_tokens: int,\n    /,\n) -> llama_sampler_p:\n    ...\n\n\n# /// @details Lazy grammar sampler, introduced in https://github.com/ggml-org/llama.cpp/pull/9639\n# LLAMA_API struct llama_sampler * llama_sampler_init_grammar_lazy_patterns(\n#     const struct llama_vocab * vocab,\n#                   const char * grammar_str,\n#                   const char * grammar_root,\n#                  const char ** trigger_patterns,\n#                         size_t num_trigger_patterns,\n#            const llama_token * trigger_tokens,\n#                         size_t num_trigger_tokens);\n@ctypes_function(\n    \"llama_sampler_init_grammar_lazy_patterns\",\n    [\n        llama_vocab_p_ctypes,\n        ctypes.c_char_p,\n        ctypes.c_char_p,\n        ctypes.POINTER(ctypes.c_char_p),\n        ctypes.c_size_t,\n        ctypes.POINTER(llama_token),\n        ctypes.c_size_t,\n    ],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_grammar_lazy_patterns(\n    vocab: llama_vocab_p,\n    grammar_str: bytes,\n    grammar_root: bytes,\n    trigger_patterns: CtypesArray[bytes],\n    num_trigger_patterns: 
int,\n    trigger_tokens: CtypesArray[llama_token],\n    num_trigger_tokens: int,\n    /,\n) -> llama_sampler_p:\n    ...\n\n\n# /// NOTE: Avoid using on the full vocabulary as searching for repeated tokens can become slow. For example, apply top-k or top-p sampling first.\n# LLAMA_API struct llama_sampler * llama_sampler_init_penalties(\n#                          int32_t   penalty_last_n,   // last n tokens to penalize (0 = disable penalty, -1 = context size)\n#                            float   penalty_repeat,   // 1.0 = disabled\n#                            float   penalty_freq,     // 0.0 = disabled\n#                            float   penalty_present); // 0.0 = disabled\n@ctypes_function(\n    \"llama_sampler_init_penalties\",\n    [ctypes.c_int32, ctypes.c_float, ctypes.c_float, ctypes.c_float],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_penalties(\n    penalty_last_n: int,\n    penalty_repeat: float,\n    penalty_freq: float,\n    penalty_present: float,\n    /,\n) -> llama_sampler_p:\n    ...\n\n\n# ///  @details DRY sampler, designed by p-e-w, as described in: https://github.com/oobabooga/text-generation-webui/pull/5677, porting Koboldcpp implementation authored by pi6am: https://github.com/LostRuins/koboldcpp/pull/982\n# LLAMA_API struct llama_sampler *    llama_sampler_init_dry(\n#         const struct llama_vocab *  vocab,\n#                          int32_t    n_ctx_train,\n#                            float    dry_multiplier,\n#                            float    dry_base,\n#                          int32_t    dry_allowed_length,\n#                          int32_t    dry_penalty_last_n,\n#                       const char ** seq_breakers,\n#                           size_t    num_breakers);\n@ctypes_function(\n    \"llama_sampler_init_dry\",\n    [\n        llama_vocab_p_ctypes,\n        ctypes.c_int32,\n        ctypes.c_float,\n        ctypes.c_float,\n        ctypes.c_int32,\n        ctypes.c_int32,\n        
ctypes.POINTER(ctypes.c_char_p),\n        ctypes.c_size_t,\n    ],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_dry(\n    vocab: llama_vocab_p,\n    n_ctx_train: int,\n    dry_multiplier: float,\n    dry_base: float,\n    dry_allowed_length: int,\n    dry_penalty_last_n: int,\n    seq_breakers,\n    num_breakers: int,\n    /,\n) -> llama_sampler_p:\n    ...\n\n\n# LLAMA_API struct llama_sampler * llama_sampler_init_logit_bias(\n#                          int32_t   n_vocab,\n#                          int32_t   n_logit_bias,\n#           const llama_logit_bias * logit_bias);\n@ctypes_function(\n    \"llama_sampler_init_logit_bias\",\n    [ctypes.c_int32, ctypes.c_int32, llama_logit_bias_p],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_logit_bias(\n    n_vocab: int, n_logit_bias: int, logit_bias: CtypesArray[llama_logit_bias], /\n) -> llama_sampler_p:\n    ...\n\n\n# // this sampler is meant to be used for fill-in-the-middle infilling\n# LLAMA_API struct llama_sampler * llama_sampler_init_infill(const struct llama_vocab * vocab);\n@ctypes_function(\n    \"llama_sampler_init_infill\",\n    [llama_vocab_p_ctypes],\n    llama_sampler_p_ctypes,\n)\ndef llama_sampler_init_infill(vocab: llama_vocab_p, /) -> llama_sampler_p:\n    ...\n\n\n# // Returns the seed used by the sampler if applicable, LLAMA_DEFAULT_SEED otherwise\n# LLAMA_API uint32_t llama_sampler_get_seed(const struct llama_sampler * smpl);\n@ctypes_function(\n    \"llama_sampler_get_seed\",\n    [llama_sampler_p_ctypes],\n    ctypes.c_uint32,\n)\ndef llama_sampler_get_seed(smpl: llama_sampler_p, /) -> int:\n    ...\n\n\n# /// @details Sample and accept a token from the idx-th output of the last evaluation\n# LLAMA_API llama_token llama_sampler_sample(struct llama_sampler * smpl, struct llama_context * ctx, int32_t idx);\n@ctypes_function(\n    \"llama_sampler_sample\",\n    [llama_sampler_p_ctypes, llama_context_p_ctypes, ctypes.c_int32],\n    llama_token,\n)\ndef llama_sampler_sample(\n   
 smpl: llama_sampler_p, ctx: llama_context_p, idx: int, /\n) -> int:\n    ...\n\n\n# //\n# // Model split\n# //\n\n# /// @details Build a split GGUF final path for this chunk.\n# LLAMA_API int llama_split_path(char * split_path, size_t maxlen, const char * path_prefix, int split_no, int split_count);\n@ctypes_function(\n    \"llama_split_path\",\n    [ctypes.c_char_p, ctypes.c_size_t, ctypes.c_char_p, ctypes.c_int, ctypes.c_int],\n    ctypes.c_int,\n)\ndef llama_split_path(\n    split_path: bytes,\n    maxlen: Union[ctypes.c_size_t, int],\n    path_prefix: bytes,\n    split_no: Union[ctypes.c_int, int],\n    split_count: Union[ctypes.c_int, int],\n    /,\n) -> int:\n    \"\"\"Build a split GGUF final path for this chunk.\"\"\"\n    ...\n\n\n# /// @details Extract the path prefix from the split_path if and only if the split_no and split_count match.\n# LLAMA_API int llama_split_prefix(char * split_prefix, size_t maxlen, const char * split_path, int split_no, int split_count);\n@ctypes_function(\n    \"llama_split_prefix\",\n    [ctypes.c_char_p, ctypes.c_size_t, ctypes.c_char_p, ctypes.c_int, ctypes.c_int],\n    ctypes.c_int,\n)\ndef llama_split_prefix(\n    split_prefix: bytes,\n    maxlen: Union[ctypes.c_size_t, int],\n    split_path: bytes,\n    split_no: Union[ctypes.c_int, int],\n    split_count: Union[ctypes.c_int, int],\n    /,\n) -> int:\n    \"\"\"Extract the path prefix from the split_path if and only if the split_no and split_count match.\"\"\"\n    ...\n\n\n# // Print system information\n# LLAMA_API const char * llama_print_system_info(void);\n@ctypes_function(\"llama_print_system_info\", [], ctypes.c_char_p)\ndef llama_print_system_info() -> bytes:\n    ...\n\n\n# // Set callback for all future logging events.\n# // If this is not called, or NULL is supplied, everything is output on stderr.\n# LLAMA_API void llama_log_set(ggml_log_callback log_callback, void * user_data);\n@ctypes_function(\n    \"llama_log_set\",\n    [ctypes.c_void_p, 
ctypes.c_void_p],\n    None,\n)\ndef llama_log_set(\n    log_callback: Optional[CtypesFuncPointer],\n    user_data: ctypes.c_void_p,\n    /,\n):\n    \"\"\"Set callback for all future logging events.\n\n    If this is not called, or NULL is supplied, everything is output on stderr.\"\"\"\n    ...\n\n\n# //\n# // Performance utils\n# //\n\n# struct llama_perf_context_data {\n#     double t_start_ms;\n#     double t_load_ms;\n#     double t_p_eval_ms;\n#     double t_eval_ms;\n\n#     int32_t n_p_eval;\n#     int32_t n_eval;\n#     int32_t n_reused; // number of times a ggml compute graph had been reused\n# };\nclass llama_perf_context_data(ctypes.Structure):\n    _fields_ = [\n        (\"t_start_ms\", ctypes.c_double),\n        (\"t_load_ms\", ctypes.c_double),\n        (\"t_p_eval_ms\", ctypes.c_double),\n        (\"t_eval_ms\", ctypes.c_double),\n        (\"n_p_eval\", ctypes.c_int32),\n        (\"n_eval\", ctypes.c_int32),\n        (\"n_reused\", ctypes.c_int32),\n    ]\n\n\n# struct llama_perf_sampler_data {\n#     double t_sample_ms;\n\n#     int32_t n_sample;\n# };\nclass llama_perf_sampler_data(ctypes.Structure):\n    _fields_ = [\n        (\"t_sample_ms\", ctypes.c_double),\n        (\"n_sample\", ctypes.c_int32),\n    ]\n\n\n# LLAMA_API struct llama_perf_context_data llama_perf_context      (const struct llama_context * ctx);\n@ctypes_function(\n    \"llama_perf_context\",\n    [llama_context_p_ctypes],\n    llama_perf_context_data,\n)\ndef llama_perf_context(ctx: llama_context_p, /) -> llama_perf_context_data:\n    ...\n\n\n# LLAMA_API void                           llama_perf_context_print(const struct llama_context * ctx);\n@ctypes_function(\n    \"llama_perf_context_print\",\n    [llama_context_p_ctypes],\n    None,\n)\ndef llama_perf_context_print(ctx: llama_context_p, /):\n    ...\n\n\n# LLAMA_API void                           llama_perf_context_reset(      struct llama_context * ctx);\n@ctypes_function(\n    \"llama_perf_context_reset\",\n    
[llama_context_p_ctypes],\n    None,\n)\ndef llama_perf_context_reset(ctx: llama_context_p, /):\n    ...\n\n\n# // NOTE: the following work only with samplers constructed via llama_sampler_chain_init\n# LLAMA_API struct llama_perf_sampler_data llama_perf_sampler      (const struct llama_sampler * chain);\n@ctypes_function(\n    \"llama_perf_sampler\",\n    [llama_sampler_p_ctypes],\n    llama_perf_sampler_data,\n)\ndef llama_perf_sampler(chain: llama_sampler_p, /) -> llama_perf_sampler_data:\n    ...\n\n\n# LLAMA_API void                           llama_perf_sampler_print(const struct llama_sampler * chain);\n@ctypes_function(\n    \"llama_perf_sampler_print\",\n    [llama_sampler_p_ctypes],\n    None,\n)\ndef llama_perf_sampler_print(chain: llama_sampler_p, /):\n    ...\n\n\n# LLAMA_API void                           llama_perf_sampler_reset(      struct llama_sampler * chain);\n@ctypes_function(\n    \"llama_perf_sampler_reset\",\n    [llama_sampler_p_ctypes],\n    None,\n)\ndef llama_perf_sampler_reset(chain: llama_sampler_p, /):\n    ...\n\n\n# //\n# // training\n# //\n\n# // function that returns whether or not a given tensor contains trainable parameters\n# typedef bool (*llama_opt_param_filter)(const struct ggml_tensor * tensor, void * userdata);\nllama_opt_param_filter = ctypes.CFUNCTYPE(ctypes.c_bool, ctypes.c_void_p, ctypes.c_void_p)\n\n# // always returns true\n# LLAMA_API bool llama_opt_param_filter_all(const struct ggml_tensor * tensor, void * userdata);\n@ctypes_function(\n    \"llama_opt_param_filter_all\",\n    [ctypes.c_void_p, ctypes.c_void_p],\n    ctypes.c_bool,\n)\ndef llama_opt_param_filter_all(tensor: ctypes.c_void_p, userdata: ctypes.c_void_p, /) -> bool:\n    ...\n\n\n# struct llama_opt_params {\n#     uint32_t n_ctx_train; // assumed context size post training, use context size specified in llama_context if 0\n\n#     llama_opt_param_filter param_filter; // callback for determining which tensors contain trainable parameters\n#     void * 
param_filter_ud;              // userdata for determining which tensors contain trainable parameters\n\n#     ggml_opt_get_optimizer_params get_opt_pars; // callback for calculating optimizer parameters\n#     void * get_opt_pars_ud;                     // userdata for calculating optimizer parameters\n# };\nclass llama_opt_params(ctypes.Structure):\n    _fields_ = [\n        (\"n_ctx_train\", ctypes.c_uint32),\n        (\"param_filter\", llama_opt_param_filter),\n        (\"param_filter_ud\", ctypes.c_void_p),\n        (\"get_opt_pars\", ctypes.c_void_p),  # ggml_opt_get_optimizer_params - not implemented here\n        (\"get_opt_pars_ud\", ctypes.c_void_p),\n    ]\n\n\n# LLAMA_API void llama_opt_init(struct llama_context * lctx, struct llama_model * model, struct llama_opt_params lopt_params);\n@ctypes_function(\n    \"llama_opt_init\",\n    [llama_context_p_ctypes, llama_model_p_ctypes, llama_opt_params],\n    None,\n)\ndef llama_opt_init(lctx: llama_context_p, model: llama_model_p, lopt_params: llama_opt_params, /):\n    ...\n\n\n# LLAMA_API void llama_opt_epoch(\n#         struct llama_context    * lctx,\n#         ggml_opt_dataset_t        dataset,\n#         ggml_opt_result_t         result_train,\n#         ggml_opt_result_t         result_eval,\n#         int64_t                   idata_split,\n#         ggml_opt_epoch_callback   callback_train,\n#         ggml_opt_epoch_callback   callback_eval);\n@ctypes_function(\n    \"llama_opt_epoch\",\n    [\n        llama_context_p_ctypes,\n        ctypes.c_void_p,  # ggml_opt_dataset_t\n        ctypes.c_void_p,  # ggml_opt_result_t  \n        ctypes.c_void_p,  # ggml_opt_result_t\n        ctypes.c_int64,\n        ctypes.c_void_p,  # ggml_opt_epoch_callback\n        ctypes.c_void_p,  # ggml_opt_epoch_callback\n    ],\n    None,\n)\ndef llama_opt_epoch(\n    lctx: llama_context_p,\n    dataset: ctypes.c_void_p,\n    result_train: ctypes.c_void_p,\n    result_eval: ctypes.c_void_p,\n    idata_split: int,\n    
callback_train: ctypes.c_void_p,\n    callback_eval: ctypes.c_void_p,\n    /,\n):\n    ...\n"
  },
  {
    "path": "llama_cpp/llama_grammar.py",
    "content": "\"\"\"Python implementation of llama grammar parser directly translated from C++ source file in vendor/llama.cpp/common/grammar-parser.cpp.\"\"\"\n\n# flake8: noqa\nfrom pathlib import Path\n\nfrom itertools import groupby\nfrom typing import (\n    Any,\n    Set,\n    List,\n    Optional,\n    Tuple,\n    Union,\n)\n\nLLAMA_GRAMMAR_DEFAULT_ROOT = \"root\"\n\n\nclass LlamaGrammar:\n    def __init__(self, *args, _grammar: str, **kwargs):\n        self._grammar = _grammar\n        self._root = LLAMA_GRAMMAR_DEFAULT_ROOT\n\n    @classmethod\n    def from_string(cls, grammar: str, verbose: bool = True) -> \"LlamaGrammar\":\n        return cls(_grammar=grammar)\n\n    @classmethod\n    def from_file(cls, file: Union[str, Path], verbose: bool = True) -> \"LlamaGrammar\":\n        try:\n            with open(file) as f:\n                grammar = f.read()\n        except Exception as err:\n            raise Exception(\n                f\"{cls.from_file.__name__}: error reading grammar file: {err}\"\n            )\n\n        if grammar:\n            return cls.from_string(grammar, verbose=verbose)\n\n        raise ValueError(\n            f\"{cls.from_file.__name__}: error parsing grammar file: params_grammer is empty\"\n        )\n\n    @classmethod\n    def from_json_schema(cls, json_schema: str, verbose: bool = True) -> \"LlamaGrammar\":\n        return cls.from_string(json_schema_to_gbnf(json_schema), verbose=verbose)\n\n\n\"\"\"llama.cpp gbnf rules from vendor/llama.cpp/grammars\"\"\"\n\nARITHMETIC_GBNF = r\"\"\"\nroot  ::= (expr \"=\" ws term \"\\n\")+\nexpr  ::= term ([-+*/] term)*\nterm  ::= ident | num | \"(\" ws expr \")\" ws\nident ::= [a-z] [a-z0-9_]* ws\nnum   ::= [0-9]+ ws\nws    ::= [ \\t\\n]*\n\"\"\"\n\nC_GBNF = r\"\"\"\nroot ::= (declaration)*\n\ndeclaration ::= dataType identifier \"(\" parameter? 
\")\" \"{\" statement* \"}\"\n\ndataType  ::= \"int\" ws | \"float\" ws | \"char\" ws\nidentifier ::= [a-zA-Z_] [a-zA-Z_0-9]*\n\nparameter ::= dataType identifier\n\nstatement ::=\n    ( dataType identifier ws \"=\" ws expression \";\" ) |\n    ( identifier ws \"=\" ws expression \";\" ) |\n    ( identifier ws \"(\" argList? \")\" \";\" ) |\n    ( \"return\" ws expression \";\" ) |\n    ( \"while\" \"(\" condition \")\" \"{\" statement* \"}\" ) |\n    ( \"for\" \"(\" forInit \";\" ws condition \";\" ws forUpdate \")\" \"{\" statement* \"}\" ) |\n    ( \"if\" \"(\" condition \")\" \"{\" statement* \"}\" (\"else\" \"{\" statement* \"}\")? ) |\n    ( singleLineComment ) |\n    ( multiLineComment )\n\nforInit ::= dataType identifier ws \"=\" ws expression | identifier ws \"=\" ws expression\nforUpdate ::= identifier ws \"=\" ws expression\n\ncondition ::= expression relationOperator expression\nrelationOperator ::= (\"<=\" | \"<\" | \"==\" | \"!=\" | \">=\" | \">\")\n\nexpression ::= term ((\"+\" | \"-\") term)*\nterm ::= factor((\"*\" | \"/\") factor)*\n\nfactor ::= identifier | number | unaryTerm | funcCall | parenExpression\nunaryTerm ::= \"-\" factor\nfuncCall ::= identifier \"(\" argList? \")\"\nparenExpression ::= \"(\" ws expression ws \")\"\n\nargList ::= expression (\",\" ws expression)*\n\nnumber ::= [0-9]+\n\nsingleLineComment ::= \"//\" [^\\n]* \"\\n\"\nmultiLineComment ::= \"/*\" ( [^*] | (\"*\" [^/]) )* \"*/\"\n\nws ::= ([ \\t\\n]+)\n\"\"\"\n\nCHESS_GBNF = r\"\"\"\nroot   ::= object\nvalue  ::= object | array | string | number | (\"true\" | \"false\" | \"null\") ws\n\nobject ::=\n  \"{\" ws (\n            string \":\" ws value\n    (\",\" ws string \":\" ws value)*\n  )? \"}\" ws\n\narray  ::=\n  \"[\" ws (\n            value\n    (\",\" ws value)*\n  )? \"]\" ws\n\nstring ::=\n  \"\\\"\" (\n    [^\"\\\\] |\n    \"\\\\\" ([\"\\\\/bfnrt] | \"u\" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) # escapes\n  )* \"\\\"\" ws\n\nnumber ::= (\"-\"? 
([0-9] | [1-9] [0-9]*)) (\".\" [0-9]+)? ([eE] [-+]? [0-9]+)? ws\n\n# Optional space: by convention, applied in this grammar after literal chars when allowed\nws ::= ([ \\t\\n] ws)?\n\"\"\"\n\nJAPANESE_GBNF = r\"\"\"\nroot   ::= object\nvalue  ::= object | array | string | number | (\"true\" | \"false\" | \"null\") ws\n\nobject ::=\n  \"{\" ws (\n            string \":\" ws value\n    (\",\" ws string \":\" ws value)*\n  )? \"}\" ws\n\narray  ::=\n  \"[\" ws (\n            value\n    (\",\" ws value)*\n  )? \"]\" ws\n\nstring ::=\n  \"\\\"\" (\n    [^\"\\\\] |\n    \"\\\\\" ([\"\\\\/bfnrt] | \"u\" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) # escapes\n  )* \"\\\"\" ws\n\nnumber ::= (\"-\"? ([0-9] | [1-9] [0-9]*)) (\".\" [0-9]+)? ([eE] [-+]? [0-9]+)? ws\n\n# Optional space: by convention, applied in this grammar after literal chars when allowed\nws ::= ([ \\t\\n] ws)?\n\"\"\"\n\nJSON_ARR_GBNF = r\"\"\"\n# This is the same as json.gbnf but we restrict whitespaces at the end of the root array\n# Useful for generating JSON arrays\n\nroot   ::= arr\nvalue  ::= object | array | string | number | (\"true\" | \"false\" | \"null\") ws\n\narr  ::=\n  \"[\\n\" ws (\n            value\n    (\",\\n\" ws value)*\n  )? \"]\"\n\nobject ::=\n  \"{\" ws (\n            string \":\" ws value\n    (\",\" ws string \":\" ws value)*\n  )? \"}\" ws\n\narray  ::=\n  \"[\" ws (\n            value\n    (\",\" ws value)*\n  )? \"]\" ws\n\nstring ::=\n  \"\\\"\" (\n    [^\"\\\\\\x7F\\x00-\\x1F] |\n    \"\\\\\" ([\"\\\\/bfnrt] | \"u\" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) # escapes\n  )* \"\\\"\" ws\n\nnumber ::= (\"-\"? ([0-9] | [1-9] [0-9]*)) (\".\" [0-9]+)? ([eE] [-+]? [0-9]+)? 
ws\n\n# Optional space: by convention, applied in this grammar after literal chars when allowed\nws ::= ([ \\t\\n] ws)?\n\"\"\"\n\n\nJSON_GBNF = r\"\"\"\nroot   ::= object\nvalue  ::= object | array | string | number | (\"true\" | \"false\" | \"null\") ws\n\nobject ::=\n  \"{\" ws (\n            string \":\" ws value\n    (\",\" ws string \":\" ws value)*\n  )? \"}\" ws\n\narray  ::=\n  \"[\" ws (\n            value\n    (\",\" ws value)*\n  )? \"]\" ws\n\nstring ::=\n  \"\\\"\" (\n    [^\"\\\\\\x7F\\x00-\\x1F] |\n    \"\\\\\" ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4}) # escapes\n  )* \"\\\"\" ws\n\nnumber ::= (\"-\"? ([0-9] | [1-9] [0-9]{0,15})) (\".\" [0-9]+)? ([eE] [-+]? [0-9] [1-9]{0,15})? ws\n\n# Optional space: by convention, applied in this grammar after literal chars when allowed\nws ::= | \" \" | \"\\n\" [ \\t]{0,20}\n\"\"\"\n\nLIST_GBNF = r\"\"\"\nroot ::= item+\n\n# Excludes various line break characters\nitem ::= \"- \" [^\\r\\n\\x0b\\x0c\\x85\\u2028\\u2029]+ \"\\n\"\n\"\"\"\n\n\"\"\"llama.cpp json-schema to grammar converter from vendor/llama.cpp/examples/json-schema-to-grammar.py\"\"\"\nimport json\nimport re\nfrom typing import List, Optional\n\n# whitespace is constrained to a single space char to prevent model \"running away\" in\n# whitespace. Also maybe improves generation quality?\nSPACE_RULE = '\" \"?'\n\n\nINVALID_RULE_CHARS_RE = re.compile(r\"[^a-zA-Z0-9-]+\")\nGRAMMAR_LITERAL_ESCAPE_RE = re.compile(r'[\\r\\n\"]')\nGRAMMAR_LITERAL_ESCAPES = {\"\\r\": \"\\\\r\", \"\\n\": \"\\\\n\", '\"': '\\\\\"'}\n\n# whitespace is constrained to a single space char to prevent model \"running away\" in\n# whitespace. 
Also maybe improves generation quality?\nSPACE_RULE = '\" \"?'\n\n\ndef _build_repetition(\n    item_rule, min_items, max_items, separator_rule=None, item_rule_is_literal=False\n):\n    if not separator_rule:\n        if min_items == 0 and max_items == 1:\n            return f\"{item_rule}?\"\n        elif min_items == 1 and max_items is None:\n            return f\"{item_rule}+\"\n\n    result = \"\"\n\n    if min_items > 0:\n        if item_rule_is_literal and separator_rule is None:\n            result = '\"' + (item_rule[1:-1] * min_items) + '\"'\n        else:\n            result = (f\" {separator_rule} \" if separator_rule else \" \").join(\n                [item_rule] * min_items\n            )\n\n    def opt_repetitions(up_to_n, prefix_with_sep=False):\n        \"\"\"\n        - n=4, no sep:             '(a (a (a (a)?)?)?)?'\n        - n=4, sep=',', prefix:    '(\",\" a (\",\" a (\",\" a (\",\" a)?)?)?)?'\n        - n=4, sep=',', no prefix: '(a (\",\" a (\",\" a (\",\" a)?)?)?)?'\n        \"\"\"\n\n        content = (\n            f\"{separator_rule} {item_rule}\"\n            if prefix_with_sep and separator_rule\n            else item_rule\n        )\n        if up_to_n == 0:\n            return \"\"\n        elif up_to_n == 1:\n            return f\"({content})?\"\n        elif separator_rule and not prefix_with_sep:\n            return f\"({content} {opt_repetitions(up_to_n - 1, prefix_with_sep=True)})?\"\n        else:\n            return (f\"({content} \" * up_to_n).rstrip() + (\")?\" * up_to_n)\n\n    if min_items > 0 and max_items != min_items:\n        result += \" \"\n\n    if max_items is not None:\n        result += opt_repetitions(max_items - min_items, prefix_with_sep=min_items > 0)\n    else:\n        item_operator = f'({separator_rule + \" \" if separator_rule else \"\"}{item_rule})'\n\n        if min_items == 0 and separator_rule:\n            result = f\"({item_rule} {item_operator}*)?\"\n        else:\n            result += 
f\"{item_operator}*\"\n\n    return result\n\n\nclass BuiltinRule:\n    def __init__(self, content: str, deps: list = None):\n        self.content = content\n        self.deps = deps or []\n\n\n_up_to_15_digits = _build_repetition(\"[0-9]\", 0, 15)\n\nPRIMITIVE_RULES = {\n    \"boolean\": BuiltinRule('(\"true\" | \"false\") space', []),\n    \"decimal-part\": BuiltinRule(\"[0-9] \" + _up_to_15_digits, []),\n    \"integral-part\": BuiltinRule(\"[0-9] | [1-9] \" + _up_to_15_digits, []),\n    \"number\": BuiltinRule(\n        '(\"-\"? integral-part) (\".\" decimal-part)? ([eE] [-+]? integral-part)? space',\n        [\"integral-part\", \"decimal-part\"],\n    ),\n    \"integer\": BuiltinRule('(\"-\"? integral-part) space', [\"integral-part\"]),\n    \"value\": BuiltinRule(\n        \"object | array | string | number | boolean | null\",\n        [\"object\", \"array\", \"string\", \"number\", \"boolean\", \"null\"],\n    ),\n    \"object\": BuiltinRule(\n        '\"{\" space ( string \":\" space value (\",\" space string \":\" space value)* )? \"}\" space',\n        [\"string\", \"value\"],\n    ),\n    \"array\": BuiltinRule(\n        '\"[\" space ( value (\",\" space value)* )? 
\"]\" space', [\"value\"]\n    ),\n    \"uuid\": BuiltinRule(\n        r'\"\\\"\" '\n        + ' \"-\" '.join(\"[0-9a-fA-F]\" * n for n in [8, 4, 4, 4, 12])\n        + r' \"\\\"\" space',\n        [],\n    ),\n    \"char\": BuiltinRule(\n        r'[^\"\\\\] | \"\\\\\" ([\"\\\\/bfnrt] | \"u\" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])',\n        [],\n    ),\n    \"string\": BuiltinRule(r'\"\\\"\" char* \"\\\"\" space', [\"char\"]),\n    \"null\": BuiltinRule('\"null\" space', []),\n}\n\n# TODO: support \"uri\", \"email\" string formats\nSTRING_FORMAT_RULES = {\n    \"date\": BuiltinRule(\n        '[0-9] [0-9] [0-9] [0-9] \"-\" ( \"0\" [1-9] | \"1\" [0-2] ) \"-\" ( \"0\" [1-9] | [1-2] [0-9] | \"3\" [0-1] )',\n        [],\n    ),\n    \"time\": BuiltinRule(\n        '([01] [0-9] | \"2\" [0-3]) \":\" [0-5] [0-9] \":\" [0-5] [0-9] ( \".\" [0-9] [0-9] [0-9] )? ( \"Z\" | ( \"+\" | \"-\" ) ( [01] [0-9] | \"2\" [0-3] ) \":\" [0-5] [0-9] )',\n        [],\n    ),\n    \"date-time\": BuiltinRule('date \"T\" time', [\"date\", \"time\"]),\n    \"date-string\": BuiltinRule('\"\\\\\"\" date \"\\\\\"\" space', [\"date\"]),\n    \"time-string\": BuiltinRule('\"\\\\\"\" time \"\\\\\"\" space', [\"time\"]),\n    \"date-time-string\": BuiltinRule('\"\\\\\"\" date-time \"\\\\\"\" space', [\"date-time\"]),\n}\n\nDOTALL = \"[\\\\U00000000-\\\\U0010FFFF]\"\nDOT = \"[^\\\\x0A\\\\x0D]\"\n\nRESERVED_NAMES = set(\n    [\"root\", \"dot\", *PRIMITIVE_RULES.keys(), *STRING_FORMAT_RULES.keys()]\n)\n\n\nNON_LITERAL_SET = set(\"|.()[]{}*+?\")\nESCAPED_IN_REGEXPS_BUT_NOT_IN_LITERALS = set(\"[]()|{}*+?\")\n\n\nclass SchemaConverter:\n    def __init__(self, *, prop_order, allow_fetch, dotall, raw_pattern):\n        self._prop_order = prop_order\n        self._allow_fetch = allow_fetch\n        self._dotall = dotall\n        self._raw_pattern = raw_pattern\n        self._rules = {\n            \"space\": SPACE_RULE,\n        }\n        self._refs = {}\n        self._refs_being_resolved = set()\n\n 
   def _format_literal(self, literal):\n        escaped = GRAMMAR_LITERAL_ESCAPE_RE.sub(\n            lambda m: GRAMMAR_LITERAL_ESCAPES.get(m.group(0)), literal\n        )\n        return f'\"{escaped}\"'\n\n    def not_literal(\n        self, literal: str, dotall: bool = True, maybe_escaped_underscores=False\n    ) -> str:\n        \"\"\"\n        not_literal('a') -> '[^a]'\n        not_literal('abc') -> '([^a] | \"a\" ([^b] | \"b\" ([^c])?)?)?'\n        \"\"\"\n        assert len(literal) > 0, \"Empty literal not supported\"\n\n        def recurse(i: int):\n            c = literal[i]\n            if maybe_escaped_underscores and c == \"_\":\n                yield f\"[^{c}\\\\\\\\]\"\n                yield \" | \"\n                yield f'\"\\\\\\\\\"? \"{c}\"'\n            else:\n                yield f\"[^{c}]\"\n            if i < len(literal) - 1:\n                yield \" | \"\n                yield self._format_literal(c)\n                yield \" (\"\n                yield from recurse(i + 1)\n                yield \")?\"\n\n        return \"\".join((\"(\", *recurse(0), \")\"))\n\n    def _add_rule(self, name, rule):\n        esc_name = INVALID_RULE_CHARS_RE.sub(\"-\", name)\n        if esc_name not in self._rules or self._rules[esc_name] == rule:\n            key = esc_name\n        else:\n            i = 0\n            while (\n                f\"{esc_name}{i}\" in self._rules\n                and self._rules[f\"{esc_name}{i}\"] != rule\n            ):\n                i += 1\n            key = f\"{esc_name}{i}\"\n        self._rules[key] = rule\n        return key\n\n    def resolve_refs(self, schema: dict, url: str):\n        \"\"\"\n        Resolves all $ref fields in the given schema, fetching any remote schemas,\n        replacing $ref with absolute reference URL and populating self._refs with the\n        respective referenced (sub)schema dictionaries.\n        \"\"\"\n\n        def visit(n: dict):\n            if isinstance(n, list):\n              
  return [visit(x) for x in n]\n            elif isinstance(n, dict):\n                ref = n.get(\"$ref\")\n                if ref is not None and ref not in self._refs:\n                    if ref.startswith(\"https://\"):\n                        assert (\n                            self._allow_fetch\n                        ), \"Fetching remote schemas is not allowed (use --allow-fetch for force)\"\n                        import requests\n\n                        frag_split = ref.split(\"#\")\n                        base_url = frag_split[0]\n\n                        target = self._refs.get(base_url)\n                        if target is None:\n                            target = self.resolve_refs(\n                                requests.get(ref).json(), base_url\n                            )\n                            self._refs[base_url] = target\n\n                        if len(frag_split) == 1 or frag_split[-1] == \"\":\n                            return target\n                    elif ref.startswith(\"#/\"):\n                        target = schema\n                        ref = f\"{url}{ref}\"\n                        n[\"$ref\"] = ref\n                    else:\n                        raise ValueError(f\"Unsupported ref {ref}\")\n\n                    for sel in ref.split(\"#\")[-1].split(\"/\")[1:]:\n                        assert (\n                            target is not None and sel in target\n                        ), f\"Error resolving ref {ref}: {sel} not in {target}\"\n                        target = target[sel]\n\n                    self._refs[ref] = target\n                else:\n                    for v in n.values():\n                        visit(v)\n\n            return n\n\n        return visit(schema)\n\n    def _generate_union_rule(self, name, alt_schemas):\n        return \" | \".join(\n            (\n                self.visit(alt_schema, f'{name}{\"-\" if name else \"alternative-\"}{i}')\n                for i, 
alt_schema in enumerate(alt_schemas)\n            )\n        )\n\n    def _visit_pattern(self, pattern, name):\n        \"\"\"\n        Transforms a regular expression pattern into a GBNF rule.\n\n        Input: https://json-schema.org/understanding-json-schema/reference/regular_expressions\n        Output: https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md\n\n        Unsupported features: negative/positive lookaheads, greedy/non-greedy modifiers.\n\n        Mostly a 1:1 translation, except for {x} / {x,} / {x,y} quantifiers for which\n        we define sub-rules to keep the output lean.\n        \"\"\"\n\n        assert pattern.startswith(\"^\") and pattern.endswith(\n            \"$\"\n        ), 'Pattern must start with \"^\" and end with \"$\"'\n        pattern = pattern[1:-1]\n        sub_rule_ids = {}\n\n        i = 0\n        length = len(pattern)\n\n        def to_rule(s: Tuple[str, bool]) -> str:\n            (txt, is_literal) = s\n            return '\"' + txt + '\"' if is_literal else txt\n\n        def transform() -> Tuple[str, bool]:\n            \"\"\"\n            Parse a unit at index i (advancing it), and return its string representation + whether it's a literal.\n            \"\"\"\n            nonlocal i\n            nonlocal pattern\n            nonlocal sub_rule_ids\n\n            start = i\n            # For each component of this sequence, store its string representation and whether it's a literal.\n            # We only need a flat structure here to apply repetition operators to the last item, and\n            # to merge literals at the end (we're parsing grouped ( sequences ) recursively and don't treat '|' specially\n            # (GBNF's syntax is luckily very close to regular expressions!)\n            seq: list[Tuple[str, bool]] = []\n\n            def get_dot():\n                if self._dotall:\n                    rule = DOTALL\n                else:\n                    # Accept any character... 
except \\n and \\r line break chars (\\x0A and \\x0D)\n                    rule = DOT\n                return self._add_rule(\"dot\", rule)\n\n            def join_seq():\n                nonlocal seq\n                ret = []\n                for is_literal, g in groupby(seq, lambda x: x[1]):\n                    if is_literal:\n                        ret.append((\"\".join(x[0] for x in g), True))\n                    else:\n                        ret.extend(g)\n                if len(ret) == 1:\n                    return ret[0]\n                return (\" \".join(to_rule(x) for x in seq), False)\n\n            while i < length:\n                c = pattern[i]\n                if c == \".\":\n                    seq.append((get_dot(), False))\n                    i += 1\n                elif c == \"(\":\n                    i += 1\n                    if i < length:\n                        assert (\n                            pattern[i] != \"?\"\n                        ), f'Unsupported pattern syntax \"{pattern[i]}\" at index {i} of /{pattern}/'\n                    seq.append((f\"({to_rule(transform())})\", False))\n                elif c == \")\":\n                    i += 1\n                    assert (\n                        start > 0 and pattern[start - 1] == \"(\"\n                    ), f\"Unbalanced parentheses; start = {start}, i = {i}, pattern = {pattern}\"\n                    return join_seq()\n                elif c == \"[\":\n                    square_brackets = c\n                    i += 1\n                    while i < length and pattern[i] != \"]\":\n                        if pattern[i] == \"\\\\\":\n                            square_brackets += pattern[i : i + 2]\n                            i += 2\n                        else:\n                            square_brackets += pattern[i]\n                            i += 1\n                    assert (\n                        i < length\n                    ), f\"Unbalanced square 
brackets; start = {start}, i = {i}, pattern = {pattern}\"\n                    square_brackets += \"]\"\n                    i += 1\n                    seq.append((square_brackets, False))\n                elif c == \"|\":\n                    seq.append((\"|\", False))\n                    i += 1\n                elif c in (\"*\", \"+\", \"?\"):\n                    seq[-1] = (to_rule(seq[-1]) + c, False)\n                    i += 1\n                elif c == \"{\":\n                    curly_brackets = c\n                    i += 1\n                    while i < length and pattern[i] != \"}\":\n                        curly_brackets += pattern[i]\n                        i += 1\n                    assert (\n                        i < length\n                    ), f\"Unbalanced curly brackets; start = {start}, i = {i}, pattern = {pattern}\"\n                    curly_brackets += \"}\"\n                    i += 1\n                    nums = [s.strip() for s in curly_brackets[1:-1].split(\",\")]\n                    min_times = 0\n                    max_times = None\n                    try:\n                        if len(nums) == 1:\n                            min_times = int(nums[0])\n                            max_times = min_times\n                        else:\n                            assert len(nums) == 2\n                            min_times = int(nums[0]) if nums[0] else 0\n                            max_times = int(nums[1]) if nums[1] else None\n                    except ValueError:\n                        raise ValueError(\n                            f\"Invalid quantifier {curly_brackets} in /{pattern}/\"\n                        )\n\n                    (sub, sub_is_literal) = seq[-1]\n\n                    if not sub_is_literal:\n                        id = sub_rule_ids.get(sub)\n                        if id is None:\n                            id = self._add_rule(f\"{name}-{len(sub_rule_ids) + 1}\", sub)\n                            
sub_rule_ids[sub] = id\n                        sub = id\n\n                    seq[-1] = (\n                        _build_repetition(\n                            f'\"{sub}\"' if sub_is_literal else sub,\n                            min_times,\n                            max_times,\n                            item_rule_is_literal=sub_is_literal,\n                        ),\n                        False,\n                    )\n                else:\n                    literal = \"\"\n                    while i < length:\n                        if pattern[i] == \"\\\\\" and i < length - 1:\n                            next = pattern[i + 1]\n                            if next in ESCAPED_IN_REGEXPS_BUT_NOT_IN_LITERALS:\n                                i += 1\n                                literal += pattern[i]\n                                i += 1\n                            else:\n                                literal += pattern[i : i + 2]\n                                i += 2\n                        elif pattern[i] == '\"' and not self._raw_pattern:\n                            literal += '\\\\\"'\n                            i += 1\n                        elif pattern[i] not in NON_LITERAL_SET and (\n                            i == length - 1\n                            or literal == \"\"\n                            or pattern[i + 1] == \".\"\n                            or pattern[i + 1] not in NON_LITERAL_SET\n                        ):\n                            literal += pattern[i]\n                            i += 1\n                        else:\n                            break\n                    if literal:\n                        seq.append((literal, True))\n\n            return join_seq()\n\n        return self._add_rule(\n            name,\n            (\n                to_rule(transform())\n                if self._raw_pattern\n                else '\"\\\\\"\" ' + to_rule(transform()) + ' \"\\\\\"\" space'\n            
),\n        )\n\n    def _resolve_ref(self, ref):\n        ref_name = ref.split(\"/\")[-1]\n        if ref_name not in self._rules and ref not in self._refs_being_resolved:\n            self._refs_being_resolved.add(ref)\n            resolved = self._refs[ref]\n            ref_name = self.visit(resolved, ref_name)\n            self._refs_being_resolved.remove(ref)\n        return ref_name\n\n    def _generate_constant_rule(self, value):\n        return self._format_literal(json.dumps(value))\n\n    def visit(self, schema, name):\n        schema_type = schema.get(\"type\")\n        schema_format = schema.get(\"format\")\n        rule_name = name + \"-\" if name in RESERVED_NAMES else name or \"root\"\n\n        if (ref := schema.get(\"$ref\")) is not None:\n            return self._add_rule(rule_name, self._resolve_ref(ref))\n\n        elif \"oneOf\" in schema or \"anyOf\" in schema:\n            return self._add_rule(\n                rule_name,\n                self._generate_union_rule(name, schema.get(\"oneOf\") or schema[\"anyOf\"]),\n            )\n\n        elif isinstance(schema_type, list):\n            return self._add_rule(\n                rule_name,\n                self._generate_union_rule(name, [{\"type\": t} for t in schema_type]),\n            )\n\n        elif \"const\" in schema:\n            return self._add_rule(\n                rule_name, self._generate_constant_rule(schema[\"const\"])\n            )\n\n        elif \"enum\" in schema:\n            rule = \" | \".join((self._generate_constant_rule(v) for v in schema[\"enum\"]))\n            return self._add_rule(rule_name, rule)\n\n        elif schema_type in (None, \"object\") and (\n            \"properties\" in schema\n            or (\n                \"additionalProperties\" in schema\n                and schema[\"additionalProperties\"] is not True\n            )\n        ):\n            required = set(schema.get(\"required\", []))\n            properties = 
list(schema.get(\"properties\", {}).items())\n            return self._add_rule(\n                rule_name,\n                self._build_object_rule(\n                    properties, required, name, schema.get(\"additionalProperties\")\n                ),\n            )\n\n        elif schema_type in (None, \"object\") and \"allOf\" in schema:\n            required = set()\n            properties = []\n            hybrid_name = name\n\n            def add_component(comp_schema, is_required):\n                if (ref := comp_schema.get(\"$ref\")) is not None:\n                    comp_schema = self._refs[ref]\n\n                if \"properties\" in comp_schema:\n                    for prop_name, prop_schema in comp_schema[\"properties\"].items():\n                        properties.append((prop_name, prop_schema))\n                        if is_required:\n                            required.add(prop_name)\n\n            for t in schema[\"allOf\"]:\n                if \"anyOf\" in t:\n                    for tt in t[\"anyOf\"]:\n                        add_component(tt, is_required=False)\n                else:\n                    add_component(t, is_required=True)\n\n            return self._add_rule(\n                rule_name,\n                self._build_object_rule(\n                    properties, required, hybrid_name, additional_properties=[]\n                ),\n            )\n\n        elif schema_type in (None, \"array\") and (\n            \"items\" in schema or \"prefixItems\" in schema\n        ):\n            items = schema.get(\"items\") or schema[\"prefixItems\"]\n            if isinstance(items, list):\n                return self._add_rule(\n                    rule_name,\n                    '\"[\" space '\n                    + ' \",\" space '.join(\n                        self.visit(item, f'{name}{\"-\" if name else \"\"}tuple-{i}')\n                        for i, item in enumerate(items)\n                    )\n                    + ' 
\"]\" space',\n                )\n            else:\n                item_rule_name = self.visit(items, f'{name}{\"-\" if name else \"\"}item')\n                min_items = schema.get(\"minItems\", 0)\n                max_items = schema.get(\"maxItems\")\n                return self._add_rule(\n                    rule_name,\n                    '\"[\" space '\n                    + _build_repetition(\n                        item_rule_name, min_items, max_items, separator_rule='\",\" space'\n                    )\n                    + ' \"]\" space',\n                )\n\n        elif schema_type in (None, \"string\") and \"pattern\" in schema:\n            return self._visit_pattern(schema[\"pattern\"], rule_name)\n\n        elif schema_type in (None, \"string\") and re.match(\n            r\"^uuid[1-5]?$\", schema_format or \"\"\n        ):\n            return self._add_primitive(\n                \"root\" if rule_name == \"root\" else schema_format,\n                PRIMITIVE_RULES[\"uuid\"],\n            )\n\n        elif (\n            schema_type in (None, \"string\")\n            and f\"{schema_format}-string\" in STRING_FORMAT_RULES\n        ):\n            prim_name = f\"{schema_format}-string\"\n            return self._add_rule(\n                rule_name,\n                self._add_primitive(prim_name, STRING_FORMAT_RULES[prim_name]),\n            )\n\n        elif schema_type == \"string\" and (\n            \"minLength\" in schema or \"maxLength\" in schema\n        ):\n            char_rule = self._add_primitive(\"char\", PRIMITIVE_RULES[\"char\"])\n            min_len = schema.get(\"minLength\", 0)\n            max_len = schema.get(\"maxLength\")\n\n            return self._add_rule(\n                rule_name,\n                r'\"\\\"\" '\n                + _build_repetition(char_rule, min_len, max_len)\n                + r' \"\\\"\" space',\n            )\n\n        elif (schema_type == \"object\") or (len(schema) == 0):\n            return 
self._add_rule(\n                rule_name, self._add_primitive(\"object\", PRIMITIVE_RULES[\"object\"])\n            )\n\n        else:\n            assert schema_type in PRIMITIVE_RULES, f\"Unrecognized schema: {schema}\"\n            # TODO: support minimum, maximum, exclusiveMinimum, exclusiveMaximum at least for zero\n            return self._add_primitive(\n                \"root\" if rule_name == \"root\" else schema_type,\n                PRIMITIVE_RULES[schema_type],\n            )\n\n    def _add_primitive(self, name: str, rule: BuiltinRule):\n        n = self._add_rule(name, rule.content)\n\n        for dep in rule.deps:\n            dep_rule = PRIMITIVE_RULES.get(dep) or STRING_FORMAT_RULES.get(dep)\n            assert dep_rule, f\"Rule {dep} not known\"\n            if dep not in self._rules:\n                self._add_primitive(dep, dep_rule)\n        return n\n\n    def _build_object_rule(\n        self,\n        properties: List[Tuple[str, Any]],\n        required: Set[str],\n        name: str,\n        additional_properties: Union[bool, Any],\n    ):\n        prop_order = self._prop_order\n        # sort by position in prop_order (if specified) then by original order\n        sorted_props = [\n            kv[0]\n            for _, kv in sorted(\n                enumerate(properties),\n                key=lambda ikv: (prop_order.get(ikv[1][0], len(prop_order)), ikv[0]),\n            )\n        ]\n\n        prop_kv_rule_names = {}\n        for prop_name, prop_schema in properties:\n            prop_rule_name = self.visit(\n                prop_schema, f'{name}{\"-\" if name else \"\"}{prop_name}'\n            )\n            prop_kv_rule_names[prop_name] = self._add_rule(\n                f'{name}{\"-\" if name else \"\"}{prop_name}-kv',\n                rf'{self._format_literal(json.dumps(prop_name))} space \":\" space {prop_rule_name}',\n            )\n        required_props = [k for k in sorted_props if k in required]\n        optional_props = [k 
for k in sorted_props if k not in required]\n\n        if additional_properties == True or isinstance(additional_properties, dict):\n            sub_name = f'{name}{\"-\" if name else \"\"}additional'\n            value_rule = self.visit(\n                {} if additional_properties == True else additional_properties,\n                f\"{sub_name}-value\",\n            )\n            prop_kv_rule_names[\"*\"] = self._add_rule(\n                f\"{sub_name}-kv\",\n                self._add_primitive(\"string\", PRIMITIVE_RULES[\"string\"])\n                + f' \":\" space {value_rule}',\n            )\n            optional_props.append(\"*\")\n\n        rule = '\"{\" space '\n        rule += ' \",\" space '.join(prop_kv_rule_names[k] for k in required_props)\n\n        if optional_props:\n            rule += \" (\"\n            if required_props:\n                rule += ' \",\" space ( '\n\n            def get_recursive_refs(ks, first_is_optional):\n                [k, *rest] = ks\n                kv_rule_name = prop_kv_rule_names[k]\n                if k == \"*\":\n                    res = self._add_rule(\n                        f'{name}{\"-\" if name else \"\"}additional-kvs',\n                        f'{kv_rule_name} ( \",\" space ' + kv_rule_name + \" )*\",\n                    )\n                elif first_is_optional:\n                    res = f'( \",\" space {kv_rule_name} )?'\n                else:\n                    res = kv_rule_name\n                if len(rest) > 0:\n                    res += \" \" + self._add_rule(\n                        f'{name}{\"-\" if name else \"\"}{k}-rest',\n                        get_recursive_refs(rest, first_is_optional=True),\n                    )\n                return res\n\n            rule += \" | \".join(\n                get_recursive_refs(optional_props[i:], first_is_optional=False)\n                for i in range(len(optional_props))\n            )\n            if required_props:\n                rule 
+= \" )\"\n            rule += \" )?\"\n\n        rule += ' \"}\" space'\n\n        return rule\n\n    def format_grammar(self):\n        return \"\\n\".join(\n            f\"{name} ::= {rule}\"\n            for name, rule in sorted(self._rules.items(), key=lambda kv: kv[0])\n        )\n\n\ndef json_schema_to_gbnf(schema: str, prop_order: Optional[List[str]] = None):\n    prop_order = prop_order or []\n    schema = json.loads(schema)\n    prop_order = {name: idx for idx, name in enumerate(prop_order)}\n    converter = SchemaConverter(\n        prop_order=prop_order, allow_fetch=False, dotall=False, raw_pattern=False\n    )\n    schema = converter.resolve_refs(schema, \"stdin\")\n    converter.visit(schema, \"\")\n    return converter.format_grammar()\n"
  },
  {
    "path": "llama_cpp/llama_speculative.py",
    "content": "import abc\n\nfrom typing import Any\n\nimport numpy as np\nimport numpy.typing as npt\n\n\nclass LlamaDraftModel(abc.ABC):\n    @abc.abstractmethod\n    def __call__(\n        self, input_ids: npt.NDArray[np.intc], /, **kwargs: Any\n    ) -> npt.NDArray[np.intc]:\n        raise NotImplementedError()\n\n\nclass LlamaPromptLookupDecoding(LlamaDraftModel):\n    \"\"\"Based on https://github.com/apoorvumang/prompt-lookup-decoding\"\"\"\n\n    def __init__(self, max_ngram_size: int = 2, num_pred_tokens: int = 10):\n        self.max_ngram_size = max_ngram_size\n        self.num_pred_tokens = num_pred_tokens\n\n    @staticmethod\n    def find_candidate_pred_tokens(\n        input_ids: npt.NDArray[np.intc],\n        max_ngram_size: int,\n        num_pred_tokens: int,\n    ):\n        input_length = input_ids.shape[0]\n\n        for ngram_size in range(min(max_ngram_size, input_length - 1), 0, -1):\n            # Create sliding windows of size ngram_size\n            windows = np.lib.stride_tricks.sliding_window_view(input_ids, (ngram_size,))\n\n            # Convert ngram to an array for comparison\n            ngram_array = input_ids[-ngram_size:]\n\n            # Find where the windows match the ngram\n            matches = np.all(windows == ngram_array, axis=1)\n\n            # Get the indices of matches\n            match_indices = np.nonzero(matches)[0]\n\n            # Iterate through match indices to find a valid continuation\n            for idx in match_indices:\n                start_idx = idx + ngram_size\n                end_idx = start_idx + num_pred_tokens\n                end_idx = min(end_idx, input_length)\n\n                if start_idx < end_idx:\n                    return input_ids[start_idx:end_idx]\n\n        # If no match is found, return an empty array\n        return np.array([], dtype=np.intc)\n\n    def __call__(\n        self, input_ids: npt.NDArray[np.intc], /, **kwargs: Any\n    ) -> npt.NDArray[np.intc]:\n        return 
self.find_candidate_pred_tokens(\n            input_ids=input_ids,\n            max_ngram_size=self.max_ngram_size,\n            num_pred_tokens=self.num_pred_tokens,\n        )\n"
  },
  {
    "path": "llama_cpp/llama_tokenizer.py",
    "content": "from __future__ import annotations\n\nimport abc\nfrom typing import (\n    List,\n    Optional,\n    Any,\n)\n\nimport llama_cpp\nfrom llama_cpp.llama_types import List\n\n\nclass BaseLlamaTokenizer(abc.ABC):\n    @abc.abstractmethod\n    def tokenize(\n        self, text: bytes, add_bos: bool = True, special: bool = True\n    ) -> List[int]:\n        \"\"\"Tokenize the text into tokens.\n\n        Args:\n            text: The utf-8 encoded string to tokenize.\n            add_bos: Whether to add a beginning of sequence token.\n            special: Whether to tokenize special tokens.\n        \"\"\"\n        raise NotImplementedError\n\n    @abc.abstractmethod\n    def detokenize(\n        self,\n        tokens: List[int],\n        prev_tokens: Optional[List[int]] = None,\n        special: bool = False,\n    ) -> bytes:\n        \"\"\"Detokenize the tokens into text.\n\n        Args:\n            tokens: The list of tokens to detokenize.\n            prev_tokens: The list of previous tokens. 
Offset mapping will be performed if provided.\n            special: Whether to detokenize special tokens.\n        \"\"\"\n        raise NotImplementedError\n\n\nclass LlamaTokenizer(BaseLlamaTokenizer):\n    def __init__(self, llama: llama_cpp.Llama):\n        self._model = llama._model  # type: ignore\n\n    def tokenize(\n        self, text: bytes, add_bos: bool = True, special: bool = True\n    ) -> List[int]:\n        return self._model.tokenize(text, add_bos=add_bos, special=special)\n\n    def detokenize(\n        self,\n        tokens: List[int],\n        prev_tokens: Optional[List[int]] = None,\n        special: bool = False,\n    ) -> bytes:\n        return self._model.detokenize(tokens, special=special)\n\n    def encode(\n        self, text: str, add_bos: bool = True, special: bool = True\n    ) -> List[int]:\n        return self.tokenize(\n            text.encode(\"utf-8\", errors=\"ignore\"), add_bos=add_bos, special=special\n        )\n\n    def decode(self, tokens: List[int]) -> str:\n        return self.detokenize(tokens).decode(\"utf-8\", errors=\"ignore\")\n\n    @classmethod\n    def from_ggml_file(cls, path: str) -> \"LlamaTokenizer\":\n        return cls(llama_cpp.Llama(model_path=path, vocab_only=True))\n\n\nclass LlamaHFTokenizer(BaseLlamaTokenizer):\n    def __init__(self, hf_tokenizer: Any):\n        self.hf_tokenizer = hf_tokenizer\n\n    def tokenize(\n        self, text: bytes, add_bos: bool = True, special: bool = True\n    ) -> List[int]:\n        return self.hf_tokenizer.encode(\n            text.decode(\"utf-8\", errors=\"ignore\"), add_special_tokens=special\n        )\n\n    def detokenize(\n        self,\n        tokens: List[int],\n        prev_tokens: Optional[List[int]] = None,\n        special: bool = False,\n    ) -> bytes:\n        skip_special_tokens = not special\n        if prev_tokens is not None:\n            text = self.hf_tokenizer.decode(\n                prev_tokens + tokens, 
skip_special_tokens=skip_special_tokens\n            ).encode(\"utf-8\", errors=\"ignore\")\n            prev_text = self.hf_tokenizer.decode(\n                prev_tokens, skip_special_tokens=skip_special_tokens\n            ).encode(\"utf-8\", errors=\"ignore\")\n            return text[len(prev_text) :]\n        else:\n            return self.hf_tokenizer.decode(\n                tokens, skip_special_tokens=skip_special_tokens\n            ).encode(\"utf-8\", errors=\"ignore\")\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path: str) -> \"LlamaHFTokenizer\":\n        try:\n            from transformers import AutoTokenizer\n        except ImportError:\n            raise ImportError(\n                \"The `transformers` library is required to use the `LlamaHFTokenizer`. \"\n                \"You can install it with `pip install transformers`.\"\n            )\n        hf_tokenizer = AutoTokenizer.from_pretrained(\n            pretrained_model_name_or_path=pretrained_model_name_or_path\n        )\n        return cls(hf_tokenizer)\n"
  },
  {
    "path": "llama_cpp/llama_types.py",
    "content": "\"\"\"Types and request signatures for OpenAI compatibility\n\nNOTE: These types may change to match the OpenAI OpenAPI specification.\n\nBased on the OpenAI OpenAPI specification:\nhttps://github.com/openai/openai-openapi/blob/master/openapi.yaml\n\n\"\"\"\n\nfrom typing import Any, List, Optional, Dict, Union\nfrom typing_extensions import TypedDict, NotRequired, Literal\n\n\n# NOTE: Defining this correctly using annotations seems to break pydantic validation.\n#       This is a workaround until we can figure out how to do this correctly\n# JsonType = Union[None, int, str, bool, List[\"JsonType\"], Dict[str, \"JsonType\"]]\nJsonType = Union[None, int, str, bool, List[Any], Dict[str, Any]]\n\n\nclass EmbeddingUsage(TypedDict):\n    prompt_tokens: int\n    total_tokens: int\n\n\nclass Embedding(TypedDict):\n    index: int\n    object: str\n    embedding: Union[List[float], List[List[float]]]\n\n\nclass CreateEmbeddingResponse(TypedDict):\n    object: Literal[\"list\"]\n    model: str\n    data: List[Embedding]\n    usage: EmbeddingUsage\n\n\nclass CompletionLogprobs(TypedDict):\n    text_offset: List[int]\n    token_logprobs: List[Optional[float]]\n    tokens: List[str]\n    top_logprobs: List[Optional[Dict[str, float]]]\n\n\nclass CompletionChoice(TypedDict):\n    text: str\n    index: int\n    logprobs: Optional[CompletionLogprobs]\n    finish_reason: Optional[Literal[\"stop\", \"length\"]]\n\n\nclass CompletionUsage(TypedDict):\n    prompt_tokens: int\n    completion_tokens: int\n    total_tokens: int\n\n\nclass CreateCompletionResponse(TypedDict):\n    id: str\n    object: Literal[\"text_completion\"]\n    created: int\n    model: str\n    choices: List[CompletionChoice]\n    usage: NotRequired[CompletionUsage]\n\n\nclass ChatCompletionResponseFunctionCall(TypedDict):\n    name: str\n    arguments: str\n\n\nclass ChatCompletionResponseMessage(TypedDict):\n    content: Optional[str]\n    tool_calls: 
NotRequired[\"ChatCompletionMessageToolCalls\"]\n    role: Literal[\"assistant\", \"function\"]  # NOTE: \"function\" may be incorrect here\n    function_call: NotRequired[ChatCompletionResponseFunctionCall]  # DEPRECATED\n\n\nclass ChatCompletionFunction(TypedDict):\n    name: str\n    description: NotRequired[str]\n    parameters: Dict[str, JsonType]  # TODO: make this more specific\n\n\nclass ChatCompletionTopLogprobToken(TypedDict):\n    token: str\n    logprob: float\n    bytes: Optional[List[int]]\n\n\nclass ChatCompletionLogprobToken(ChatCompletionTopLogprobToken):\n    token: str\n    logprob: float\n    bytes: Optional[List[int]]\n    top_logprobs: List[ChatCompletionTopLogprobToken]\n\n\nclass ChatCompletionLogprobs(TypedDict):\n    content: Optional[List[ChatCompletionLogprobToken]]\n    refusal: Optional[List[ChatCompletionLogprobToken]]\n\n\nclass ChatCompletionResponseChoice(TypedDict):\n    index: int\n    message: \"ChatCompletionResponseMessage\"\n    logprobs: Optional[ChatCompletionLogprobs]\n    finish_reason: Optional[str]\n\n\nclass CreateChatCompletionResponse(TypedDict):\n    id: str\n    object: Literal[\"chat.completion\"]\n    created: int\n    model: str\n    choices: List[\"ChatCompletionResponseChoice\"]\n    usage: CompletionUsage\n\n\nclass ChatCompletionMessageToolCallChunkFunction(TypedDict):\n    name: Optional[str]\n    arguments: str\n\n\nclass ChatCompletionMessageToolCallChunk(TypedDict):\n    index: int\n    id: NotRequired[str]\n    type: Literal[\"function\"]\n    function: ChatCompletionMessageToolCallChunkFunction\n\n\nclass ChatCompletionStreamResponseDeltaEmpty(TypedDict):\n    pass\n\n\nclass ChatCompletionStreamResponseDeltaFunctionCall(TypedDict):\n    name: str\n    arguments: str\n\n\nclass ChatCompletionStreamResponseDelta(TypedDict):\n    content: NotRequired[Optional[str]]\n    function_call: NotRequired[\n        Optional[ChatCompletionStreamResponseDeltaFunctionCall]\n    ]  # DEPRECATED\n    tool_calls: 
NotRequired[Optional[List[ChatCompletionMessageToolCallChunk]]]\n    role: NotRequired[Optional[Literal[\"system\", \"user\", \"assistant\", \"tool\"]]]\n\n\nclass ChatCompletionStreamResponseChoice(TypedDict):\n    index: int\n    delta: Union[\n        ChatCompletionStreamResponseDelta, ChatCompletionStreamResponseDeltaEmpty\n    ]\n    finish_reason: Optional[Literal[\"stop\", \"length\", \"tool_calls\", \"function_call\"]]\n    logprobs: NotRequired[Optional[ChatCompletionLogprobs]]\n\n\nclass CreateChatCompletionStreamResponse(TypedDict):\n    id: str\n    model: str\n    object: Literal[\"chat.completion.chunk\"]\n    created: int\n    choices: List[ChatCompletionStreamResponseChoice]\n\n\nclass ChatCompletionFunctions(TypedDict):\n    name: str\n    description: NotRequired[str]\n    parameters: Dict[str, JsonType]  # TODO: make this more specific\n\n\nclass ChatCompletionFunctionCallOption(TypedDict):\n    name: str\n\n\nclass ChatCompletionRequestResponseFormat(TypedDict):\n    type: Literal[\"text\", \"json_object\"]\n    schema: NotRequired[\n        JsonType\n    ]  # https://docs.endpoints.anyscale.com/guides/json_mode/\n\n\nclass ChatCompletionRequestMessageContentPartText(TypedDict):\n    type: Literal[\"text\"]\n    text: str\n\n\nclass ChatCompletionRequestMessageContentPartImageImageUrl(TypedDict):\n    url: str\n    detail: NotRequired[Literal[\"auto\", \"low\", \"high\"]]\n\n\nclass ChatCompletionRequestMessageContentPartImage(TypedDict):\n    type: Literal[\"image_url\"]\n    image_url: Union[str, ChatCompletionRequestMessageContentPartImageImageUrl]\n\n\nChatCompletionRequestMessageContentPart = Union[\n    ChatCompletionRequestMessageContentPartText,\n    ChatCompletionRequestMessageContentPartImage,\n]\n\n\nclass ChatCompletionRequestSystemMessage(TypedDict):\n    role: Literal[\"system\"]\n    content: Optional[str]\n\n\nclass ChatCompletionRequestUserMessage(TypedDict):\n    role: Literal[\"user\"]\n    content: Optional[Union[str, 
List[ChatCompletionRequestMessageContentPart]]]\n\n\nclass ChatCompletionMessageToolCallFunction(TypedDict):\n    name: str\n    arguments: str\n\n\nclass ChatCompletionMessageToolCall(TypedDict):\n    id: str\n    type: Literal[\"function\"]\n    function: ChatCompletionMessageToolCallFunction\n\n\nChatCompletionMessageToolCalls = List[ChatCompletionMessageToolCall]\n\n\nclass ChatCompletionRequestAssistantMessageFunctionCall(TypedDict):\n    name: str\n    arguments: str\n\n\nclass ChatCompletionRequestAssistantMessage(TypedDict):\n    role: Literal[\"assistant\"]\n    content: NotRequired[str]\n    tool_calls: NotRequired[ChatCompletionMessageToolCalls]\n    function_call: NotRequired[\n        ChatCompletionRequestAssistantMessageFunctionCall\n    ]  # DEPRECATED\n\n\nclass ChatCompletionRequestToolMessage(TypedDict):\n    role: Literal[\"tool\"]\n    content: Optional[str]\n    tool_call_id: str\n\n\nclass ChatCompletionRequestFunctionMessage(TypedDict):\n    role: Literal[\"function\"]\n    content: Optional[str]\n    name: str\n\n\nChatCompletionRequestMessage = Union[\n    ChatCompletionRequestSystemMessage,\n    ChatCompletionRequestUserMessage,\n    ChatCompletionRequestAssistantMessage,\n    ChatCompletionRequestToolMessage,\n    ChatCompletionRequestFunctionMessage,\n]\n\n\nclass ChatCompletionRequestFunctionCallOption(TypedDict):\n    name: str\n\n\nChatCompletionRequestFunctionCall = Union[\n    Literal[\"none\", \"auto\"], ChatCompletionRequestFunctionCallOption\n]\n\nChatCompletionFunctionParameters = Dict[str, JsonType]  # TODO: make this more specific\n\n\nclass ChatCompletionToolFunction(TypedDict):\n    name: str\n    description: NotRequired[str]\n    parameters: ChatCompletionFunctionParameters\n\n\nclass ChatCompletionTool(TypedDict):\n    type: Literal[\"function\"]\n    function: ChatCompletionToolFunction\n\n\nclass ChatCompletionNamedToolChoiceFunction(TypedDict):\n    name: str\n\n\nclass 
ChatCompletionNamedToolChoice(TypedDict):\n    type: Literal[\"function\"]\n    function: ChatCompletionNamedToolChoiceFunction\n\n\nChatCompletionToolChoiceOption = Union[\n    Literal[\"none\", \"auto\", \"required\"], ChatCompletionNamedToolChoice\n]\n\n\n# NOTE: The following type names are not part of the OpenAI OpenAPI specification\n# and will be removed in a future major release.\n\nEmbeddingData = Embedding\nCompletionChunk = CreateCompletionResponse\nCompletion = CreateCompletionResponse\nCreateCompletionStreamResponse = CreateCompletionResponse\nChatCompletionMessage = ChatCompletionResponseMessage\nChatCompletionChoice = ChatCompletionResponseChoice\nChatCompletion = CreateChatCompletionResponse\nChatCompletionChunkDeltaEmpty = ChatCompletionStreamResponseDeltaEmpty\nChatCompletionChunkChoice = ChatCompletionStreamResponseChoice\nChatCompletionChunkDelta = ChatCompletionStreamResponseDelta\nChatCompletionChunk = CreateChatCompletionStreamResponse\nChatCompletionStreamResponse = CreateChatCompletionStreamResponse\nChatCompletionResponseFunction = ChatCompletionFunction\nChatCompletionFunctionCall = ChatCompletionResponseFunctionCall\n"
  },
  {
    "path": "llama_cpp/llava_cpp.py",
    "content": "from __future__ import annotations\n\nimport os\nfrom ctypes import (\n    c_bool,\n    c_char_p,\n    c_int,\n    c_uint8,\n    c_float,\n    c_void_p,\n    POINTER,\n    _Pointer,  # type: ignore\n    Structure,\n)\nimport pathlib\nfrom typing import (\n    Union,\n    NewType,\n    Optional,\n    TYPE_CHECKING,\n)\n\nimport llama_cpp.llama_cpp as llama_cpp\n\nfrom llama_cpp._ctypes_extensions import (\n    load_shared_library,\n    ctypes_function_for_shared_library,\n)\n\nif TYPE_CHECKING:\n    from llama_cpp._ctypes_extensions import (\n        CtypesArray,\n    )\n\n\n# Specify the base name of the shared library to load\n_libllava_base_name = \"llava\"\n_libllava_override_path = os.environ.get(\"LLAVA_CPP_LIB\")\n_libllava_base_path = pathlib.Path(os.path.abspath(os.path.dirname(__file__))) / \"lib\" if _libllava_override_path is None else pathlib.Path()\n\n# Load the library\n_libllava = load_shared_library(_libllava_base_name, _libllava_base_path)\n\nctypes_function = ctypes_function_for_shared_library(_libllava)\n\n\n################################################\n# llava.h\n################################################\n\n# struct clip_ctx;\nclip_ctx_p = NewType(\"clip_ctx_p\", int)\nclip_ctx_p_ctypes = c_void_p\n\n\n# struct llava_image_embed {\n#     float * embed;\n#     int n_image_pos;\n# };\nclass llava_image_embed(Structure):\n    _fields_ = [\n        (\"embed\", POINTER(c_float)),\n        (\"n_image_pos\", c_int),\n    ]\n\n\n# /** sanity check for clip <-> llava embed size match */\n# LLAVA_API bool llava_validate_embed_size(const llama_context * ctx_llama, const clip_ctx * ctx_clip);\n@ctypes_function(\n    \"llava_validate_embed_size\",\n    [llama_cpp.llama_context_p_ctypes, clip_ctx_p_ctypes],\n    c_bool,\n)\ndef llava_validate_embed_size(\n    ctx_llama: llama_cpp.llama_context_p, ctx_clip: clip_ctx_p, /\n) -> bool:\n    ...\n\n\n# /** build an image embed from image file bytes */\n# LLAVA_API struct 
llava_image_embed * llava_image_embed_make_with_bytes(struct clip_ctx * ctx_clip, int n_threads, const unsigned char * image_bytes, int image_bytes_length);\n@ctypes_function(\n    \"llava_image_embed_make_with_bytes\",\n    [clip_ctx_p_ctypes, c_int, POINTER(c_uint8), c_int],\n    POINTER(llava_image_embed),\n)\ndef llava_image_embed_make_with_bytes(\n    ctx_clip: clip_ctx_p,\n    n_threads: Union[c_int, int],\n    image_bytes: CtypesArray[c_uint8],\n    image_bytes_length: Union[c_int, int],\n    /,\n) -> \"_Pointer[llava_image_embed]\":\n    ...\n\n\n# /** build an image embed from a path to an image filename */\n# LLAVA_API struct llava_image_embed * llava_image_embed_make_with_filename(struct clip_ctx * ctx_clip, int n_threads, const char * image_path);\n@ctypes_function(\n    \"llava_image_embed_make_with_filename\",\n    [clip_ctx_p_ctypes, c_int, c_char_p],\n    POINTER(llava_image_embed),\n)\ndef llava_image_embed_make_with_filename(\n    ctx_clip: clip_ctx_p, n_threads: Union[c_int, int], image_path: bytes, /\n) -> \"_Pointer[llava_image_embed]\":\n    ...\n\n\n# /** free an embedding made with llava_image_embed_make_* */\n# LLAVA_API void llava_image_embed_free(struct llava_image_embed * embed);\n@ctypes_function(\"llava_image_embed_free\", [POINTER(llava_image_embed)], None)\ndef llava_image_embed_free(embed: \"_Pointer[llava_image_embed]\", /):\n    ...\n\n\n# /** write the image represented by embed into the llama context with batch size n_batch, starting at context pos n_past. on completion, n_past points to the next position in the context after the image embed.
*/\n# LLAVA_API bool llava_eval_image_embed(struct llama_context * ctx_llama, const struct llava_image_embed * embed, int n_batch, int * n_past);\n@ctypes_function(\n    \"llava_eval_image_embed\",\n    [\n        llama_cpp.llama_context_p_ctypes,\n        POINTER(llava_image_embed),\n        c_int,\n        POINTER(c_int),\n    ],\n    c_bool,\n)\ndef llava_eval_image_embed(\n    ctx_llama: llama_cpp.llama_context_p,\n    embed: \"_Pointer[llava_image_embed]\",\n    n_batch: Union[c_int, int],\n    n_past: \"_Pointer[c_int]\",\n    /,\n) -> bool:\n    ...\n\n\n################################################\n# clip.h\n################################################\n\n\n# /** load mmproj model */\n# CLIP_API struct clip_ctx * clip_model_load    (const char * fname, int verbosity);\n@ctypes_function(\"clip_model_load\", [c_char_p, c_int], clip_ctx_p_ctypes)\ndef clip_model_load(\n    fname: bytes, verbosity: Union[c_int, int], /\n) -> Optional[clip_ctx_p]:\n    ...\n\n\n# /** free mmproj model */\n# CLIP_API void clip_free(struct clip_ctx * ctx);\n@ctypes_function(\"clip_free\", [clip_ctx_p_ctypes], None)\ndef clip_free(ctx: clip_ctx_p, /):\n    ...\n\n"
  },
  {
    "path": "llama_cpp/mtmd_cpp.py",
    "content": "from __future__ import annotations\n\nimport os\nfrom ctypes import (\n    c_bool,\n    c_char_p,\n    c_int,\n    c_uint8,\n    c_uint32,\n    c_float,\n    c_void_p,\n    c_size_t,\n    POINTER,\n    _Pointer,  # type: ignore\n    Structure,\n    byref,\n)\nimport pathlib\nfrom typing import (\n    Union,\n    NewType,\n    Optional,\n    TYPE_CHECKING,\n)\n\nimport llama_cpp.llama_cpp as llama_cpp\n\nfrom llama_cpp._ctypes_extensions import (\n    load_shared_library,\n    ctypes_function_for_shared_library,\n)\n\nif TYPE_CHECKING:\n    from llama_cpp._ctypes_extensions import (\n        CtypesArray,\n    )\n\n\n# Specify the base name of the shared library to load\n_libmtmd_base_name = \"mtmd\"\n_libmtmd_override_path = os.environ.get(\"MTMD_CPP_LIB\")\n_libmtmd_base_path = pathlib.Path(os.path.abspath(os.path.dirname(__file__))) / \"lib\" if _libmtmd_override_path is None else pathlib.Path()\n\n# Load the library\n_libmtmd = load_shared_library(_libmtmd_base_name, _libmtmd_base_path)\n\nctypes_function = ctypes_function_for_shared_library(_libmtmd)\n\n################################################\n# mtmd.h types\n################################################\n\n# Opaque types\nmtmd_context_p = NewType(\"mtmd_context_p\", int)\nmtmd_context_p_ctypes = c_void_p\n\nmtmd_bitmap_p = NewType(\"mtmd_bitmap_p\", int)\nmtmd_bitmap_p_ctypes = c_void_p\n\nmtmd_image_tokens_p = NewType(\"mtmd_image_tokens_p\", int)\nmtmd_image_tokens_p_ctypes = c_void_p\n\nmtmd_input_chunk_p = NewType(\"mtmd_input_chunk_p\", int)\nmtmd_input_chunk_p_ctypes = c_void_p\n\nmtmd_input_chunks_p = NewType(\"mtmd_input_chunks_p\", int)\nmtmd_input_chunks_p_ctypes = c_void_p\n\n# Enums\nMTMD_INPUT_CHUNK_TYPE_TEXT = 0\nMTMD_INPUT_CHUNK_TYPE_IMAGE = 1\nMTMD_INPUT_CHUNK_TYPE_AUDIO = 2\n\n# Structures\nclass mtmd_context_params(Structure):\n    _fields_ = [\n        (\"use_gpu\", c_bool),\n        (\"print_timings\", c_bool),\n        (\"n_threads\", c_int),\n        
(\"verbosity\", c_int),  # ggml_log_level\n        (\"image_marker\", c_char_p),\n        (\"media_marker\", c_char_p),\n    ]\n\nclass mtmd_input_text(Structure):\n    _fields_ = [\n        (\"text\", c_char_p),\n        (\"add_special\", c_bool),\n        (\"parse_special\", c_bool),\n    ]\n\n################################################\n# mtmd.h functions\n################################################\n\n# MTMD_API const char * mtmd_default_marker(void);\n@ctypes_function(\"mtmd_default_marker\", [], c_char_p)\ndef mtmd_default_marker() -> bytes:\n    ...\n\n# MTMD_API struct mtmd_context_params mtmd_context_params_default(void);\n@ctypes_function(\"mtmd_context_params_default\", [], mtmd_context_params)\ndef mtmd_context_params_default() -> mtmd_context_params:\n    ...\n\n# MTMD_API mtmd_context * mtmd_init_from_file(const char * mmproj_fname,\n#                                             const struct llama_model * text_model,\n#                                             const struct mtmd_context_params ctx_params);\n@ctypes_function(\n    \"mtmd_init_from_file\",\n    [c_char_p, llama_cpp.llama_model_p_ctypes, mtmd_context_params],\n    mtmd_context_p_ctypes\n)\ndef mtmd_init_from_file(\n    mmproj_fname: bytes,\n    text_model: llama_cpp.llama_model_p,\n    ctx_params: mtmd_context_params,\n    /,\n) -> Optional[mtmd_context_p]:\n    ...\n\n# MTMD_API void mtmd_free(mtmd_context * ctx);\n@ctypes_function(\"mtmd_free\", [mtmd_context_p_ctypes], None)\ndef mtmd_free(ctx: mtmd_context_p, /):\n    ...\n\n# MTMD_API bool mtmd_support_vision(mtmd_context * ctx);\n@ctypes_function(\"mtmd_support_vision\", [mtmd_context_p_ctypes], c_bool)\ndef mtmd_support_vision(ctx: mtmd_context_p, /) -> bool:\n    ...\n\n# MTMD_API mtmd_bitmap * mtmd_bitmap_init(uint32_t nx, uint32_t ny, const unsigned char * data);\n@ctypes_function(\n    \"mtmd_bitmap_init\",\n    [c_uint32, c_uint32, POINTER(c_uint8)],\n    mtmd_bitmap_p_ctypes\n)\ndef mtmd_bitmap_init(\n    nx: 
Union[c_uint32, int],\n    ny: Union[c_uint32, int],\n    data: CtypesArray[c_uint8],\n    /,\n) -> Optional[mtmd_bitmap_p]:\n    ...\n\n# MTMD_API void mtmd_bitmap_free(mtmd_bitmap * bitmap);\n@ctypes_function(\"mtmd_bitmap_free\", [mtmd_bitmap_p_ctypes], None)\ndef mtmd_bitmap_free(bitmap: mtmd_bitmap_p, /):\n    ...\n\n# MTMD_API mtmd_input_chunks * mtmd_input_chunks_init(void);\n@ctypes_function(\"mtmd_input_chunks_init\", [], mtmd_input_chunks_p_ctypes)\ndef mtmd_input_chunks_init() -> Optional[mtmd_input_chunks_p]:\n    ...\n\n# MTMD_API void mtmd_input_chunks_free(mtmd_input_chunks * chunks);\n@ctypes_function(\"mtmd_input_chunks_free\", [mtmd_input_chunks_p_ctypes], None)\ndef mtmd_input_chunks_free(chunks: mtmd_input_chunks_p, /):\n    ...\n\n# MTMD_API size_t mtmd_input_chunks_size(const mtmd_input_chunks * chunks);\n@ctypes_function(\"mtmd_input_chunks_size\", [mtmd_input_chunks_p_ctypes], c_size_t)\ndef mtmd_input_chunks_size(chunks: mtmd_input_chunks_p, /) -> int:\n    ...\n\n# MTMD_API const mtmd_input_chunk * mtmd_input_chunks_get(const mtmd_input_chunks * chunks, size_t idx);\n@ctypes_function(\n    \"mtmd_input_chunks_get\",\n    [mtmd_input_chunks_p_ctypes, c_size_t],\n    mtmd_input_chunk_p_ctypes\n)\ndef mtmd_input_chunks_get(\n    chunks: mtmd_input_chunks_p, idx: Union[c_size_t, int], /\n) -> Optional[mtmd_input_chunk_p]:\n    ...\n\n# MTMD_API int32_t mtmd_tokenize(mtmd_context * ctx,\n#                                mtmd_input_chunks * output,\n#                                const mtmd_input_text * text,\n#                                const mtmd_bitmap ** bitmaps,\n#                                size_t n_bitmaps);\n@ctypes_function(\n    \"mtmd_tokenize\",\n    [\n        mtmd_context_p_ctypes,\n        mtmd_input_chunks_p_ctypes,\n        POINTER(mtmd_input_text),\n        POINTER(mtmd_bitmap_p_ctypes),\n        c_size_t,\n    ],\n    c_int,\n)\ndef mtmd_tokenize(\n    ctx: mtmd_context_p,\n    output: mtmd_input_chunks_p,\n    
text: \"_Pointer[mtmd_input_text]\",\n    bitmaps: CtypesArray[mtmd_bitmap_p_ctypes],\n    n_bitmaps: Union[c_size_t, int],\n    /,\n) -> int:\n    ...\n\n# MTMD_API size_t mtmd_input_chunk_get_n_tokens(const mtmd_input_chunk * chunk);\n@ctypes_function(\"mtmd_input_chunk_get_n_tokens\", [mtmd_input_chunk_p_ctypes], c_size_t)\ndef mtmd_input_chunk_get_n_tokens(chunk: mtmd_input_chunk_p, /) -> int:\n    ...\n\n# MTMD_API enum mtmd_input_chunk_type mtmd_input_chunk_get_type(const mtmd_input_chunk * chunk);\n@ctypes_function(\"mtmd_input_chunk_get_type\", [mtmd_input_chunk_p_ctypes], c_int)\ndef mtmd_input_chunk_get_type(chunk: mtmd_input_chunk_p, /) -> int:\n    ...\n\n# MTMD_API const llama_token * mtmd_input_chunk_get_tokens_text(const mtmd_input_chunk * chunk, size_t * n_tokens_output);\n@ctypes_function(\n    \"mtmd_input_chunk_get_tokens_text\",\n    [mtmd_input_chunk_p_ctypes, POINTER(c_size_t)],\n    POINTER(llama_cpp.llama_token)\n)\ndef mtmd_input_chunk_get_tokens_text(\n    chunk: mtmd_input_chunk_p, n_tokens_output: \"_Pointer[c_size_t]\", /\n) -> Optional[\"_Pointer[llama_cpp.llama_token]\"]:\n    ...\n\n################################################\n# mtmd-helper.h functions\n################################################\n\n# MTMD_API mtmd_bitmap * mtmd_helper_bitmap_init_from_buf(mtmd_context * ctx, const unsigned char * buf, size_t len);\n@ctypes_function(\n    \"mtmd_helper_bitmap_init_from_buf\",\n    [mtmd_context_p_ctypes, POINTER(c_uint8), c_size_t],\n    mtmd_bitmap_p_ctypes\n)\ndef mtmd_helper_bitmap_init_from_buf(\n    ctx: mtmd_context_p,\n    buf: CtypesArray[c_uint8],\n    length: Union[c_size_t, int],\n    /,\n) -> Optional[mtmd_bitmap_p]:\n    ...\n\n# MTMD_API size_t mtmd_helper_get_n_tokens(const mtmd_input_chunks * chunks);\n@ctypes_function(\"mtmd_helper_get_n_tokens\", [mtmd_input_chunks_p_ctypes], c_size_t)\ndef mtmd_helper_get_n_tokens(chunks: mtmd_input_chunks_p, /) -> int:\n    ...\n\n# MTMD_API int32_t 
mtmd_helper_eval_chunk_single(mtmd_context * ctx,\n#                                                struct llama_context * lctx,\n#                                                const mtmd_input_chunk * chunk,\n#                                                llama_pos n_past,\n#                                                llama_seq_id seq_id,\n#                                                int32_t n_batch,\n#                                                bool logits_last,\n#                                                llama_pos * new_n_past);\n@ctypes_function(\n    \"mtmd_helper_eval_chunk_single\",\n    [\n        mtmd_context_p_ctypes,\n        llama_cpp.llama_context_p_ctypes,\n        mtmd_input_chunk_p_ctypes,\n        llama_cpp.llama_pos,\n        llama_cpp.llama_seq_id,\n        c_int,\n        c_bool,\n        POINTER(llama_cpp.llama_pos),\n    ],\n    c_int,\n)\ndef mtmd_helper_eval_chunk_single(\n    ctx: mtmd_context_p,\n    lctx: llama_cpp.llama_context_p,\n    chunk: mtmd_input_chunk_p,\n    n_past: llama_cpp.llama_pos,\n    seq_id: llama_cpp.llama_seq_id,\n    n_batch: Union[c_int, int],\n    logits_last: Union[c_bool, bool],\n    new_n_past: \"_Pointer[llama_cpp.llama_pos]\",\n    /,\n) -> int:\n    ...\n"
  },
  {
    "path": "llama_cpp/py.typed",
    "content": ""
  },
  {
    "path": "llama_cpp/server/__init__.py",
    "content": ""
  },
  {
    "path": "llama_cpp/server/__main__.py",
    "content": "\"\"\"Example FastAPI server for llama.cpp.\n\nTo run this example:\n\n```bash\npip install fastapi uvicorn sse-starlette pydantic-settings\nexport MODEL=../models/7B/...\n```\n\nThen run:\n```\nuvicorn llama_cpp.server.app:create_app --reload\n```\n\nor\n\n```\npython3 -m llama_cpp.server\n```\n\nThen visit http://localhost:8000/docs to see the interactive API docs.\n\n\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nimport sys\nimport argparse\n\nimport uvicorn\n\nfrom llama_cpp.server.app import create_app\nfrom llama_cpp.server.settings import (\n    Settings,\n    ServerSettings,\n    ModelSettings,\n    ConfigFileSettings,\n)\nfrom llama_cpp.server.cli import add_args_from_model, parse_model_from_args\n\n\ndef main():\n    description = \"🦙 Llama.cpp python server. Host your own LLMs!🚀\"\n    parser = argparse.ArgumentParser(description=description)\n\n    add_args_from_model(parser, Settings)\n    parser.add_argument(\n        \"--config_file\",\n        type=str,\n        help=\"Path to a config file to load.\",\n    )\n    server_settings: ServerSettings | None = None\n    model_settings: list[ModelSettings] = []\n    args = parser.parse_args()\n    try:\n        # Load server settings from config_file if provided\n        config_file = os.environ.get(\"CONFIG_FILE\", args.config_file)\n        if config_file:\n            if not os.path.exists(config_file):\n                raise ValueError(f\"Config file {config_file} not found!\")\n            with open(config_file, \"rb\") as f:\n                # Check if yaml file\n                if config_file.endswith(\".yaml\") or config_file.endswith(\".yml\"):\n                    import yaml\n                    import json\n\n                    config_file_settings = ConfigFileSettings.model_validate_json(\n                        json.dumps(yaml.safe_load(f))\n                    )\n                else:\n                    config_file_settings = 
ConfigFileSettings.model_validate_json(\n                        f.read()\n                    )\n                server_settings = ServerSettings.model_validate(config_file_settings)\n                model_settings = config_file_settings.models\n        else:\n            server_settings = parse_model_from_args(ServerSettings, args)\n            model_settings = [parse_model_from_args(ModelSettings, args)]\n    except Exception as e:\n        print(e, file=sys.stderr)\n        parser.print_help()\n        sys.exit(1)\n    assert server_settings is not None\n    assert model_settings is not None\n    app = create_app(\n        server_settings=server_settings,\n        model_settings=model_settings,\n    )\n    uvicorn.run(\n        app,\n        host=os.getenv(\"HOST\", server_settings.host),\n        port=int(os.getenv(\"PORT\", server_settings.port)),\n        ssl_keyfile=server_settings.ssl_keyfile,\n        ssl_certfile=server_settings.ssl_certfile,\n    )\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "llama_cpp/server/app.py",
    "content": "from __future__ import annotations\n\nimport os\nimport json\nimport typing\nimport contextlib\n\nfrom anyio import Lock\nfrom functools import partial\nfrom typing import List, Optional, Union, Dict\n\nimport llama_cpp\n\nimport anyio\nfrom anyio.streams.memory import MemoryObjectSendStream\nfrom starlette.concurrency import run_in_threadpool, iterate_in_threadpool\nfrom fastapi import Depends, FastAPI, APIRouter, Request, HTTPException, status, Body\nfrom fastapi.middleware import Middleware\nfrom fastapi.middleware.cors import CORSMiddleware\nfrom fastapi.security import HTTPBearer\nfrom sse_starlette.sse import EventSourceResponse\nfrom starlette_context.plugins import RequestIdPlugin  # type: ignore\nfrom starlette_context.middleware import RawContextMiddleware\n\nfrom llama_cpp.server.model import (\n    LlamaProxy,\n)\nfrom llama_cpp.server.settings import (\n    ConfigFileSettings,\n    Settings,\n    ModelSettings,\n    ServerSettings,\n)\nfrom llama_cpp.server.types import (\n    CreateCompletionRequest,\n    CreateEmbeddingRequest,\n    CreateChatCompletionRequest,\n    ModelList,\n    TokenizeInputRequest,\n    TokenizeInputResponse,\n    TokenizeInputCountResponse,\n    DetokenizeInputRequest,\n    DetokenizeInputResponse,\n)\nfrom llama_cpp.server.errors import RouteErrorHandler\n\n\nrouter = APIRouter(route_class=RouteErrorHandler)\n\n_server_settings: Optional[ServerSettings] = None\n\n\ndef set_server_settings(server_settings: ServerSettings):\n    global _server_settings\n    _server_settings = server_settings\n\n\ndef get_server_settings():\n    yield _server_settings\n\n\n_llama_proxy: Optional[LlamaProxy] = None\n\nllama_outer_lock = Lock()\nllama_inner_lock = Lock()\n\n\ndef set_llama_proxy(model_settings: List[ModelSettings]):\n    global _llama_proxy\n    _llama_proxy = LlamaProxy(models=model_settings)\n\n\nasync def get_llama_proxy():\n    # NOTE: This double lock allows the currently streaming llama model to\n    # check 
if any other requests are pending in the same thread and cancel\n    # the stream if so.\n    await llama_outer_lock.acquire()\n    release_outer_lock = True\n    try:\n        await llama_inner_lock.acquire()\n        try:\n            llama_outer_lock.release()\n            release_outer_lock = False\n            yield _llama_proxy\n        finally:\n            llama_inner_lock.release()\n    finally:\n        if release_outer_lock:\n            llama_outer_lock.release()\n\n\n_ping_message_factory: typing.Optional[typing.Callable[[], bytes]] = None\n\n\ndef set_ping_message_factory(factory: typing.Callable[[], bytes]):\n    global _ping_message_factory\n    _ping_message_factory = factory\n\n\ndef create_app(\n    settings: Settings | None = None,\n    server_settings: ServerSettings | None = None,\n    model_settings: List[ModelSettings] | None = None,\n):\n    config_file = os.environ.get(\"CONFIG_FILE\", None)\n    if config_file is not None:\n        if not os.path.exists(config_file):\n            raise ValueError(f\"Config file {config_file} not found!\")\n        with open(config_file, \"rb\") as f:\n            # Check if yaml file\n            if config_file.endswith(\".yaml\") or config_file.endswith(\".yml\"):\n                import yaml\n\n                config_file_settings = ConfigFileSettings.model_validate_json(\n                    json.dumps(yaml.safe_load(f))\n                )\n            else:\n                config_file_settings = ConfigFileSettings.model_validate_json(f.read())\n            server_settings = ServerSettings.model_validate(config_file_settings)\n            model_settings = config_file_settings.models\n\n    if server_settings is None and model_settings is None:\n        if settings is None:\n            settings = Settings()\n        server_settings = ServerSettings.model_validate(settings)\n        model_settings = [ModelSettings.model_validate(settings)]\n\n    assert (\n        server_settings is not None and 
model_settings is not None\n    ), \"server_settings and model_settings must be provided together\"\n\n    set_server_settings(server_settings)\n    middleware = [Middleware(RawContextMiddleware, plugins=(RequestIdPlugin(),))]\n    app = FastAPI(\n        middleware=middleware,\n        title=\"🦙 llama.cpp Python API\",\n        version=llama_cpp.__version__,\n        root_path=server_settings.root_path,\n    )\n    app.add_middleware(\n        CORSMiddleware,\n        allow_origins=[\"*\"],\n        allow_credentials=True,\n        allow_methods=[\"*\"],\n        allow_headers=[\"*\"],\n    )\n    app.include_router(router)\n\n    assert model_settings is not None\n    set_llama_proxy(model_settings=model_settings)\n\n    if server_settings.disable_ping_events:\n        set_ping_message_factory(lambda: bytes())\n\n    return app\n\n\ndef prepare_request_resources(\n    body: CreateCompletionRequest | CreateChatCompletionRequest,\n    llama_proxy: LlamaProxy,\n    body_model: str | None,\n    kwargs,\n) -> llama_cpp.Llama:\n    if llama_proxy is None:\n        raise HTTPException(\n            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,\n            detail=\"Service is not available\",\n        )\n    llama = llama_proxy(body_model)\n    if body.logit_bias is not None:\n        kwargs[\"logit_bias\"] = (\n            _logit_bias_tokens_to_input_ids(llama, body.logit_bias)\n            if body.logit_bias_type == \"tokens\"\n            else body.logit_bias\n        )\n\n    if body.grammar is not None:\n        kwargs[\"grammar\"] = llama_cpp.LlamaGrammar.from_string(body.grammar)\n\n    if body.min_tokens > 0:\n        _min_tokens_logits_processor = llama_cpp.LogitsProcessorList(\n            [llama_cpp.MinTokensLogitsProcessor(body.min_tokens, llama.token_eos())]\n        )\n        if \"logits_processor\" not in kwargs:\n            kwargs[\"logits_processor\"] = _min_tokens_logits_processor\n        else:\n            
kwargs[\"logits_processor\"].extend(_min_tokens_logits_processor)\n    return llama\n\n\nasync def get_event_publisher(\n    request: Request,\n    inner_send_chan: MemoryObjectSendStream[typing.Any],\n    body: CreateCompletionRequest | CreateChatCompletionRequest,\n    body_model: str | None,\n    llama_call,\n    kwargs,\n):\n    server_settings = next(get_server_settings())\n    interrupt_requests = (\n        server_settings.interrupt_requests if server_settings else False\n    )\n    async with contextlib.asynccontextmanager(get_llama_proxy)() as llama_proxy:\n        llama = prepare_request_resources(body, llama_proxy, body_model, kwargs)\n        async with inner_send_chan:\n            try:\n                iterator = await run_in_threadpool(llama_call, llama, **kwargs)\n                async for chunk in iterate_in_threadpool(iterator):\n                    await inner_send_chan.send(dict(data=json.dumps(chunk)))\n                    if await request.is_disconnected():\n                        raise anyio.get_cancelled_exc_class()()\n                    if interrupt_requests and llama_outer_lock.locked():\n                        await inner_send_chan.send(dict(data=\"[DONE]\"))\n                        raise anyio.get_cancelled_exc_class()()\n                await inner_send_chan.send(dict(data=\"[DONE]\"))\n            except anyio.get_cancelled_exc_class() as e:\n                print(\"disconnected\")\n                with anyio.move_on_after(1, shield=True):\n                    print(\n                        f\"Disconnected from client (via refresh/close) {request.client}\"\n                    )\n                    raise e\n\n\ndef _logit_bias_tokens_to_input_ids(\n    llama: llama_cpp.Llama,\n    logit_bias: Dict[str, float],\n) -> Dict[str, float]:\n    to_bias: Dict[str, float] = {}\n    for token, score in logit_bias.items():\n        token = token.encode(\"utf-8\")\n        for input_id in llama.tokenize(token, add_bos=False, 
special=True):\n            to_bias[str(input_id)] = score\n    return to_bias\n\n\n# Setup Bearer authentication scheme\nbearer_scheme = HTTPBearer(auto_error=False)\n\n\nasync def authenticate(\n    settings: ServerSettings = Depends(get_server_settings),\n    authorization: Optional[str] = Depends(bearer_scheme),\n):\n    # Skip API key check if it's not set in settings\n    if settings.api_key is None:\n        return True\n\n    # check bearer credentials against the api_key\n    if authorization and authorization.credentials == settings.api_key:\n        # api key is valid\n        return authorization.credentials\n\n    # raise http error 401\n    raise HTTPException(\n        status_code=status.HTTP_401_UNAUTHORIZED,\n        detail=\"Invalid API key\",\n    )\n\n\nopenai_v1_tag = \"OpenAI V1\"\n\n\n@router.post(\n    \"/v1/completions\",\n    summary=\"Completion\",\n    dependencies=[Depends(authenticate)],\n    response_model=Union[\n        llama_cpp.CreateCompletionResponse,\n        str,\n    ],\n    responses={\n        \"200\": {\n            \"description\": \"Successful Response\",\n            \"content\": {\n                \"application/json\": {\n                    \"schema\": {\n                        \"anyOf\": [\n                            {\"$ref\": \"#/components/schemas/CreateCompletionResponse\"}\n                        ],\n                        \"title\": \"Completion response, when stream=False\",\n                    }\n                },\n                \"text/event-stream\": {\n                    \"schema\": {\n                        \"type\": \"string\",\n                        \"title\": \"Server Side Streaming response, when stream=True. \"\n                        + \"See SSE format: https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#Event_stream_format\",  # noqa: E501\n                        \"example\": \"\"\"data: {... see CreateCompletionResponse ...} \\\\n\\\\n data: ...
\\\\n\\\\n ... data: [DONE]\"\"\",\n                    }\n                },\n            },\n        }\n    },\n    tags=[openai_v1_tag],\n)\n@router.post(\n    \"/v1/engines/copilot-codex/completions\",\n    include_in_schema=False,\n    dependencies=[Depends(authenticate)],\n    tags=[openai_v1_tag],\n)\nasync def create_completion(\n    request: Request,\n    body: CreateCompletionRequest,\n) -> llama_cpp.Completion:\n    if isinstance(body.prompt, list):\n        assert len(body.prompt) <= 1\n        body.prompt = body.prompt[0] if len(body.prompt) > 0 else \"\"\n\n    body_model = (\n        body.model\n        if request.url.path != \"/v1/engines/copilot-codex/completions\"\n        else \"copilot-codex\"\n    )\n\n    exclude = {\n        \"n\",\n        \"best_of\",\n        \"logit_bias_type\",\n        \"user\",\n        \"min_tokens\",\n    }\n    kwargs = body.model_dump(exclude=exclude)\n\n    # handle streaming request\n    if kwargs.get(\"stream\", False):\n        send_chan, recv_chan = anyio.create_memory_object_stream(10)\n        return EventSourceResponse(\n            recv_chan,\n            data_sender_callable=partial(  # type: ignore\n                get_event_publisher,\n                request=request,\n                inner_send_chan=send_chan,\n                body=body,\n                body_model=body_model,\n                llama_call=llama_cpp.Llama.__call__,\n                kwargs=kwargs,\n            ),\n            sep=\"\\n\",\n            ping_message_factory=_ping_message_factory,\n        )\n\n    # handle regular request\n    async with contextlib.asynccontextmanager(get_llama_proxy)() as llama_proxy:\n        llama = prepare_request_resources(body, llama_proxy, body_model, kwargs)\n\n        if await request.is_disconnected():\n            print(\n                f\"Disconnected from client (via refresh/close) before llm invoked {request.client}\"\n            )\n            raise HTTPException(\n                
status_code=status.HTTP_400_BAD_REQUEST,\n                detail=\"Client closed request\",\n            )\n\n        return await run_in_threadpool(llama, **kwargs)\n\n\n@router.post(\n    \"/v1/embeddings\",\n    summary=\"Embedding\",\n    dependencies=[Depends(authenticate)],\n    tags=[openai_v1_tag],\n)\nasync def create_embedding(\n    request: CreateEmbeddingRequest,\n    llama_proxy: LlamaProxy = Depends(get_llama_proxy),\n):\n    return await run_in_threadpool(\n        llama_proxy(request.model).create_embedding,\n        **request.model_dump(exclude={\"user\"}),\n    )\n\n\n@router.post(\n    \"/v1/chat/completions\",\n    summary=\"Chat\",\n    dependencies=[Depends(authenticate)],\n    response_model=Union[llama_cpp.ChatCompletion, str],\n    responses={\n        \"200\": {\n            \"description\": \"Successful Response\",\n            \"content\": {\n                \"application/json\": {\n                    \"schema\": {\n                        \"anyOf\": [\n                            {\n                                \"$ref\": \"#/components/schemas/CreateChatCompletionResponse\"\n                            }\n                        ],\n                        \"title\": \"Completion response, when stream=False\",\n                    }\n                },\n                \"text/event-stream\": {\n                    \"schema\": {\n                        \"type\": \"string\",\n                        \"title\": \"Server Side Streaming response, when stream=True. \"\n                        + \"See SSE format: https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#Event_stream_format\",  # noqa: E501\n                        \"example\": \"\"\"data: {... see CreateChatCompletionResponse ...} \\\\n\\\\n data: ... \\\\n\\\\n ...
data: [DONE]\"\"\",\n                    }\n                },\n            },\n        }\n    },\n    tags=[openai_v1_tag],\n)\nasync def create_chat_completion(\n    request: Request,\n    body: CreateChatCompletionRequest = Body(\n        openapi_examples={\n            \"normal\": {\n                \"summary\": \"Chat Completion\",\n                \"value\": {\n                    \"model\": \"gpt-3.5-turbo\",\n                    \"messages\": [\n                        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n                        {\"role\": \"user\", \"content\": \"What is the capital of France?\"},\n                    ],\n                },\n            },\n            \"json_mode\": {\n                \"summary\": \"JSON Mode\",\n                \"value\": {\n                    \"model\": \"gpt-3.5-turbo\",\n                    \"messages\": [\n                        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n                        {\"role\": \"user\", \"content\": \"Who won the world series in 2020\"},\n                    ],\n                    \"response_format\": {\"type\": \"json_object\"},\n                },\n            },\n            \"tool_calling\": {\n                \"summary\": \"Tool Calling\",\n                \"value\": {\n                    \"model\": \"gpt-3.5-turbo\",\n                    \"messages\": [\n                        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n                        {\"role\": \"user\", \"content\": \"Extract Jason is 30 years old.\"},\n                    ],\n                    \"tools\": [\n                        {\n                            \"type\": \"function\",\n                            \"function\": {\n                                \"name\": \"User\",\n                                \"description\": \"User record\",\n                                \"parameters\": {\n                           
         \"type\": \"object\",\n                                    \"properties\": {\n                                        \"name\": {\"type\": \"string\"},\n                                        \"age\": {\"type\": \"number\"},\n                                    },\n                                    \"required\": [\"name\", \"age\"],\n                                },\n                            },\n                        }\n                    ],\n                    \"tool_choice\": {\n                        \"type\": \"function\",\n                        \"function\": {\n                            \"name\": \"User\",\n                        },\n                    },\n                },\n            },\n            \"logprobs\": {\n                \"summary\": \"Logprobs\",\n                \"value\": {\n                    \"model\": \"gpt-3.5-turbo\",\n                    \"messages\": [\n                        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n                        {\"role\": \"user\", \"content\": \"What is the capital of France?\"},\n                    ],\n                    \"logprobs\": True,\n                    \"top_logprobs\": 10,\n                },\n            },\n        }\n    ),\n) -> llama_cpp.ChatCompletion:\n    # This is a workaround for an issue in FastAPI dependencies\n    # where the dependency is cleaned up before a StreamingResponse\n    # is complete.\n    # https://github.com/tiangolo/fastapi/issues/11143\n\n    body_model = body.model\n    exclude = {\n        \"n\",\n        \"logit_bias_type\",\n        \"user\",\n        \"min_tokens\",\n    }\n    kwargs = body.model_dump(exclude=exclude)\n\n    # handle streaming request\n    if kwargs.get(\"stream\", False):\n        send_chan, recv_chan = anyio.create_memory_object_stream(10)\n        return EventSourceResponse(\n            recv_chan,\n            data_sender_callable=partial(  # type: ignore\n                
get_event_publisher,\n                request=request,\n                inner_send_chan=send_chan,\n                body=body,\n                body_model=body_model,\n                llama_call=llama_cpp.Llama.create_chat_completion,\n                kwargs=kwargs,\n            ),\n            sep=\"\\n\",\n            ping_message_factory=_ping_message_factory,\n        )\n\n    # handle regular request\n    async with contextlib.asynccontextmanager(get_llama_proxy)() as llama_proxy:\n        llama = prepare_request_resources(body, llama_proxy, body_model, kwargs)\n\n        if await request.is_disconnected():\n            print(\n                f\"Disconnected from client (via refresh/close) before llm invoked {request.client}\"\n            )\n            raise HTTPException(\n                status_code=status.HTTP_400_BAD_REQUEST,\n                detail=\"Client closed request\",\n            )\n\n        return await run_in_threadpool(llama.create_chat_completion, **kwargs)\n\n\n@router.get(\n    \"/v1/models\",\n    summary=\"Models\",\n    dependencies=[Depends(authenticate)],\n    tags=[openai_v1_tag],\n)\nasync def get_models(\n    llama_proxy: LlamaProxy = Depends(get_llama_proxy),\n) -> ModelList:\n    return {\n        \"object\": \"list\",\n        \"data\": [\n            {\n                \"id\": model_alias,\n                \"object\": \"model\",\n                \"owned_by\": \"me\",\n                \"permissions\": [],\n            }\n            for model_alias in llama_proxy\n        ],\n    }\n\n\nextras_tag = \"Extras\"\n\n\n@router.post(\n    \"/extras/tokenize\",\n    summary=\"Tokenize\",\n    dependencies=[Depends(authenticate)],\n    tags=[extras_tag],\n)\nasync def tokenize(\n    body: TokenizeInputRequest,\n    llama_proxy: LlamaProxy = Depends(get_llama_proxy),\n) -> TokenizeInputResponse:\n    tokens = llama_proxy(body.model).tokenize(body.input.encode(\"utf-8\"), special=True)\n\n    return 
TokenizeInputResponse(tokens=tokens)\n\n\n@router.post(\n    \"/extras/tokenize/count\",\n    summary=\"Tokenize Count\",\n    dependencies=[Depends(authenticate)],\n    tags=[extras_tag],\n)\nasync def count_query_tokens(\n    body: TokenizeInputRequest,\n    llama_proxy: LlamaProxy = Depends(get_llama_proxy),\n) -> TokenizeInputCountResponse:\n    tokens = llama_proxy(body.model).tokenize(body.input.encode(\"utf-8\"), special=True)\n\n    return TokenizeInputCountResponse(count=len(tokens))\n\n\n@router.post(\n    \"/extras/detokenize\",\n    summary=\"Detokenize\",\n    dependencies=[Depends(authenticate)],\n    tags=[extras_tag],\n)\nasync def detokenize(\n    body: DetokenizeInputRequest,\n    llama_proxy: LlamaProxy = Depends(get_llama_proxy),\n) -> DetokenizeInputResponse:\n    text = llama_proxy(body.model).detokenize(body.tokens).decode(\"utf-8\")\n\n    return DetokenizeInputResponse(text=text)\n"
  },
  {
    "path": "llama_cpp/server/cli.py",
    "content": "from __future__ import annotations\n\nimport argparse\n\nfrom typing import List, Literal, Union, Any, Type, TypeVar\n\nfrom pydantic import BaseModel\n\n\ndef _get_base_type(annotation: Type[Any]) -> Type[Any]:\n    if getattr(annotation, \"__origin__\", None) is Literal:\n        assert hasattr(annotation, \"__args__\") and len(annotation.__args__) >= 1  # type: ignore\n        return type(annotation.__args__[0])  # type: ignore\n    elif getattr(annotation, \"__origin__\", None) is Union:\n        assert hasattr(annotation, \"__args__\") and len(annotation.__args__) >= 1  # type: ignore\n        non_optional_args: List[Type[Any]] = [\n            arg for arg in annotation.__args__ if arg is not type(None)  # type: ignore\n        ]\n        if non_optional_args:\n            return _get_base_type(non_optional_args[0])\n    elif (\n        getattr(annotation, \"__origin__\", None) is list\n        or getattr(annotation, \"__origin__\", None) is List\n    ):\n        assert hasattr(annotation, \"__args__\") and len(annotation.__args__) >= 1  # type: ignore\n        return _get_base_type(annotation.__args__[0])  # type: ignore\n    return annotation\n\n\ndef _contains_list_type(annotation: Type[Any] | None) -> bool:\n    origin = getattr(annotation, \"__origin__\", None)\n\n    if origin is list or origin is List:\n        return True\n    elif origin in (Literal, Union):\n        return any(_contains_list_type(arg) for arg in annotation.__args__)  # type: ignore\n    else:\n        return False\n\n\ndef _parse_bool_arg(arg: str | bytes | bool) -> bool:\n    if isinstance(arg, bytes):\n        arg = arg.decode(\"utf-8\")\n\n    true_values = {\"1\", \"on\", \"t\", \"true\", \"y\", \"yes\"}\n    false_values = {\"0\", \"off\", \"f\", \"false\", \"n\", \"no\"}\n\n    arg_str = str(arg).lower().strip()\n\n    if arg_str in true_values:\n        return True\n    elif arg_str in false_values:\n        return False\n    else:\n        raise 
ValueError(f\"Invalid boolean argument: {arg}\")\n\n\ndef add_args_from_model(parser: argparse.ArgumentParser, model: Type[BaseModel]):\n    \"\"\"Add arguments from a pydantic model to an argparse parser.\"\"\"\n\n    for name, field in model.model_fields.items():\n        description = field.description\n        if field.default and description and not field.is_required():\n            description += f\" (default: {field.default})\"\n        base_type = (\n            _get_base_type(field.annotation) if field.annotation is not None else str\n        )\n        list_type = _contains_list_type(field.annotation)\n        if base_type is not bool:\n            parser.add_argument(\n                f\"--{name}\",\n                dest=name,\n                nargs=\"*\" if list_type else None,\n                type=base_type,\n                help=description,\n            )\n        else:\n            # Booleans are parsed from strings such as \"true\"/\"false\" or \"1\"/\"0\".\n            parser.add_argument(\n                f\"--{name}\",\n                dest=name,\n                type=_parse_bool_arg,\n                help=description,\n            )\n\n\nT = TypeVar(\"T\", bound=Type[BaseModel])\n\n\ndef parse_model_from_args(model: T, args: argparse.Namespace) -> T:\n    \"\"\"Parse a pydantic model from an argparse namespace.\"\"\"\n    return model(\n        **{\n            k: v\n            for k, v in vars(args).items()\n            if v is not None and k in model.model_fields\n        }\n    )\n"
  },
  {
    "path": "llama_cpp/server/errors.py",
    "content": "from __future__ import annotations\n\nimport sys\nimport traceback\nimport time\nfrom re import compile, Match, Pattern\nfrom typing import Callable, Coroutine, Optional, Tuple, Union, Dict\nfrom typing_extensions import TypedDict\n\n\nfrom fastapi import (\n    Request,\n    Response,\n    HTTPException,\n)\nfrom fastapi.responses import JSONResponse\nfrom fastapi.routing import APIRoute\n\nfrom llama_cpp.server.types import (\n    CreateCompletionRequest,\n    CreateEmbeddingRequest,\n    CreateChatCompletionRequest,\n)\n\n\nclass ErrorResponse(TypedDict):\n    \"\"\"OpenAI style error response\"\"\"\n\n    message: str\n    type: str\n    param: Optional[str]\n    code: Optional[str]\n\n\nclass ErrorResponseFormatters:\n    \"\"\"Collection of formatters for error responses.\n\n    Args:\n        request (Union[CreateCompletionRequest, CreateChatCompletionRequest]):\n            Request body\n        match (Match[str]): Match object from regex pattern\n\n    Returns:\n        Tuple[int, ErrorResponse]: Status code and error response\n    \"\"\"\n\n    @staticmethod\n    def context_length_exceeded(\n        request: Union[\"CreateCompletionRequest\", \"CreateChatCompletionRequest\"],\n        match,  # type: Match[str] # type: ignore\n    ) -> Tuple[int, ErrorResponse]:\n        \"\"\"Formatter for context length exceeded error\"\"\"\n\n        context_window = int(match.group(2))\n        prompt_tokens = int(match.group(1))\n        completion_tokens = request.max_tokens\n        if hasattr(request, \"messages\"):\n            # Chat completion\n            message = (\n                \"This model's maximum context length is {} tokens. \"\n                \"However, you requested {} tokens \"\n                \"({} in the messages, {} in the completion). 
\"\n                \"Please reduce the length of the messages or completion.\"\n            )\n        else:\n            # Text completion\n            message = (\n                \"This model's maximum context length is {} tokens, \"\n                \"however you requested {} tokens \"\n                \"({} in your prompt; {} for the completion). \"\n                \"Please reduce your prompt; or completion length.\"\n            )\n        return 400, ErrorResponse(\n            message=message.format(\n                context_window,\n                (completion_tokens or 0) + prompt_tokens,\n                prompt_tokens,\n                completion_tokens,\n            ),  # type: ignore\n            type=\"invalid_request_error\",\n            param=\"messages\",\n            code=\"context_length_exceeded\",\n        )\n\n    @staticmethod\n    def model_not_found(\n        request: Union[\"CreateCompletionRequest\", \"CreateChatCompletionRequest\"],\n        match,  # type: Match[str] # type: ignore\n    ) -> Tuple[int, ErrorResponse]:\n        \"\"\"Formatter for model_not_found error\"\"\"\n\n        model_path = str(match.group(1))\n        message = f\"The model `{model_path}` does not exist\"\n        return 400, ErrorResponse(\n            message=message,\n            type=\"invalid_request_error\",\n            param=None,\n            code=\"model_not_found\",\n        )\n\n\nclass RouteErrorHandler(APIRoute):\n    \"\"\"Custom APIRoute that handles application errors and exceptions\"\"\"\n\n    # key: regex pattern for original error message from llama_cpp\n    # value: formatter function\n    pattern_and_formatters: Dict[\n        \"Pattern[str]\",\n        Callable[\n            [\n                Union[\"CreateCompletionRequest\", \"CreateChatCompletionRequest\"],\n                \"Match[str]\",\n            ],\n            Tuple[int, ErrorResponse],\n        ],\n    ] = {\n        compile(\n            r\"Requested tokens \\((\\d+)\\) 
exceed context window of (\\d+)\"\n        ): ErrorResponseFormatters.context_length_exceeded,\n        compile(\n            r\"Model path does not exist: (.+)\"\n        ): ErrorResponseFormatters.model_not_found,\n    }\n\n    def error_message_wrapper(\n        self,\n        error: Exception,\n        body: Optional[\n            Union[\n                \"CreateChatCompletionRequest\",\n                \"CreateCompletionRequest\",\n                \"CreateEmbeddingRequest\",\n            ]\n        ] = None,\n    ) -> Tuple[int, ErrorResponse]:\n        \"\"\"Wraps error message in OpenAI style error response\"\"\"\n        if body is not None and isinstance(\n            body,\n            (\n                CreateCompletionRequest,\n                CreateChatCompletionRequest,\n            ),\n        ):\n            # When text completion or chat completion\n            for pattern, callback in self.pattern_and_formatters.items():\n                match = pattern.search(str(error))\n                if match is not None:\n                    return callback(body, match)\n\n        # Only print the trace on unexpected exceptions\n        print(f\"Exception: {str(error)}\", file=sys.stderr)\n        traceback.print_exc(file=sys.stderr)\n\n        # Wrap other errors as internal server error\n        return 500, ErrorResponse(\n            message=str(error),\n            type=\"internal_server_error\",\n            param=None,\n            code=None,\n        )\n\n    def get_route_handler(\n        self,\n    ) -> Callable[[Request], Coroutine[None, None, Response]]:\n        \"\"\"Defines custom route handler that catches exceptions and formats\n        in OpenAI style error response\"\"\"\n\n        original_route_handler = super().get_route_handler()\n\n        async def custom_route_handler(request: Request) -> Response:\n            try:\n                start_sec = time.perf_counter()\n                response = await original_route_handler(request)\n   
             elapsed_time_ms = int((time.perf_counter() - start_sec) * 1000)\n                response.headers[\"openai-processing-ms\"] = f\"{elapsed_time_ms}\"\n                return response\n            except HTTPException as unauthorized:\n                # api key check failed\n                raise unauthorized\n            except Exception as exc:\n                json_body = await request.json()\n                try:\n                    if \"messages\" in json_body:\n                        # Chat completion\n                        body: Optional[\n                            Union[\n                                CreateChatCompletionRequest,\n                                CreateCompletionRequest,\n                                CreateEmbeddingRequest,\n                            ]\n                        ] = CreateChatCompletionRequest(**json_body)\n                    elif \"prompt\" in json_body:\n                        # Text completion\n                        body = CreateCompletionRequest(**json_body)\n                    else:\n                        # Embedding\n                        body = CreateEmbeddingRequest(**json_body)\n                except Exception:\n                    # Invalid request body\n                    body = None\n\n                # Get proper error message from the exception\n                (\n                    status_code,\n                    error_message,\n                ) = self.error_message_wrapper(error=exc, body=body)\n                return JSONResponse(\n                    {\"error\": error_message},\n                    status_code=status_code,\n                )\n\n        return custom_route_handler\n"
  },
  {
    "path": "llama_cpp/server/model.py",
    "content": "from __future__ import annotations\n\nimport json\n\nfrom typing import Dict, Optional, Union, List\n\nimport llama_cpp\nimport llama_cpp.llama_speculative as llama_speculative\nimport llama_cpp.llama_tokenizer as llama_tokenizer\n\nfrom llama_cpp.server.settings import ModelSettings\n\n\nclass LlamaProxy:\n    def __init__(self, models: List[ModelSettings]) -> None:\n        assert len(models) > 0, \"No models provided!\"\n\n        self._model_settings_dict: dict[str, ModelSettings] = {}\n        for model in models:\n            if not model.model_alias:\n                model.model_alias = model.model\n            self._model_settings_dict[model.model_alias] = model\n\n        self._current_model: Optional[llama_cpp.Llama] = None\n        self._current_model_alias: Optional[str] = None\n\n        self._default_model_settings: ModelSettings = models[0]\n        self._default_model_alias: str = self._default_model_settings.model_alias  # type: ignore\n\n        # Load default model\n        self._current_model = self.load_llama_from_model_settings(\n            self._default_model_settings\n        )\n        self._current_model_alias = self._default_model_alias\n\n    def __call__(self, model: Optional[str] = None) -> llama_cpp.Llama:\n        if model is None:\n            model = self._default_model_alias\n\n        if model not in self._model_settings_dict:\n            model = self._default_model_alias\n\n        if model == self._current_model_alias:\n            if self._current_model is not None:\n                return self._current_model\n\n        if self._current_model:\n            self._current_model.close()\n        self._current_model = None\n\n        settings = self._model_settings_dict[model]\n        self._current_model = self.load_llama_from_model_settings(settings)\n        self._current_model_alias = model\n        return self._current_model\n\n    def __getitem__(self, model: str):\n        return 
self._model_settings_dict[model].model_dump()\n\n    def __setitem__(self, model: str, settings: Union[ModelSettings, str, bytes]):\n        if isinstance(settings, (bytes, str)):\n            settings = ModelSettings.model_validate_json(settings)\n        self._model_settings_dict[model] = settings\n\n    def __iter__(self):\n        for model in self._model_settings_dict:\n            yield model\n\n    def free(self):\n        if self._current_model:\n            self._current_model.close()\n            del self._current_model\n\n    @staticmethod\n    def load_llama_from_model_settings(settings: ModelSettings) -> llama_cpp.Llama:\n        chat_handler = None\n        if settings.chat_format == \"llava-1-5\":\n            assert settings.clip_model_path is not None, \"clip model not found\"\n            if settings.hf_model_repo_id is not None:\n                chat_handler = (\n                    llama_cpp.llama_chat_format.Llava15ChatHandler.from_pretrained(\n                        repo_id=settings.hf_model_repo_id,\n                        filename=settings.clip_model_path,\n                        verbose=settings.verbose,\n                    )\n                )\n            else:\n                chat_handler = llama_cpp.llama_chat_format.Llava15ChatHandler(\n                    clip_model_path=settings.clip_model_path, verbose=settings.verbose\n                )\n        elif settings.chat_format == \"obsidian\":\n            assert settings.clip_model_path is not None, \"clip model not found\"\n            if settings.hf_model_repo_id is not None:\n                chat_handler = (\n                    llama_cpp.llama_chat_format.ObsidianChatHandler.from_pretrained(\n                        repo_id=settings.hf_model_repo_id,\n                        filename=settings.clip_model_path,\n                        verbose=settings.verbose,\n                    )\n                )\n            else:\n                chat_handler = 
llama_cpp.llama_chat_format.ObsidianChatHandler(\n                    clip_model_path=settings.clip_model_path, verbose=settings.verbose\n                )\n        elif settings.chat_format == \"llava-1-6\":\n            assert settings.clip_model_path is not None, \"clip model not found\"\n            if settings.hf_model_repo_id is not None:\n                chat_handler = (\n                    llama_cpp.llama_chat_format.Llava16ChatHandler.from_pretrained(\n                        repo_id=settings.hf_model_repo_id,\n                        filename=settings.clip_model_path,\n                        verbose=settings.verbose,\n                    )\n                )\n            else:\n                chat_handler = llama_cpp.llama_chat_format.Llava16ChatHandler(\n                    clip_model_path=settings.clip_model_path, verbose=settings.verbose\n                )\n        elif settings.chat_format == \"moondream\":\n            assert settings.clip_model_path is not None, \"clip model not found\"\n            if settings.hf_model_repo_id is not None:\n                chat_handler = (\n                    llama_cpp.llama_chat_format.MoondreamChatHandler.from_pretrained(\n                        repo_id=settings.hf_model_repo_id,\n                        filename=settings.clip_model_path,\n                        verbose=settings.verbose,\n                    )\n                )\n            else:\n                chat_handler = llama_cpp.llama_chat_format.MoondreamChatHandler(\n                    clip_model_path=settings.clip_model_path, verbose=settings.verbose\n                )\n        elif settings.chat_format == \"nanollava\":\n            assert settings.clip_model_path is not None, \"clip model not found\"\n            if settings.hf_model_repo_id is not None:\n                chat_handler = (\n                    llama_cpp.llama_chat_format.NanoLlavaChatHandler.from_pretrained(\n                        repo_id=settings.hf_model_repo_id,\n         
               filename=settings.clip_model_path,\n                        verbose=settings.verbose,\n                    )\n                )\n            else:\n                chat_handler = llama_cpp.llama_chat_format.NanoLlavaChatHandler(\n                    clip_model_path=settings.clip_model_path, verbose=settings.verbose\n                )\n        elif settings.chat_format == \"llama-3-vision-alpha\":\n            assert settings.clip_model_path is not None, \"clip model not found\"\n            if settings.hf_model_repo_id is not None:\n                chat_handler = (\n                    llama_cpp.llama_chat_format.Llama3VisionAlpha.from_pretrained(\n                        repo_id=settings.hf_model_repo_id,\n                        filename=settings.clip_model_path,\n                        verbose=settings.verbose,\n                    )\n                )\n            else:\n                chat_handler = llama_cpp.llama_chat_format.Llama3VisionAlpha(\n                    clip_model_path=settings.clip_model_path, verbose=settings.verbose\n                )\n        elif settings.chat_format == \"minicpm-v-2.6\":\n            assert settings.clip_model_path is not None, \"clip model not found\"\n            if settings.hf_model_repo_id is not None:\n                chat_handler = (\n                    llama_cpp.llama_chat_format.MiniCPMv26ChatHandler.from_pretrained(\n                        repo_id=settings.hf_model_repo_id,\n                        filename=settings.clip_model_path,\n                        verbose=settings.verbose,\n                    )\n                )\n            else:\n                chat_handler = llama_cpp.llama_chat_format.MiniCPMv26ChatHandler(\n                    clip_model_path=settings.clip_model_path, verbose=settings.verbose\n                )\n        elif settings.chat_format == \"qwen2.5-vl\":\n            assert settings.clip_model_path is not None, \"clip model not found\"\n            if 
settings.hf_model_repo_id is not None:\n                chat_handler = (\n                    llama_cpp.llama_chat_format.Qwen25VLChatHandler.from_pretrained(\n                        repo_id=settings.hf_model_repo_id,\n                        filename=settings.clip_model_path,\n                        verbose=settings.verbose,\n                    )\n                )\n            else:\n                chat_handler = llama_cpp.llama_chat_format.Qwen25VLChatHandler(\n                    clip_model_path=settings.clip_model_path, verbose=settings.verbose\n                )\n        elif settings.chat_format == \"hf-autotokenizer\":\n            assert (\n                settings.hf_pretrained_model_name_or_path is not None\n            ), \"hf_pretrained_model_name_or_path must be set for hf-autotokenizer\"\n            chat_handler = (\n                llama_cpp.llama_chat_format.hf_autotokenizer_to_chat_completion_handler(\n                    settings.hf_pretrained_model_name_or_path\n                )\n            )\n        elif settings.chat_format == \"hf-tokenizer-config\":\n            assert (\n                settings.hf_tokenizer_config_path is not None\n            ), \"hf_tokenizer_config_path must be set for hf-tokenizer-config\"\n            chat_handler = llama_cpp.llama_chat_format.hf_tokenizer_config_to_chat_completion_handler(\n                json.load(open(settings.hf_tokenizer_config_path))\n            )\n\n        tokenizer: Optional[llama_cpp.BaseLlamaTokenizer] = None\n        if settings.hf_pretrained_model_name_or_path is not None:\n            tokenizer = llama_tokenizer.LlamaHFTokenizer.from_pretrained(\n                settings.hf_pretrained_model_name_or_path\n            )\n\n        draft_model = None\n        if settings.draft_model is not None:\n            draft_model = llama_speculative.LlamaPromptLookupDecoding(\n                num_pred_tokens=settings.draft_model_num_pred_tokens\n            )\n\n        kv_overrides: 
Optional[Dict[str, Union[bool, int, float, str]]] = None\n        if settings.kv_overrides is not None:\n            assert isinstance(settings.kv_overrides, list)\n            kv_overrides = {}\n            for kv in settings.kv_overrides:\n                key, value = kv.split(\"=\")\n                if \":\" in value:\n                    value_type, value = value.split(\":\")\n                    if value_type == \"bool\":\n                        kv_overrides[key] = value.lower() in [\"true\", \"1\"]\n                    elif value_type == \"int\":\n                        kv_overrides[key] = int(value)\n                    elif value_type == \"float\":\n                        kv_overrides[key] = float(value)\n                    elif value_type == \"str\":\n                        kv_overrides[key] = value\n                    else:\n                        raise ValueError(f\"Unknown value type {value_type}\")\n\n        import functools\n\n        kwargs = {}\n\n        if settings.hf_model_repo_id is not None:\n            create_fn = functools.partial(\n                llama_cpp.Llama.from_pretrained,\n                repo_id=settings.hf_model_repo_id,\n                filename=settings.model,\n            )\n        else:\n            create_fn = llama_cpp.Llama\n            kwargs[\"model_path\"] = settings.model\n\n        _model = create_fn(\n            **kwargs,\n            # Model Params\n            n_gpu_layers=settings.n_gpu_layers,\n            split_mode=settings.split_mode,\n            main_gpu=settings.main_gpu,\n            tensor_split=settings.tensor_split,\n            vocab_only=settings.vocab_only,\n            use_mmap=settings.use_mmap,\n            use_mlock=settings.use_mlock,\n            kv_overrides=kv_overrides,\n            rpc_servers=settings.rpc_servers,\n            # Context Params\n            seed=settings.seed,\n            n_ctx=settings.n_ctx,\n            n_batch=settings.n_batch,\n            
n_ubatch=settings.n_ubatch,\n            n_threads=settings.n_threads,\n            n_threads_batch=settings.n_threads_batch,\n            rope_scaling_type=settings.rope_scaling_type,\n            rope_freq_base=settings.rope_freq_base,\n            rope_freq_scale=settings.rope_freq_scale,\n            yarn_ext_factor=settings.yarn_ext_factor,\n            yarn_attn_factor=settings.yarn_attn_factor,\n            yarn_beta_fast=settings.yarn_beta_fast,\n            yarn_beta_slow=settings.yarn_beta_slow,\n            yarn_orig_ctx=settings.yarn_orig_ctx,\n            mul_mat_q=settings.mul_mat_q,\n            logits_all=settings.logits_all,\n            embedding=settings.embedding,\n            offload_kqv=settings.offload_kqv,\n            flash_attn=settings.flash_attn,\n            # Sampling Params\n            last_n_tokens_size=settings.last_n_tokens_size,\n            # LoRA Params\n            lora_base=settings.lora_base,\n            lora_path=settings.lora_path,\n            # Backend Params\n            numa=settings.numa,\n            # Chat Format Params\n            chat_format=settings.chat_format,\n            chat_handler=chat_handler,\n            # Speculative Decoding\n            draft_model=draft_model,\n            # KV Cache Quantization\n            type_k=settings.type_k,\n            type_v=settings.type_v,\n            # Tokenizer\n            tokenizer=tokenizer,\n            # Misc\n            verbose=settings.verbose,\n        )\n        if settings.cache:\n            if settings.cache_type == \"disk\":\n                if settings.verbose:\n                    print(f\"Using disk cache with size {settings.cache_size}\")\n                cache = llama_cpp.LlamaDiskCache(capacity_bytes=settings.cache_size)\n            else:\n                if settings.verbose:\n                    print(f\"Using ram cache with size {settings.cache_size}\")\n                cache = llama_cpp.LlamaRAMCache(capacity_bytes=settings.cache_size)\n     
       _model.set_cache(cache)\n        return _model\n"
  },
  {
    "path": "llama_cpp/server/settings.py",
    "content": "from __future__ import annotations\n\nimport multiprocessing\n\nfrom typing import Optional, List, Literal, Union, Dict, cast\nfrom typing_extensions import Self\n\nfrom pydantic import Field, model_validator\nfrom pydantic_settings import BaseSettings\n\nimport llama_cpp\n\n# Disable warning for model and model_alias settings\nBaseSettings.model_config[\"protected_namespaces\"] = ()\n\n\nclass ModelSettings(BaseSettings):\n    \"\"\"Model settings used to load a Llama model.\"\"\"\n\n    model: str = Field(\n        description=\"The path to the model to use for generating completions.\"\n    )\n    model_alias: Optional[str] = Field(\n        default=None,\n        description=\"The alias of the model to use for generating completions.\",\n    )\n    # Model Params\n    n_gpu_layers: int = Field(\n        default=0,\n        ge=-1,\n        description=\"The number of layers to put on the GPU. The rest will be on the CPU. Set -1 to move all to GPU.\",\n    )\n    split_mode: int = Field(\n        default=llama_cpp.LLAMA_SPLIT_MODE_LAYER,\n        description=\"The split mode to use.\",\n    )\n    main_gpu: int = Field(\n        default=0,\n        ge=0,\n        description=\"Main GPU to use.\",\n    )\n    tensor_split: Optional[List[float]] = Field(\n        default=None,\n        description=\"Split layers across multiple GPUs in proportion.\",\n    )\n    vocab_only: bool = Field(\n        default=False, description=\"Whether to only return the vocabulary.\"\n    )\n    use_mmap: bool = Field(\n        default=llama_cpp.llama_supports_mmap(),\n        description=\"Use mmap.\",\n    )\n    use_mlock: bool = Field(\n        default=llama_cpp.llama_supports_mlock(),\n        description=\"Use mlock.\",\n    )\n    kv_overrides: Optional[List[str]] = Field(\n        default=None,\n        description=\"List of model kv overrides in the format key=type:value where type is one of (bool, int, float). 
Valid true values are (true, TRUE, 1), otherwise false.\",\n    )\n    rpc_servers: Optional[str] = Field(\n        default=None,\n        description=\"Comma-separated list of RPC servers for offloading.\",\n    )\n    # Context Params\n    seed: int = Field(\n        default=llama_cpp.LLAMA_DEFAULT_SEED, description=\"Random seed. -1 for random.\"\n    )\n    n_ctx: int = Field(default=2048, ge=0, description=\"The context size.\")\n    n_batch: int = Field(\n        default=512, ge=1, description=\"The batch size to use per eval.\"\n    )\n    n_ubatch: int = Field(\n        default=512, ge=1, description=\"The physical batch size used by llama.cpp\"\n    )\n    n_threads: int = Field(\n        default=max(multiprocessing.cpu_count() // 2, 1),\n        ge=1,\n        description=\"The number of threads to use. Use -1 for max cpu threads\",\n    )\n    n_threads_batch: int = Field(\n        default=max(multiprocessing.cpu_count(), 1),\n        ge=0,\n        description=\"The number of threads to use when batch processing. 
Use -1 for max cpu threads\",\n    )\n    rope_scaling_type: int = Field(\n        default=llama_cpp.LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED\n    )\n    rope_freq_base: float = Field(default=0.0, description=\"RoPE base frequency\")\n    rope_freq_scale: float = Field(\n        default=0.0, description=\"RoPE frequency scaling factor\"\n    )\n    yarn_ext_factor: float = Field(default=-1.0)\n    yarn_attn_factor: float = Field(default=1.0)\n    yarn_beta_fast: float = Field(default=32.0)\n    yarn_beta_slow: float = Field(default=1.0)\n    yarn_orig_ctx: int = Field(default=0)\n    mul_mat_q: bool = Field(\n        default=True, description=\"if true, use experimental mul_mat_q kernels\"\n    )\n    logits_all: bool = Field(default=True, description=\"Whether to return logits.\")\n    embedding: bool = Field(default=False, description=\"Whether to use embeddings.\")\n    offload_kqv: bool = Field(\n        default=True, description=\"Whether to offload kqv to the GPU.\"\n    )\n    flash_attn: bool = Field(\n        default=False, description=\"Whether to use flash attention.\"\n    )\n    # Sampling Params\n    last_n_tokens_size: int = Field(\n        default=64,\n        ge=0,\n        description=\"Last n tokens to keep for repeat penalty calculation.\",\n    )\n    # LoRA Params\n    lora_base: Optional[str] = Field(\n        default=None,\n        description=\"Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.\",\n    )\n    lora_path: Optional[str] = Field(\n        default=None,\n        description=\"Path to a LoRA file to apply to the model.\",\n    )\n    # Backend Params\n    numa: Union[bool, int] = Field(\n        default=False,\n        description=\"Enable NUMA support.\",\n    )\n    # Chat Format Params\n    chat_format: Optional[str] = Field(\n        default=None,\n        description=\"Chat format to use.\",\n    )\n    clip_model_path: Optional[str] = Field(\n        default=None,\n    
    description=\"Path to a CLIP model to use for multi-modal chat completion.\",\n    )\n    # Cache Params\n    cache: bool = Field(\n        default=False,\n        description=\"Use a cache to reduce processing times for evaluated prompts.\",\n    )\n    cache_type: Literal[\"ram\", \"disk\"] = Field(\n        default=\"ram\",\n        description=\"The type of cache to use. Only used if cache is True.\",\n    )\n    cache_size: int = Field(\n        default=2 << 30,\n        description=\"The size of the cache in bytes. Only used if cache is True.\",\n    )\n    # Tokenizer Options\n    hf_tokenizer_config_path: Optional[str] = Field(\n        default=None,\n        description=\"The path to a HuggingFace tokenizer_config.json file.\",\n    )\n    hf_pretrained_model_name_or_path: Optional[str] = Field(\n        default=None,\n        description=\"The model name or path to a pretrained HuggingFace tokenizer model. Same as you would pass to AutoTokenizer.from_pretrained().\",\n    )\n    # Loading from HuggingFace Model Hub\n    hf_model_repo_id: Optional[str] = Field(\n        default=None,\n        description=\"The model repo id to use for the HuggingFace tokenizer model.\",\n    )\n    # Speculative Decoding\n    draft_model: Optional[str] = Field(\n        default=None,\n        description=\"Method to use for speculative decoding. 
One of (prompt-lookup-decoding).\",\n    )\n    draft_model_num_pred_tokens: int = Field(\n        default=10,\n        description=\"Number of tokens to predict using the draft model.\",\n    )\n    # KV Cache Quantization\n    type_k: Optional[int] = Field(\n        default=None,\n        description=\"Type of the key cache quantization.\",\n    )\n    type_v: Optional[int] = Field(\n        default=None,\n        description=\"Type of the value cache quantization.\",\n    )\n    # Misc\n    verbose: bool = Field(\n        default=True, description=\"Whether to print debug information.\"\n    )\n\n    @model_validator(\n        mode=\"before\"\n    )  # pre=True to ensure this runs before any other validation\n    def set_dynamic_defaults(self) -> Self:\n        # If n_threads or n_threads_batch is -1, set it to multiprocessing.cpu_count()\n        cpu_count = multiprocessing.cpu_count()\n        values = cast(Dict[str, int], self)\n        if values.get(\"n_threads\", 0) == -1:\n            values[\"n_threads\"] = cpu_count\n        if values.get(\"n_threads_batch\", 0) == -1:\n            values[\"n_threads_batch\"] = cpu_count\n        return self\n\n\nclass ServerSettings(BaseSettings):\n    \"\"\"Server settings used to configure the FastAPI and Uvicorn server.\"\"\"\n\n    # Uvicorn Settings\n    host: str = Field(default=\"localhost\", description=\"Listen address\")\n    port: int = Field(default=8000, description=\"Listen port\")\n    ssl_keyfile: Optional[str] = Field(\n        default=None, description=\"SSL key file for HTTPS\"\n    )\n    ssl_certfile: Optional[str] = Field(\n        default=None, description=\"SSL certificate file for HTTPS\"\n    )\n    # FastAPI Settings\n    api_key: Optional[str] = Field(\n        default=None,\n        description=\"API key for authentication. 
If set all requests need to be authenticated.\",\n    )\n    interrupt_requests: bool = Field(\n        default=True,\n        description=\"Whether to interrupt requests when a new request is received.\",\n    )\n    disable_ping_events: bool = Field(\n        default=False,\n        description=\"Disable EventSource pings (may be needed for some clients).\",\n    )\n    root_path: str = Field(\n        default=\"\",\n        description=\"The root path for the server. Useful when running behind a reverse proxy.\",\n    )\n\n\nclass Settings(ServerSettings, ModelSettings):\n    pass\n\n\nclass ConfigFileSettings(ServerSettings):\n    \"\"\"Configuration file format settings.\"\"\"\n\n    models: List[ModelSettings] = Field(default=[], description=\"Model configs\")\n"
  },
  {
    "path": "llama_cpp/server/types.py",
    "content": "from __future__ import annotations\n\nfrom typing import List, Optional, Union, Dict\nfrom typing_extensions import TypedDict, Literal\n\nfrom pydantic import BaseModel, Field\n\nimport llama_cpp\n\n\nmodel_field = Field(\n    description=\"The model to use for generating completions.\", default=None\n)\n\nmax_tokens_field = Field(\n    default=16, ge=1, description=\"The maximum number of tokens to generate.\"\n)\n\nmin_tokens_field = Field(\n    default=0,\n    ge=0,\n    description=\"The minimum number of tokens to generate. It may return fewer tokens if another condition is met (e.g. max_tokens, stop).\",\n)\n\ntemperature_field = Field(\n    default=0.8,\n    description=\"Adjust the randomness of the generated text.\\n\\n\"\n    + \"Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The default value is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run.\",\n)\n\ntop_p_field = Field(\n    default=0.95,\n    ge=0.0,\n    le=1.0,\n    description=\"Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P.\\n\\n\"\n    + \"Top-p sampling, also known as nucleus sampling, is another text generation method that selects the next token from a subset of tokens that together have a cumulative probability of at least p. This method provides a balance between diversity and quality by considering both the probabilities of tokens and the number of tokens to sample from. 
A higher value for top_p (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text.\",\n)\n\nmin_p_field = Field(\n    default=0.05,\n    ge=0.0,\n    le=1.0,\n    description=\"Sets a minimum base probability threshold for token selection.\\n\\n\"\n    + \"The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter min_p represents the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with min_p=0.05 and the most likely token having a probability of 0.9, tokens with a probability less than 0.045 are filtered out.\",\n)\n\nstop_field = Field(\n    default=None,\n    description=\"A list of tokens at which to stop generation. If None, no stop tokens are used.\",\n)\n\nstream_field = Field(\n    default=False,\n    description=\"Whether to stream the results as they are generated. Useful for chatbots.\",\n)\n\ntop_k_field = Field(\n    default=40,\n    ge=0,\n    description=\"Limit the next token selection to the K most probable tokens.\\n\\n\"\n    + \"Top-k sampling is a text generation method that selects the next token only from the top k most likely tokens predicted by the model. It helps reduce the risk of generating low-probability or nonsensical tokens, but it may also limit the diversity of the output. A higher value for top_k (e.g., 100) will consider more tokens and lead to more diverse text, while a lower value (e.g., 10) will focus on the most probable tokens and generate more conservative text.\",\n)\n\nrepeat_penalty_field = Field(\n    default=1.1,\n    ge=0.0,\n    description=\"A penalty applied to each token that is already generated. This helps prevent the model from repeating itself.\\n\\n\"\n    + \"Repeat penalty is a hyperparameter used to penalize the repetition of token sequences during text generation. 
It helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient.\",\n)\n\npresence_penalty_field = Field(\n    default=0.0,\n    ge=-2.0,\n    le=2.0,\n    description=\"Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.\",\n)\n\nfrequency_penalty_field = Field(\n    default=0.0,\n    ge=-2.0,\n    le=2.0,\n    description=\"Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.\",\n)\n\nmirostat_mode_field = Field(\n    default=0,\n    ge=0,\n    le=2,\n    description=\"Enable Mirostat constant-perplexity algorithm of the specified version (1 or 2; 0 = disabled)\",\n)\n\nmirostat_tau_field = Field(\n    default=5.0,\n    ge=0.0,\n    le=10.0,\n    description=\"Mirostat target entropy, i.e. the target perplexity - lower values produce focused and coherent text, larger values produce more diverse and less coherent text\",\n)\n\nmirostat_eta_field = Field(\n    default=0.1, ge=0.001, le=1.0, description=\"Mirostat learning rate\"\n)\n\ngrammar = Field(\n    default=None,\n    description=\"A GBNF grammar (as a string) to be used for formatting the model's output.\",\n)\n\n\nclass CreateCompletionRequest(BaseModel):\n    prompt: Union[str, List[str]] = Field(\n        default=\"\", description=\"The prompt to generate completions for.\"\n    )\n    suffix: Optional[str] = Field(\n        default=None,\n        description=\"A suffix to append to the generated text. If None, no suffix is appended. 
Useful for chatbots.\",\n    )\n    max_tokens: Optional[int] = Field(\n        default=16, ge=0, description=\"The maximum number of tokens to generate.\"\n    )\n    min_tokens: int = min_tokens_field\n    temperature: float = temperature_field\n    top_p: float = top_p_field\n    min_p: float = min_p_field\n    echo: bool = Field(\n        default=False,\n        description=\"Whether to echo the prompt in the generated text. Useful for chatbots.\",\n    )\n    stop: Optional[Union[str, List[str]]] = stop_field\n    stream: bool = stream_field\n    logprobs: Optional[int] = Field(\n        default=None,\n        ge=0,\n        description=\"The number of logprobs to generate. If None, no logprobs are generated.\",\n    )\n    presence_penalty: Optional[float] = presence_penalty_field\n    frequency_penalty: Optional[float] = frequency_penalty_field\n    logit_bias: Optional[Dict[str, float]] = Field(None)\n    seed: Optional[int] = Field(None)\n\n    # ignored or currently unsupported\n    model: Optional[str] = model_field\n    n: Optional[int] = 1\n    best_of: Optional[int] = 1\n    user: Optional[str] = Field(default=None)\n\n    # llama.cpp specific parameters\n    top_k: int = top_k_field\n    repeat_penalty: float = repeat_penalty_field\n    logit_bias_type: Optional[Literal[\"input_ids\", \"tokens\"]] = Field(None)\n    mirostat_mode: int = mirostat_mode_field\n    mirostat_tau: float = mirostat_tau_field\n    mirostat_eta: float = mirostat_eta_field\n    grammar: Optional[str] = None\n\n    model_config = {\n        \"json_schema_extra\": {\n            \"examples\": [\n                {\n                    \"prompt\": \"\\n\\n### Instructions:\\nWhat is the capital of France?\\n\\n### Response:\\n\",\n                    \"stop\": [\"\\n\", \"###\"],\n                }\n            ]\n        }\n    }\n\n\nclass CreateEmbeddingRequest(BaseModel):\n    model: Optional[str] = model_field\n    input: Union[str, List[str]] = Field(description=\"The input 
to embed.\")\n    user: Optional[str] = Field(default=None)\n\n    model_config = {\n        \"json_schema_extra\": {\n            \"examples\": [\n                {\n                    \"input\": \"The food was delicious and the waiter...\",\n                }\n            ]\n        }\n    }\n\n\nclass ChatCompletionRequestMessage(BaseModel):\n    role: Literal[\"system\", \"user\", \"assistant\", \"function\"] = Field(\n        default=\"user\", description=\"The role of the message.\"\n    )\n    content: Optional[str] = Field(\n        default=\"\", description=\"The content of the message.\"\n    )\n\n\nclass CreateChatCompletionRequest(BaseModel):\n    messages: List[llama_cpp.ChatCompletionRequestMessage] = Field(\n        default=[], description=\"A list of messages to generate completions for.\"\n    )\n    functions: Optional[List[llama_cpp.ChatCompletionFunction]] = Field(\n        default=None,\n        description=\"A list of functions to apply to the generated completions.\",\n    )\n    function_call: Optional[llama_cpp.ChatCompletionRequestFunctionCall] = Field(\n        default=None,\n        description=\"A function to apply to the generated completions.\",\n    )\n    tools: Optional[List[llama_cpp.ChatCompletionTool]] = Field(\n        default=None,\n        description=\"A list of tools to apply to the generated completions.\",\n    )\n    tool_choice: Optional[llama_cpp.ChatCompletionToolChoiceOption] = Field(\n        default=None,\n        description=\"A tool to apply to the generated completions.\",\n    )  # TODO: verify\n    max_tokens: Optional[int] = Field(\n        default=None,\n        description=\"The maximum number of tokens to generate. Defaults to inf\",\n    )\n    min_tokens: int = min_tokens_field\n    logprobs: Optional[bool] = Field(\n        default=False,\n        description=\"Whether to output the logprobs or not. 
Default is False\",\n    )\n    top_logprobs: Optional[int] = Field(\n        default=None,\n        ge=0,\n        description=\"The number of logprobs to generate. If None, no logprobs are generated. logprobs must be set to True.\",\n    )\n    temperature: float = temperature_field\n    top_p: float = top_p_field\n    min_p: float = min_p_field\n    stop: Optional[Union[str, List[str]]] = stop_field\n    stream: bool = stream_field\n    presence_penalty: Optional[float] = presence_penalty_field\n    frequency_penalty: Optional[float] = frequency_penalty_field\n    logit_bias: Optional[Dict[str, float]] = Field(None)\n    seed: Optional[int] = Field(None)\n    response_format: Optional[llama_cpp.ChatCompletionRequestResponseFormat] = Field(\n        default=None,\n    )\n\n    # ignored or currently unsupported\n    model: Optional[str] = model_field\n    n: Optional[int] = 1\n    user: Optional[str] = Field(None)\n\n    # llama.cpp specific parameters\n    top_k: int = top_k_field\n    repeat_penalty: float = repeat_penalty_field\n    logit_bias_type: Optional[Literal[\"input_ids\", \"tokens\"]] = Field(None)\n    mirostat_mode: int = mirostat_mode_field\n    mirostat_tau: float = mirostat_tau_field\n    mirostat_eta: float = mirostat_eta_field\n    grammar: Optional[str] = None\n\n    model_config = {\n        \"json_schema_extra\": {\n            \"examples\": [\n                {\n                    \"messages\": [\n                        ChatCompletionRequestMessage(\n                            role=\"system\", content=\"You are a helpful assistant.\"\n                        ).model_dump(),\n                        ChatCompletionRequestMessage(\n                            role=\"user\", content=\"What is the capital of France?\"\n                        ).model_dump(),\n                    ]\n                }\n            ]\n        }\n    }\n\n\nclass ModelData(TypedDict):\n    id: str\n    object: Literal[\"model\"]\n    owned_by: str\n    
permissions: List[str]\n\n\nclass ModelList(TypedDict):\n    object: Literal[\"list\"]\n    data: List[ModelData]\n\n\nclass TokenizeInputRequest(BaseModel):\n    model: Optional[str] = model_field\n    input: str = Field(description=\"The input to tokenize.\")\n\n    model_config = {\n        \"json_schema_extra\": {\"examples\": [{\"input\": \"How many tokens in this query?\"}]}\n    }\n\n\nclass TokenizeInputResponse(BaseModel):\n    tokens: List[int] = Field(description=\"A list of tokens.\")\n\n    model_config = {\"json_schema_extra\": {\"example\": {\"tokens\": [123, 321, 222]}}}\n\n\nclass TokenizeInputCountResponse(BaseModel):\n    count: int = Field(description=\"The number of tokens in the input.\")\n\n    model_config = {\"json_schema_extra\": {\"example\": {\"count\": 5}}}\n\n\nclass DetokenizeInputRequest(BaseModel):\n    model: Optional[str] = model_field\n    tokens: List[int] = Field(description=\"A list of tokens to detokenize.\")\n\n    model_config = {\"json_schema_extra\": {\"example\": [{\"tokens\": [123, 321, 222]}]}}\n\n\nclass DetokenizeInputResponse(BaseModel):\n    text: str = Field(description=\"The detokenized text.\")\n\n    model_config = {\n        \"json_schema_extra\": {\"example\": {\"text\": \"How many tokens in this query?\"}}\n    }\n"
  },
  {
    "path": "mkdocs.yml",
    "content": "site_name: llama-cpp-python\nrepo_url: https://github.com/abetlen/llama-cpp-python\n\ntheme:\n  name: material\n  palette: \n\n    # Palette toggle for light mode\n    - scheme: default\n      primary: indigo\n      toggle:\n        icon: material/brightness-7 \n        name: Switch to dark mode\n\n    # Palette toggle for dark mode\n    - scheme: slate\n      primary: indigo\n      toggle:\n        icon: material/brightness-4\n        name: Switch to light mode\n\nplugins:\n  - search\n  - mkdocstrings:\n      handlers:\n        python:\n          options:\n            members_order: source\n            group_by_category: false\n            signature_crossrefs: true\n            show_signature: true\n            docstring_section_style: list\n            show_root_heading: true\n            heading_level: 3\n            preload_modules:\n              - typing\n              - typing_extensions\n              - ctypes\n          import:\n            - https://docs.python.org/3/objects.inv\n            - https://numpy.org/doc/stable/objects.inv\n\nwatch:\n  - llama_cpp\n  - README.md\n\nnav:\n  - \"Getting Started\": \"index.md\"\n  - \"Installation Guides\":\n    - \"macOS (Metal)\": \"install/macos.md\"\n  - \"API Reference\": \"api-reference.md\"\n  - \"OpenAI Compatible Web Server\": \"server.md\"\n  - \"Changelog\": \"changelog.md\"\n\nmarkdown_extensions:\n  - attr_list\n  - pymdownx.emoji:\n      emoji_index: !!python/name:materialx.emoji.twemoji\n      emoji_generator: !!python/name:materialx.emoji.to_svg\n  - pymdownx.highlight:\n      anchor_linenums: true\n      line_spans: __span\n      pygments_lang_class: true\n  - pymdownx.inlinehilite\n  - pymdownx.magiclink:\n      repo_url_shorthand: true\n      user: abetlen\n      repo: llama-cpp-python\n  - pymdownx.snippets\n  - pymdownx.superfences\n  - pymdownx.tabbed:\n      alternate_style: true \n  - pymdownx.tilde\n  - tables\n"
  },
  {
    "path": "pyproject.toml",
    "content": "[build-system]\nrequires = [\"scikit-build-core[pyproject]>=0.9.2\"]\nbuild-backend = \"scikit_build_core.build\"\n\n[project]\nname = \"llama_cpp_python\"\ndynamic = [\"version\"]\ndescription = \"Python bindings for the llama.cpp library\"\nreadme = \"README.md\"\nlicense = { text = \"MIT\" }\nauthors = [\n    { name = \"Andrei Betlen\", email = \"abetlen@gmail.com\" },\n]\ndependencies = [\n    \"typing-extensions>=4.5.0\",\n    \"numpy>=1.20.0\",\n    \"diskcache>=5.6.1\",\n    \"jinja2>=2.11.3\",\n]\nrequires-python = \">=3.8\"\nclassifiers = [\n    \"Programming Language :: Python :: 3\",\n    \"Programming Language :: Python :: 3.8\",\n    \"Programming Language :: Python :: 3.9\",\n    \"Programming Language :: Python :: 3.10\",\n    \"Programming Language :: Python :: 3.11\",\n    \"Programming Language :: Python :: 3.12\",\n    \"Programming Language :: Python :: 3.13\",\n]\n\n\n[project.optional-dependencies]\nserver = [\n    \"uvicorn>=0.22.0\",\n    \"fastapi>=0.100.0\",\n    \"pydantic-settings>=2.0.1\",\n    \"sse-starlette>=1.6.1\",\n    \"starlette-context>=0.3.6,<0.4\",\n    \"PyYAML>=5.1\",\n]\ntest = [\n    \"pytest>=7.4.0\",\n    \"httpx>=0.24.1\",\n    \"scipy>=1.10\",\n    \"fastapi>=0.100.0\",\n    \"sse-starlette>=1.6.1\",\n    \"starlette-context>=0.3.6,<0.4\",\n    \"pydantic-settings>=2.0.1\",\n    \"huggingface-hub>=0.23.0\"\n]\ndev = [\n    \"black>=23.3.0\",\n    \"twine>=4.0.2\",\n    \"mkdocs>=1.4.3\",\n    \"mkdocstrings[python]>=0.22.0\",\n    \"mkdocs-material>=9.1.18\",\n    \"pytest>=7.4.0\",\n    \"httpx>=0.24.1\",\n]\nall = [\n    \"llama_cpp_python[server,test,dev]\",\n]\n\n[tool.scikit-build]\nwheel.packages = [\"llama_cpp\"]\ncmake.verbose = true\ncmake.minimum-version = \"3.21\"\nminimum-version = \"0.5.1\"\nsdist.include = [\".git\", \"vendor/llama.cpp/*\"]\n\n[tool.scikit-build.metadata.version]\nprovider = \"scikit_build_core.metadata.regex\"\ninput = 
\"llama_cpp/__init__.py\"\n\n[project.urls]\nHomepage = \"https://github.com/abetlen/llama-cpp-python\"\nIssues = \"https://github.com/abetlen/llama-cpp-python/issues\"\nDocumentation = \"https://llama-cpp-python.readthedocs.io/en/latest/\"\nChangelog = \"https://llama-cpp-python.readthedocs.io/en/latest/changelog/\"\n\n[tool.pytest.ini_options]\ntestpaths = \"tests\"\n"
  },
  {
    "path": "scripts/get-releases.sh",
    "content": "#!/bin/bash\n\n# Function to get all releases\nget_all_releases() {\n    local page=1\n    local per_page=100\n    local releases=\"\"\n    local new_releases\n\n    # Prepare headers\n    local headers=(-H \"Accept: application/vnd.github.v3+json\")\n    if [ -n \"$GITHUB_TOKEN\" ]; then\n        headers+=(-H \"Authorization: Bearer $GITHUB_TOKEN\")\n    fi\n\n    while true; do\n        response=$(curl -s \"${headers[@]}\" \\\n                        \"https://api.github.com/repos/abetlen/llama-cpp-python/releases?page=$page&per_page=$per_page\")\n        \n        # Check if the response is valid JSON\n        if ! echo \"$response\" | jq empty > /dev/null 2>&1; then\n            echo \"Error: Invalid response from GitHub API\" >&2\n            echo \"Response: $response\" >&2\n            return 1\n        fi\n\n        new_releases=$(echo \"$response\" | jq -r '.[].tag_name')\n        if [ -z \"$new_releases\" ]; then\n            break\n        fi\n        releases=\"$releases $new_releases\"\n        ((page++))\n    done\n\n    echo $releases\n}\n\n# Get all releases and save to file\nreleases=$(get_all_releases)\nif [ $? -ne 0 ]; then\n    echo \"Failed to fetch releases. Please check your internet connection and try again later.\" >&2\n    exit 1\nfi\n\necho \"$releases\" | tr ' ' '\\n' > all_releases.txt\n\necho \"All releases have been saved to all_releases.txt\"\n"
  },
  {
    "path": "scripts/releases-to-pep-503.sh",
    "content": "#!/bin/bash\n\n# Enable exit on error\nset -e\n\n# Function for logging\nlog_error() {\n    echo \"ERROR: $1\" >&2\n}\n\nlog_info() {\n    echo \"INFO: $1\"\n}\n\n# Get output directory or default to index/whl/cpu\noutput_dir=${1:-\"index/whl/cpu\"}\n\n# Get pattern from second arg or default to valid python package version pattern\npattern=${2:-\"^[v]?[0-9]+\\.[0-9]+\\.[0-9]+$\"}\n\n# Get the current directory (where the script is run from)\ncurrent_dir=\"$(pwd)\"\n\n# Check if all_releases.txt exists\nif [ ! -f \"$current_dir/all_releases.txt\" ]; then\n    log_error \"all_releases.txt not found in the current directory.\"\n    exit 1\nfi\n\n# Create output directory\nmkdir -p \"$output_dir\"\n\n# Create an index html file\ncat << EOF > \"$output_dir/index.html\"\n<!DOCTYPE html>\n<html>\n  <head></head>\n  <body>\n    <a href=\"llama-cpp-python/\">llama-cpp-python</a>\n    <br>\n  </body>\n</html>\n\nEOF\n\n# Create llama-cpp-python directory\nmkdir -p \"$output_dir/llama-cpp-python\"\n\n# Create an index html file in llama-cpp-python directory\ncat << EOF > \"$output_dir/llama-cpp-python/index.html\"\n<!DOCTYPE html>\n<html>\n  <body>\n    <h1>Links for llama-cpp-python</h1>\nEOF\n\n# Filter releases by pattern\nreleases=$(grep -E \"$pattern\" \"$current_dir/all_releases.txt\")\n\n# Prepare curl headers\nheaders=('--header' 'Accept: application/vnd.github.v3+json')\nif [ -n \"$GITHUB_TOKEN\" ]; then\n    headers+=('--header' \"authorization: Bearer $GITHUB_TOKEN\")\nfi\nheaders+=('--header' 'content-type: application/json')\n\n# For each release, get all assets\nfor release in $releases; do\n    log_info \"Processing release: $release\"\n    response=$(curl -s \"${headers[@]}\" \\\n                    \"https://api.github.com/repos/abetlen/llama-cpp-python/releases/tags/$release\")\n    \n    if [ -z \"$response\" ]; then\n        log_error \"Empty response from GitHub API for release $release\"\n        continue\n    fi\n\n    if ! 
echo \"$response\" | jq -e '.assets' > /dev/null 2>&1; then\n        log_error \"Invalid or unexpected response from GitHub API for release $release\"\n        log_error \"Response: $response\"\n        continue\n    fi\n\n    # Get release version from release ie v0.1.0-cu121 -> v0.1.0\n    release_version=$(echo \"$release\" | grep -oE \"^[v]?[0-9]+\\.[0-9]+\\.[0-9]+\")\n    echo \"    <h2>$release_version</h2>\" >> \"$output_dir/llama-cpp-python/index.html\"\n    \n    wheel_urls=$(echo \"$response\" | jq -r '.assets[] | select(.name | endswith(\".whl\")) | .browser_download_url')\n    if [ -z \"$wheel_urls\" ]; then\n        log_error \"No wheel files found for release $release\"\n        continue\n    fi\n\n    echo \"$wheel_urls\" | while read -r asset; do\n        echo \"    <a href=\\\"$asset\\\">$asset</a>\" >> \"$output_dir/llama-cpp-python/index.html\"\n        echo \"    <br>\" >> \"$output_dir/llama-cpp-python/index.html\"\n    done\ndone\n\necho \"  </body>\" >> \"$output_dir/llama-cpp-python/index.html\"\necho \"</html>\" >> \"$output_dir/llama-cpp-python/index.html\"\necho \"\" >> \"$output_dir/llama-cpp-python/index.html\"\n\nlog_info \"Index generation complete. Output directory: $output_dir\"\n"
  },
  {
    "path": "tests/test_llama.py",
    "content": "import ctypes\nimport multiprocessing\n\nimport numpy as np\nfrom scipy.special import log_softmax\n\nfrom huggingface_hub import hf_hub_download\n\nimport pytest\n\nimport llama_cpp\nimport llama_cpp._internals as internals\n\n\nMODEL = \"./vendor/llama.cpp/models/ggml-vocab-llama-spm.gguf\"\n\n\ndef test_llama_cpp_version():\n    assert llama_cpp.__version__\n\n\ndef test_llama_cpp_tokenization():\n    llama = llama_cpp.Llama(model_path=MODEL, vocab_only=True, verbose=False)\n\n    assert llama\n    assert llama._ctx.ctx is not None\n\n    text = b\"Hello World\"\n\n    tokens = llama.tokenize(text)\n    assert tokens[0] == llama.token_bos()\n    assert tokens == [1, 15043, 2787]\n    detokenized = llama.detokenize(tokens)\n    assert detokenized == text\n\n    tokens = llama.tokenize(text, add_bos=False)\n    assert tokens[0] != llama.token_bos()\n    assert tokens == [15043, 2787]\n\n    detokenized = llama.detokenize(tokens)\n    assert detokenized != text\n\n    text = b\"Hello World</s>\"\n    tokens = llama.tokenize(text)\n    assert tokens[-1] != llama.token_eos()\n    assert tokens == [1, 15043, 2787, 829, 29879, 29958]\n\n    tokens = llama.tokenize(text, special=True)\n    assert tokens[-1] == llama.token_eos()\n    assert tokens == [1, 15043, 2787, 2]\n\n    text = b\"\"\n    tokens = llama.tokenize(text, add_bos=True, special=True)\n    assert tokens[-1] != llama.token_eos()\n    assert tokens == [llama.token_bos()]\n    assert text == llama.detokenize(tokens)\n\n\n@pytest.fixture\ndef llama_cpp_model_path():\n    repo_id = \"Qwen/Qwen2-0.5B-Instruct-GGUF\"\n    filename = \"qwen2-0_5b-instruct-q8_0.gguf\"\n    model_path = hf_hub_download(repo_id, filename)\n    return model_path\n\n\ndef test_real_model(llama_cpp_model_path):\n    import os\n    assert os.path.exists(llama_cpp_model_path)\n\n    params = llama_cpp.llama_model_default_params()\n    params.use_mmap = llama_cpp.llama_supports_mmap()\n    params.use_mlock = 
llama_cpp.llama_supports_mlock()\n    params.check_tensors = False\n\n    model = internals.LlamaModel(path_model=llama_cpp_model_path, params=params)\n\n    cparams = llama_cpp.llama_context_default_params()\n    cparams.n_ctx = 16\n    cparams.n_batch = 16\n    cparams.n_ubatch = 16\n    cparams.n_threads = multiprocessing.cpu_count()\n    cparams.n_threads_batch = multiprocessing.cpu_count()\n    cparams.logits_all = False\n    cparams.flash_attn = True\n\n    context = internals.LlamaContext(model=model, params=cparams)\n    tokens = model.tokenize(b\"Hello, world!\", add_bos=True, special=True)\n\n    assert tokens == [9707, 11, 1879, 0]\n\n    tokens = model.tokenize(b\"The quick brown fox jumps\", add_bos=True, special=True)\n\n    batch = internals.LlamaBatch(n_tokens=len(tokens), embd=0, n_seq_max=1)\n\n    seed = 1337\n    sampler = internals.LlamaSampler()\n    sampler.add_top_k(50)\n    sampler.add_top_p(0.9, 1)\n    sampler.add_temp(0.8)\n    sampler.add_dist(seed)\n\n    result = tokens\n    n_eval = 0\n    for _ in range(4):\n        batch.set_batch(tokens, n_past=n_eval, logits_all=False)\n        context.decode(batch)\n        n_eval += len(tokens)\n        token_id = sampler.sample(context, -1)\n        tokens = [token_id]\n        result += tokens\n\n    output = result[5:]\n    output_text = model.detokenize(output, special=True)\n    assert output_text == b\" over the lazy dog\"\n\ndef test_real_llama(llama_cpp_model_path):\n    model = llama_cpp.Llama(\n        llama_cpp_model_path,\n        n_ctx=32,\n        n_batch=32,\n        n_ubatch=32,\n        n_threads=multiprocessing.cpu_count(),\n        n_threads_batch=multiprocessing.cpu_count(),\n        logits_all=False,\n        flash_attn=True,\n    )\n\n    output = model.create_completion(\n        \"The quick brown fox jumps\",\n        max_tokens=4,\n        top_k=50,\n        top_p=0.9,\n        temperature=0.8,\n        seed=1337\n    )\n    assert output[\"choices\"][0][\"text\"] == \" 
over the lazy dog\"\n\n\n    output = model.create_completion(\n        \"The capital of france is paris, 'true' or 'false'?:\\n\",\n        max_tokens=4,\n        top_k=50,\n        top_p=0.9,\n        temperature=0.8,\n        seed=1337,\n        grammar=llama_cpp.LlamaGrammar.from_string(\"\"\"\nroot ::= \"true\" | \"false\"\n\"\"\")\n    )\n    assert output[\"choices\"][0][\"text\"] == \"true\"\n\n    suffix = b\"rot\"\n    tokens = model.tokenize(suffix, add_bos=True, special=True)\n    def logit_processor_func(input_ids, logits):\n        for token in tokens:\n            logits[token] *= 1000\n        return logits\n\n    logit_processors = llama_cpp.LogitsProcessorList(\n        [logit_processor_func]\n    )\n\n    output = model.create_completion(\n        \"The capital of france is par\",\n        max_tokens=4,\n        top_k=50,\n        top_p=0.9,\n        temperature=0.8,\n        seed=1337,\n        logits_processor=logit_processors\n    )\n    assert output[\"choices\"][0][\"text\"].lower().startswith(\"rot\")\n\n    model.set_seed(1337)\n\n    state = model.save_state()\n\n    output = model.create_completion(\n        \"Pick a number from 1 to 10?:\\n\",\n        max_tokens=4,\n        top_k=50,\n        top_p=0.9,\n        temperature=0.8,\n        grammar=llama_cpp.LlamaGrammar.from_string(\"\"\"\nroot ::= \"1\" | \"2\" | \"3\" | \"4\" | \"5\" | \"6\" | \"7\" | \"8\" | \"9\" | \"10\"\n\"\"\")\n    )\n    number_1 = output[\"choices\"][0][\"text\"]\n\n    output = model.create_completion(\n        \"Pick a number from 1 to 10?:\\n\",\n        max_tokens=4,\n        top_k=50,\n        top_p=0.9,\n        temperature=0.8,\n        grammar=llama_cpp.LlamaGrammar.from_string(\"\"\"\nroot ::= \"1\" | \"2\" | \"3\" | \"4\" | \"5\" | \"6\" | \"7\" | \"8\" | \"9\" | \"10\"\n\"\"\")\n    )\n    number_2 = output[\"choices\"][0][\"text\"]\n\n    model.load_state(state)\n\n    output = model.create_completion(\n        \"Pick a number from 1 to 10?:\\n\",\n 
       max_tokens=4,\n        top_k=50,\n        top_p=0.9,\n        temperature=0.8,\n        grammar=llama_cpp.LlamaGrammar.from_string(\"\"\"\nroot ::= \"1\" | \"2\" | \"3\" | \"4\" | \"5\" | \"6\" | \"7\" | \"8\" | \"9\" | \"10\"\n\"\"\")\n    )\n    number_3 = output[\"choices\"][0][\"text\"]\n\n    assert number_1 != number_2\n    assert number_1 == number_3\n\n\ndef test_real_llama_embeddings(llama_cpp_model_path):\n    model = llama_cpp.Llama(\n        llama_cpp_model_path,\n        n_ctx=32,\n        n_batch=32,\n        n_ubatch=32,\n        n_threads=multiprocessing.cpu_count(),\n        n_threads_batch=multiprocessing.cpu_count(),\n        logits_all=False,\n        flash_attn=True,\n        embedding=True\n    )\n    # Smoke test for now\n    model.embed(\"Hello World\")\n"
  },
  {
    "path": "tests/test_llama_chat_format.py",
    "content": "import json\n\nimport jinja2\n\nfrom llama_cpp import (\n    ChatCompletionRequestUserMessage,\n)\nimport llama_cpp.llama_types as llama_types\nimport llama_cpp.llama_chat_format as llama_chat_format\n\nfrom llama_cpp.llama_chat_format import hf_tokenizer_config_to_chat_formatter\n\ndef test_mistral_instruct():\n    chat_template = \"{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}\"\n    chat_formatter = jinja2.Template(chat_template)\n    messages = [\n        llama_types.ChatCompletionRequestUserMessage(role=\"user\", content=\"Instruction\"),\n        llama_types.ChatCompletionRequestAssistantMessage(role=\"assistant\", content=\"Model answer\"),\n        llama_types.ChatCompletionRequestUserMessage(role=\"user\", content=\"Follow-up instruction\"),\n    ]\n    response = llama_chat_format.format_mistral_instruct(\n        messages=messages,\n    )\n    prompt = (\"\" if response.added_special else \"<s>\") + response.prompt\n    reference = chat_formatter.render(\n        messages=messages,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n    )\n    assert prompt == reference\n\n\nmistral_7b_tokenizer_config = \"\"\"{\n  \"add_bos_token\": true,\n  \"add_eos_token\": false,\n  \"added_tokens_decoder\": {\n    \"0\": {\n      \"content\": \"<unk>\",\n      \"lstrip\": false,\n      \"normalized\": false,\n      \"rstrip\": false,\n      \"single_word\": false,\n      \"special\": true\n    },\n    \"1\": {\n      \"content\": \"<s>\",\n      \"lstrip\": false,\n      \"normalized\": false,\n      \"rstrip\": false,\n      
\"single_word\": false,\n      \"special\": true\n    },\n    \"2\": {\n      \"content\": \"</s>\",\n      \"lstrip\": false,\n      \"normalized\": false,\n      \"rstrip\": false,\n      \"single_word\": false,\n      \"special\": true\n    }\n  },\n  \"additional_special_tokens\": [],\n  \"bos_token\": \"<s>\",\n  \"clean_up_tokenization_spaces\": false,\n  \"eos_token\": \"</s>\",\n  \"legacy\": true,\n  \"model_max_length\": 1000000000000000019884624838656,\n  \"pad_token\": null,\n  \"sp_model_kwargs\": {},\n  \"spaces_between_special_tokens\": false,\n  \"tokenizer_class\": \"LlamaTokenizer\",\n  \"unk_token\": \"<unk>\",\n  \"use_default_system_prompt\": false,\n  \"chat_template\": \"{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}\"\n}\"\"\"\n\n\ndef test_hf_tokenizer_config_str_to_chat_formatter():\n    tokenizer_config = json.loads(mistral_7b_tokenizer_config)\n    chat_formatter = hf_tokenizer_config_to_chat_formatter(\n        tokenizer_config\n    )\n    chat_formatter_response = chat_formatter(\n        messages=[\n            ChatCompletionRequestUserMessage(role=\"user\", content=\"Hello, world!\"),\n        ]\n    )\n\n    assert chat_formatter_response.prompt == (\"<s>[INST] Hello, world! [/INST]</s>\" \"\")\n"
  },
  {
    "path": "tests/test_llama_grammar.py",
    "content": "import llama_cpp\nimport json\n\ntree = \"\"\"\nleaf ::= \".\"\nnode ::= leaf | \"(\" node node \")\"\nroot ::= node\n\"\"\"\n\n\ndef test_grammar_from_string():\n    grammar = llama_cpp.LlamaGrammar.from_string(tree)\n    # assert grammar._n_rules == 3\n    # assert grammar._start_rule_index == 2\n    # assert grammar.grammar is not None\n\n\ndef test_composed_pydantic_grammar():\n    \"\"\"\n    from pydantic import BaseModel\n\n    class A(BaseModel):\n        a: int\n\n    class B(BaseModel):\n        a: A\n        b: int\n    \"\"\"\n\n    # This schema corresponds to the grammar in the comment above.\n    # We don't use the pydantic models directly to avoid the dependency.\n    schema = {\n        \"$defs\": {\n            \"A\": {\n                \"properties\": {\"a\": {\"title\": \"A\", \"type\": \"integer\"}},\n                \"required\": [\"a\"],\n                \"title\": \"A\",\n                \"type\": \"object\",\n            }\n        },\n        \"properties\": {\n            \"a\": {\"$ref\": \"#/$defs/A\"},\n            \"b\": {\"title\": \"B\", \"type\": \"integer\"},\n        },\n        \"required\": [\"a\", \"b\"],\n        \"title\": \"B\",\n        \"type\": \"object\",\n    }\n\n    grammar = llama_cpp.LlamaGrammar.from_json_schema(json.dumps(schema))\n\n    # assert grammar.grammar is not None\n\n\ndef test_grammar_anyof():\n    sch = {\n        \"properties\": {\n            \"temperature\": {\n                \"description\": \"The temperature mentioned\",\n                \"type\": \"number\",\n            },\n            \"unit\": {\n                \"anyOf\": [\n                    {\n                        \"description\": \"Unit for temperature\",\n                        \"enum\": [\"celsius\", \"fahrenheit\"],\n                        \"type\": \"string\",\n                    },\n                    {\"type\": \"null\"},\n                ],\n            },\n        },\n        \"type\": \"object\",\n    
}\n\n    grammar = llama_cpp.LlamaGrammar.from_json_schema(json.dumps(sch))\n\n    # assert grammar.grammar is not None\n"
  },
  {
    "path": "tests/test_llama_speculative.py",
    "content": "import numpy as np\n\nfrom llama_cpp.llama_speculative import LlamaPromptLookupDecoding\n\ndef test_find_candidate_pred_tokens():\n    find_candidate_pred_tokens = LlamaPromptLookupDecoding.find_candidate_pred_tokens\n\n    # Test Case 1: Matching ngram is found\n    input_ids1 = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])\n    result1 = find_candidate_pred_tokens(input_ids1, max_ngram_size=3, num_pred_tokens=2)\n    assert np.array_equal(result1, np.array([1, 2]))\n\n    # Test Case 2: Matching ngram is not found\n    input_ids2 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])\n    result2 = find_candidate_pred_tokens(input_ids2, max_ngram_size=3, num_pred_tokens=2)\n    assert np.array_equal(result2, np.array([]))\n"
  }
]