Repository: bitsandbytes-foundation/bitsandbytes
Branch: main
Commit: ecf9ca104274
Files: 178
Total size: 1.5 MB

Directory structure:
gitextract_l_yjgk53/

├── .clang-format
├── .editorconfig
├── .git-blame-ignore-revs
├── .gitattributes
├── .github/
│   ├── FUNDING.yml
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug-report.yml
│   │   └── feature-request.yml
│   ├── dependabot.yml.disabled
│   ├── scripts/
│   │   ├── auditwheel_show.py
│   │   ├── build-cpu.sh
│   │   ├── build-cuda.sh
│   │   ├── build-rocm.sh
│   │   ├── build-xpu-windows.bat
│   │   ├── build-xpu.sh
│   │   └── set_platform_tag.py
│   └── workflows/
│       ├── build_documentation.yml
│       ├── build_pr_documentation.yml
│       ├── lint.yml
│       ├── python-package.yml
│       ├── stale.yml.disabled
│       ├── test-runner.yml
│       ├── tests-nightly.yml
│       ├── tests-pr.yml
│       └── upload_pr_documentation.yml
├── .gitignore
├── .pre-commit-config.yaml
├── .vscode/
│   ├── extensions.json
│   └── settings.json
├── CHANGELOG.md
├── CLAUDE.md
├── CMakeLists.txt
├── CODE_OF_CONDUCT.md
├── COMPILE_H100_L40.md
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── NOTICE.md
├── README.md
├── SECURITY.md
├── _typos.toml
├── agents/
│   ├── api_surface.md
│   ├── architecture_guide.md
│   ├── code_standards.md
│   ├── dispatch_guide.md
│   ├── downstream_integrations.md
│   ├── fetch_issues.py
│   ├── github_tools_guide.md
│   ├── issue_maintenance_guide.md
│   ├── issue_patterns.md
│   ├── issue_triage_workflow.md
│   ├── linting_guide.md
│   ├── pr_review_guide.md
│   ├── query_issues.py
│   ├── security_guide.md
│   ├── testing_guide.md
│   └── worktree_guide.md
├── benchmarking/
│   ├── README.md
│   ├── inference_benchmark.py
│   ├── int8/
│   │   ├── int8_benchmark.py
│   │   └── training_benchmark.py
│   ├── matmul_benchmark.py
│   ├── optimizer_benchmark.py
│   └── xpu/
│       └── inference_benchmark.py
├── bitsandbytes/
│   ├── __init__.py
│   ├── __main__.py
│   ├── _ops.py
│   ├── autograd/
│   │   ├── __init__.py
│   │   └── _functions.py
│   ├── backends/
│   │   ├── __init__.py
│   │   ├── cpu/
│   │   │   ├── __init__.py
│   │   │   └── ops.py
│   │   ├── cuda/
│   │   │   ├── __init__.py
│   │   │   └── ops.py
│   │   ├── default/
│   │   │   ├── __init__.py
│   │   │   └── ops.py
│   │   ├── hpu/
│   │   │   ├── __init__.py
│   │   │   └── ops.py
│   │   ├── mps/
│   │   │   ├── __init__.py
│   │   │   └── ops.py
│   │   ├── triton/
│   │   │   ├── __init__.py
│   │   │   ├── kernels_4bit.py
│   │   │   ├── kernels_8bit_quant.py
│   │   │   ├── kernels_optim.py
│   │   │   └── ops.py
│   │   ├── utils.py
│   │   └── xpu/
│   │       ├── __init__.py
│   │       └── ops.py
│   ├── cextension.py
│   ├── consts.py
│   ├── cuda_specs.py
│   ├── diagnostics/
│   │   ├── __init__.py
│   │   ├── cuda.py
│   │   ├── main.py
│   │   └── utils.py
│   ├── functional.py
│   ├── nn/
│   │   ├── __init__.py
│   │   ├── modules.py
│   │   └── parametrize.py
│   ├── optim/
│   │   ├── __init__.py
│   │   ├── adagrad.py
│   │   ├── adam.py
│   │   ├── adamw.py
│   │   ├── ademamix.py
│   │   ├── lamb.py
│   │   ├── lars.py
│   │   ├── lion.py
│   │   ├── optimizer.py
│   │   ├── rmsprop.py
│   │   └── sgd.py
│   ├── py.typed
│   └── utils.py
├── check_bnb_install.py
├── csrc/
│   ├── common.cuh
│   ├── common.h
│   ├── compat.cuh
│   ├── compat_device.cuh
│   ├── cpu_ops.cpp
│   ├── cpu_ops.h
│   ├── kernels.cu
│   ├── kernels.cuh
│   ├── mps_kernels.metal
│   ├── mps_ops.mm
│   ├── ops.cu
│   ├── ops.cuh
│   ├── pythonInterface.cpp
│   ├── xpu_kernels.cpp
│   ├── xpu_kernels.h
│   ├── xpu_ops.cpp
│   └── xpu_ops.h
├── docs/
│   └── source/
│       ├── _toctree.yml
│       ├── contributing.mdx
│       ├── errors.mdx
│       ├── explanations/
│       │   ├── optimizers.mdx
│       │   └── resources.mdx
│       ├── faqs.mdx
│       ├── fsdp_qlora.md
│       ├── index.mdx
│       ├── installation.mdx
│       ├── integrations.mdx
│       ├── optimizers.mdx
│       ├── quickstart.mdx
│       └── reference/
│           ├── functional.mdx
│           ├── nn/
│           │   ├── embeddings.mdx
│           │   ├── linear4bit.mdx
│           │   └── linear8bit.mdx
│           └── optim/
│               ├── adagrad.mdx
│               ├── adam.mdx
│               ├── adamw.mdx
│               ├── ademamix.mdx
│               ├── lamb.mdx
│               ├── lars.mdx
│               ├── lion.mdx
│               ├── optim_overview.mdx
│               ├── rmsprop.mdx
│               └── sgd.mdx
├── examples/
│   ├── compile_inference.py
│   ├── int8_inference_huggingface.py
│   └── xpu/
│       ├── benchmark_paged_memory.py
│       └── paged_xpu_training.py
├── install_cuda.py
├── install_cuda.sh
├── pyproject.toml
├── scripts/
│   └── stale.py
├── setup.py
└── tests/
    ├── __init__.py
    ├── conftest.py
    ├── fsdp_state_dict_save.py
    ├── helpers.py
    ├── test_autograd.py
    ├── test_cuda_setup_evaluator.py
    ├── test_functional.py
    ├── test_generation.py
    ├── test_linear4bit.py
    ├── test_linear8bitlt.py
    ├── test_modules.py
    ├── test_ops.py
    ├── test_optim.py
    └── test_parametrize.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .clang-format
================================================
---
BasedOnStyle: LLVM
AlignAfterOpenBracket: BlockIndent
BinPackArguments: true
BinPackParameters: true
BracedInitializerIndentWidth: 4
ColumnLimit: 120
Cpp11BracedListStyle: true
IndentWidth: 4
IndentWrappedFunctionNames: true
PointerAlignment: Left
SeparateDefinitionBlocks: Always
Standard: c++17
StatementMacros:
  - 'MAKE_PreconditionOptimizer32bit1State'
  - 'MAKE_PreconditionOptimizer32bit2State'
  - 'MAKE_PreconditionStatic8bit1State'
  - 'MAKE_PreconditionStatic8bit2State'
  - 'MAKE_Optimizer32bit1State'
  - 'MAKE_optimizerStatic8bit1State'
  - 'MAKE_optimizerStatic8bit2State'
  - 'MAKE_OptimizerStatic8bit1StateBlockwise'
  - 'MAKE_OptimizerStatic8bit2StateBlockwise'
  - 'MAKE_kQuantizeBlockwise'
  - 'MAKE_kQuantizeBlockwiseSmall'
  - 'MAKE_BLOCKWISE8'
  - 'MAKE_ELEMENTWISE_FUNC'
  - 'CMAKE_ELEMENTWISE_FUNC'
  - 'MAKE_FUNC8'
  - 'MAKE_FUNC32'
  - 'MAKE_CBLOCKWISE8'
  - 'MAKE_CFUNC8'
  - 'MAKE_CFUNC32'

UseTab: Never

...


================================================
FILE: .editorconfig
================================================
[*]
trim_trailing_whitespace = true
insert_final_newline = true


================================================
FILE: .git-blame-ignore-revs
================================================
# ran black and isort for coherent code formatting
bfa0e33294f2b1dc25e65a33be2397f989824298

# reran black with linelength 80 for greater readability
ea7c14f8ef64924f2d0ff80df3cdabf2c7299848

# Remove f-prefix from strings that don't use formatting
7727fa4c8c6c1ef2b109120aff4196a0a6bf3ed6

# format tests/linear_4bit.py
34735ba89de8235ea9da6ef409f814dcea9e2038

# Reformat with ruff-format
5a4263f4dc05fe8f78f4111beab9f68a81deeab1

# CHANGELOG: to reverse chron order + mdformat
4743ff0d43e04e4cc3e5d8b9e7cd016c0defa36d

# Apply clang-format
4955d136ae083c2be1236d8915913166e1790aad


================================================
FILE: .gitattributes
================================================
*.bat text eol=crlf


================================================
FILE: .github/FUNDING.yml
================================================
open_collective: bitsandbytes


================================================
FILE: .github/ISSUE_TEMPLATE/bug-report.yml
================================================
name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve bitsandbytes
body:
  - type: textarea
    id: system-info
    attributes:
      label: System Info
      description: Please share your relevant system information with us
      placeholder: platform, python version, hardware, ...
    validations:
      required: true

  - type: textarea
    id: reproduction
    validations:
      required: true
    attributes:
      label: Reproduction
      description: |
        Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
        Please provide the simplest possible reproducer so that we can fix the issue quickly.

      placeholder: |
        Reproducer:

  - type: textarea
    id: expected-behavior
    validations:
      required: true
    attributes:
      label: Expected behavior
      description: "A clear and concise description of what you would expect to happen."


================================================
FILE: .github/ISSUE_TEMPLATE/feature-request.yml
================================================
name: "\U0001F680 Feature request"
description: Submit a proposal/request for a new feature
labels: ["feature"]
body:
  - type: textarea
    id: feature-request
    validations:
      required: true
    attributes:
      label: Feature request
      description: |
        A clear and concise description of the feature proposal.

  - type: textarea
    id: motivation
    validations:
      required: true
    attributes:
      label: Motivation
      description: |
        Please outline the motivation for the proposal. Is your feature request related to a problem?

  - type: textarea
    id: contribution
    validations:
      required: true
    attributes:
      label: Your contribution
      description: |
        Is there any way that you could help, e.g. by submitting a PR?


================================================
FILE: .github/dependabot.yml.disabled
================================================
version: 2
updates:
  - package-ecosystem: pip
    directory: "/"
    schedule:
      interval: "weekly"
    groups:
      major:
        update-types: [major]
      minor-patch:
        update-types: [minor, patch]


================================================
FILE: .github/scripts/auditwheel_show.py
================================================
import argparse
import subprocess


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("wheels", nargs="*")
    args = ap.parse_args()
    if not args.wheels:
        ap.error("At least one wheel must be provided.")
    for whl in args.wheels:
        print(f"### `{whl}`")

        audit_wheel_output = subprocess.run(
            ["auditwheel", "show", whl],
            capture_output=True,
            text=True,
            errors="backslashreplace",
        )

        if audit_wheel_output.stdout:
            print(audit_wheel_output.stdout)

        if audit_wheel_output.stderr:
            print(f"**Error:**\n```\n{audit_wheel_output.stderr}\n```")

        print("---")


if __name__ == "__main__":
    main()
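
# Illustrative note (not part of the repository): the script above captures
# tool output as text with errors="backslashreplace", so bytes that cannot be
# decoded are escaped instead of raising. A minimal sketch of that behavior
# follows. `run_and_capture` is a hypothetical helper name, the child process
# is a stand-in for `auditwheel show`, and encoding="utf-8" is an assumption
# added here only to make the demo deterministic.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (stdout, stderr) as text, escaping undecodable bytes."""
    result = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        encoding="utf-8",  # assumption: pinned so the result does not depend on the locale
        errors="backslashreplace",
    )
    return result.stdout, result.stderr


# Hypothetical stand-in for `auditwheel show`: the child writes one byte (0xff)
# that is not valid UTF-8.
child = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'ok\\xff')"]
stdout, stderr = run_and_capture(child)
print(repr(stdout))
```

# The invalid byte arrives as the four characters \xff rather than aborting
# the summary, which is presumably why the script chooses this error handler.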


================================================
FILE: .github/scripts/build-cpu.sh
================================================
#!/bin/bash
declare build_arch
declare build_os

set -xeuo pipefail

pip install cmake==3.28.3

if [ "${build_os:0:5}" == macos ] && [ "${build_arch}" == aarch64 ]; then
	cmake -DCMAKE_OSX_ARCHITECTURES=arm64 -DCOMPUTE_BACKEND=cpu .
else
	cmake -DCOMPUTE_BACKEND=cpu .
fi
cmake --build . --config Release

output_dir="output/${build_os}/${build_arch}"
mkdir -p "${output_dir}"
(shopt -s nullglob && cp bitsandbytes/*.{so,dylib,dll} "${output_dir}")


================================================
FILE: .github/scripts/build-cuda.sh
================================================
#!/bin/bash
declare build_arch
declare build_os
declare cuda_version
declare cuda_targets

set -xeuo pipefail

if [[ -v cuda_targets ]]; then
    build_capability="${cuda_targets}"
elif [ "${build_arch}" = "aarch64" ]; then
    build_capability="75;80;90"

    # CUDA 12.8-12.9: Add sm100/sm120
    [[ "${cuda_version}" == 12.8.* || "${cuda_version}" == 12.9.* ]] && build_capability="75;80;90;100;120"

    # CUDA 13.0+: Add sm100/sm110/sm120/sm121
    [[ "${cuda_version}" == 13.*.* ]] && build_capability="75;80;90;100;110;120;121"
else
    # By default, target Pascal through Hopper.
    build_capability="60;70;75;80;86;89;90"

    # CUDA 12.8+: Add sm100 and sm120; remove < sm70 to align with PyTorch 2.8+cu128 minimum
    [[ "${cuda_version}" == 12.8.* || "${cuda_version}" == 12.9.* ]] && build_capability="70;75;80;86;89;90;100;120"

    # CUDA 13.0+: Remove < sm75 to align with PyTorch 2.9+cu130 minimum
    [[ "${cuda_version}" == 13.*.* ]] && build_capability="75;80;86;89;90;100;120"
fi

[[ "${build_os}" = windows-* ]] && python3 -m pip install ninja

if [ "${build_os:0:6}" == ubuntu ]; then
    # We'll use Rocky Linux 8 in order to maintain manylinux 2.24 compatibility.
    image="nvidia/cuda:${cuda_version}-devel-rockylinux8"
    echo "Using image $image"

    docker run -i -w /src -v "$PWD:/src" "$image" bash -c \
        "dnf -y --refresh update --security \
        && dnf -y install cmake gcc-toolset-11 --setopt=install_weak_deps=False --setopt=tsflags=nodocs \
        && source scl_source enable gcc-toolset-11 \
        && cmake -DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY=\"${build_capability}\" . \
        && cmake --build . --config Release"
else
    pip install cmake==3.28.3
    cmake -G Ninja -DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY="${build_capability}" -DCMAKE_BUILD_TYPE=Release -S .
    cmake --build . --config Release
fi


output_dir="output/${build_os}/${build_arch}"
mkdir -p "${output_dir}"
(shopt -s nullglob && cp bitsandbytes/*.{so,dylib,dll} "${output_dir}")
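
# Illustrative note (not part of the repository): the capability selection in
# the script above can be mirrored in Python to check which targets a given
# CUDA version would receive. This is a minimal sketch under assumptions:
# `select_build_capability` is a hypothetical helper name, `None` stands in
# for an unset `cuda_targets`, and fnmatch patterns stand in for the bash
# `[[ ... == 12.8.* ]]` glob comparisons.

```python
from fnmatch import fnmatchcase


def select_build_capability(build_arch, cuda_version, cuda_targets=None):
    """Return the semicolon-separated compute capability list for a build."""
    # An explicit override wins, as with `[[ -v cuda_targets ]]` in the script.
    if cuda_targets is not None:
        return cuda_targets

    is_12_8_or_12_9 = fnmatchcase(cuda_version, "12.8.*") or fnmatchcase(cuda_version, "12.9.*")
    is_13 = fnmatchcase(cuda_version, "13.*.*")

    if build_arch == "aarch64":
        if is_13:
            return "75;80;90;100;110;120;121"
        if is_12_8_or_12_9:
            return "75;80;90;100;120"
        return "75;80;90"

    # Non-aarch64 builds: Pascal through Hopper by default.
    if is_13:
        return "75;80;86;89;90;100;120"
    if is_12_8_or_12_9:
        return "70;75;80;86;89;90;100;120"
    return "60;70;75;80;86;89;90"


for version in ("11.8.0", "12.8.1", "13.0.2"):
    print(version, select_build_capability("x86_64", version))
```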


================================================
FILE: .github/scripts/build-rocm.sh
================================================
#!/bin/bash
declare build_arch
declare build_os
declare rocm_version

set -xeuo pipefail
bnb_rocm_arch="gfx90a;gfx942;gfx1100;gfx1101;gfx1102;gfx1103"

# ROCm 6.4+ - Add RDNA4 and RDNA3.5 targets. Note we assume >=6.4.4.
[[ "${rocm_version}" == 6.4.* || "${rocm_version}" == 7.* ]] && bnb_rocm_arch="${bnb_rocm_arch};gfx1150;gfx1151;gfx1152;gfx1153;gfx1200;gfx1201"

# ROCm 7.0+ - Add gfx950
[[ "${rocm_version}" == 7.* ]] && bnb_rocm_arch="${bnb_rocm_arch};gfx950"

if [ "${build_os:0:6}" == ubuntu ]; then
    image=rocm/dev-ubuntu-22.04:${rocm_version}-complete
    echo "Using image $image"
    docker run --rm --platform "linux/$build_arch" -i \
        -w /src -v "$PWD:/src" "$image" sh -c \
        "apt-get update \
      && pip install cmake==3.31.6 \
      && cmake -DCOMPUTE_BACKEND=hip -DCMAKE_BUILD_TYPE=MinSizeRel -DCMAKE_HIP_FLAGS=\"--offload-compress\" -DBNB_ROCM_ARCH=\"${bnb_rocm_arch}\" . \
      && cmake --build ."
fi

output_dir="output/${build_os}/${build_arch}"
mkdir -p "${output_dir}"
(shopt -s nullglob && cp bitsandbytes/*.{so,dylib,dll} "${output_dir}")
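
# Illustrative note (not part of the repository): unlike the CUDA script,
# the ROCm target list above is built cumulatively, with each version check
# appending to the base list. A minimal sketch of that accumulation follows.
# `rocm_arch_list` and the two constant names are hypothetical; the
# architecture names and version patterns are copied from the script.

```python
from fnmatch import fnmatchcase

# Targets built for every ROCm version in the script.
BASE_ARCHS = ["gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1102", "gfx1103"]
# RDNA4 and RDNA3.5 targets appended for ROCm 6.4.x and 7.x.
NEWER_RDNA_ARCHS = ["gfx1150", "gfx1151", "gfx1152", "gfx1153", "gfx1200", "gfx1201"]


def rocm_arch_list(rocm_version):
    """Return the semicolon-separated architecture list for a ROCm version."""
    archs = list(BASE_ARCHS)
    if fnmatchcase(rocm_version, "6.4.*") or fnmatchcase(rocm_version, "7.*"):
        archs += NEWER_RDNA_ARCHS
    if fnmatchcase(rocm_version, "7.*"):
        archs.append("gfx950")
    return ";".join(archs)


for version in ("6.2.4", "6.4.4", "7.1"):
    print(version, rocm_arch_list(version))
```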


================================================
FILE: .github/scripts/build-xpu-windows.bat
================================================
set INTEL_DLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/75d4eb97-914a-4a95-852c-7b9733d80f74/intel-deep-learning-essentials-2025.1.3.8_offline.exe
set INTEL_DLE_TMP=%RUNNER_TEMP%\intel_dle
set INTEL_DLE_LOG=%RUNNER_TEMP%\intel_dle_log.txt

echo ::group::Intel Deep Learning Essentials Installation
curl -o intel-dle-installer.exe %INTEL_DLE_URL%
start /wait "Intel DLE Install" intel-dle-installer.exe -f %INTEL_DLE_TMP% -l %INTEL_DLE_LOG% --silent -a --eula=accept -p=NEED_VS2022_INTEGRATION=0
type %INTEL_DLE_LOG%
if ERRORLEVEL 1 (
    echo Failed to install Intel Deep Learning Essentials
    exit /b 1
)
echo ::endgroup::

echo ::group::Build Environment Setup
call "%ProgramFiles(x86)%\Intel\oneAPI\setvars.bat"
cmake -G Ninja -DCOMPUTE_BACKEND=xpu -DCMAKE_BUILD_TYPE=Release .
if ERRORLEVEL 1 (
    echo Failed to setup environment
    exit /b 1
)
echo ::endgroup::

echo ::group::Building with XPU backend
cmake --build . --config Release
if ERRORLEVEL 1 (
    echo Build failed
    exit /b 1
)
echo ::endgroup::

set output_dir=output\%build_os%\x86_64
if not exist "%output_dir%" mkdir "%output_dir%"
copy bitsandbytes\*.dll "%output_dir%\" 2>nul


================================================
FILE: .github/scripts/build-xpu.sh
================================================
#!/bin/bash
declare build_os

set -xeuo pipefail

# We currently only build XPU on Linux.
if [ "${build_os:0:6}" == ubuntu ]; then
    # TODO: We might want to pre-build this as our own customized image in the future.
    image=intel/deep-learning-essentials:2025.1.3-0-devel-ubuntu22.04
    echo "Using image $image"
    docker run --rm -i \
        -w /src -v "$PWD:/src" "$image" sh -c \
        "apt-get update \
      && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        cmake bison intel-fw-gpu intel-ocloc \
      && cmake -DCOMPUTE_BACKEND=xpu . \
      && cmake --build . --config Release"
fi

output_dir="output/${build_os}/x86_64"
mkdir -p "${output_dir}"
(shopt -s nullglob && cp bitsandbytes/*.{so,dylib,dll} "${output_dir}")


================================================
FILE: .github/scripts/set_platform_tag.py
================================================
import argparse
import platform
import sys


def get_platform_tag(architecture):
    system = platform.system()

    if system == "Linux":
        tag = "manylinux_2_24_x86_64" if architecture == "x86_64" else "manylinux_2_24_aarch64"
    elif system == "Darwin":
        tag = "macosx_14_0_arm64"
    elif system == "Windows":
        tag = "win_amd64" if architecture == "x86_64" else "win_arm64"
    else:
        sys.exit(f"Unsupported system: {system}")

    return tag


def main():
    parser = argparse.ArgumentParser(description="Determine platform tag.")
    parser.add_argument("arch", type=str, help="Architecture (e.g., x86_64, aarch64)")
    args = parser.parse_args()

    tag = get_platform_tag(args.arch)

    print(tag)  # This will be captured by the GitHub Actions workflow


if __name__ == "__main__":
    main()
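
# Illustrative note (not part of the repository): the tag printed above is
# passed to `wheel tags` in python-package.yml together with the python tag
# `py3` and the ABI tag `none`. A minimal sketch of how those pieces form the
# final wheel filename follows. `wheel_filename` is a hypothetical helper, and
# the sketch omits the optional build tag that the wheel naming scheme allows.

```python
def wheel_filename(version, platform_tag, python_tag="py3", abi_tag="none"):
    """Compose a wheel filename as name-version-python-abi-platform.whl."""
    return f"bitsandbytes-{version}-{python_tag}-{abi_tag}-{platform_tag}.whl"


# The placeholder version and platform tag used by the continuous release.
print(wheel_filename("1.33.7.preview", "manylinux_2_24_x86_64"))
```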


================================================
FILE: .github/workflows/build_documentation.yml
================================================
name: Build documentation

on:
  push:
    branches:
      - main
      - doc-builder*
      - v*-release

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: bitsandbytes
      repo_owner: bitsandbytes-foundation
      # avoid /src suffix leading to wrong links, like bitsandbytes/blob/main/src/bitsandbytes/nn/
      version_tag_suffix: ''  # defaults to '/src'
      custom_container: huggingface/transformers-doc-builder
    secrets:
      hf_token: ${{ secrets.HUGGINGFACE_PUSH }}


================================================
FILE: .github/workflows/build_pr_documentation.yml
================================================
name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    if: github.repository == 'bitsandbytes-foundation/bitsandbytes'
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: bitsandbytes
      repo_owner: bitsandbytes-foundation
      # avoid /src suffix leading to wrong links, like bitsandbytes/blob/main/src/bitsandbytes/nn/
      version_tag_suffix: ''  # defaults to '/src'
      custom_container: huggingface/transformers-doc-builder


================================================
FILE: .github/workflows/lint.yml
================================================
name: Lint

on:
  push:
    branches:
      - main
  pull_request:

jobs:
  Lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: "3.12"
      - uses: pre-commit/action@v3.0.0
        env:
          RUFF_OUTPUT_FORMAT: github


================================================
FILE: .github/workflows/python-package.yml
================================================
name: Python package

on:
  push: {}
  pull_request:
    branches: [main]
    paths:
      - ".github/workflows/python-package.yml"
      - ".github/scripts/**"
      - "bitsandbytes/**"
      - "csrc/**"
      - "include/**"
      - "tests/**"
      - "CMakeLists.txt"
      - "MANIFEST.in"
      - "setup.py"
      - "pyproject.toml"
  release:
    types: [published]
  workflow_dispatch: {} # Allow manual trigger
  workflow_call: {} # Allow triggering from other workflows

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  ##
  # This job matrix builds the CPU versions of the libraries for all supported platforms.
  ##
  build-cpu:
    strategy:
      matrix:
        include:
          - os: ubuntu-22.04
            arch: x86_64
          - os: ubuntu-22.04-arm
            arch: aarch64
          - os: windows-2025
            arch: x86_64
          - os: macos-15
            arch: arm64
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup MSVC
        if: startsWith(matrix.os, 'windows')
        uses: ilammy/msvc-dev-cmd@v1.13.0 # to use cl
      - name: Build C++
        run: bash .github/scripts/build-cpu.sh
        env:
          build_os: ${{ matrix.os }}
          build_arch: ${{ matrix.arch }}
      - name: Upload build artifact
        uses: actions/upload-artifact@v4
        with:
          name: shared_library_${{ matrix.os }}_${{ matrix.arch }}
          path: output/*
          retention-days: 7

  ##
  # This job matrix builds the CUDA versions of the libraries for platforms that support CUDA (Linux x64/aarch64 + Windows x64)
  ##
  build-cuda:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-22.04, ubuntu-22.04-arm, windows-2025]
        include:
          - os: ubuntu-22.04
            arch: x86_64
          - os: ubuntu-22.04-arm
            arch: aarch64
          - os: windows-2025
            arch: x86_64
        cuda_version:
          ["11.8.0", "12.0.1", "12.1.1", "12.2.2", "12.3.2", "12.4.1", "12.5.1", "12.6.3", "12.8.1", "12.9.1", "13.0.2"]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
        # Windows: We install Cuda on the agent (slow)
      - uses: Jimver/cuda-toolkit@6008063726ffe3309d1b22e413d9e88fed91a2f2 # v0.2.29
        if: startsWith(matrix.os, 'windows')
        id: cuda-toolkit
        with:
          cuda: ${{ matrix.cuda_version }}
          method: "network"
          # The "crt", "nvvm", and "nvptxcompiler" components are added for CUDA 13.
          sub-packages: ${{ format('["nvcc"{0},"cudart","cublas","thrust","cublas_dev"]', startsWith(matrix.cuda_version, '13.') && ',"crt","nvvm","nvptxcompiler"' || '') }}
          use-github-cache: false
          use-local-cache: false
          log-file-suffix: ${{matrix.os}}-${{matrix.cuda_version}}.txt
      - name: Setup MSVC
        if: startsWith(matrix.os, 'windows')
        uses: ilammy/msvc-dev-cmd@v1.13.0 # to use cl
      - name: Build C++
        run: bash .github/scripts/build-cuda.sh
        env:
          build_os: ${{ matrix.os }}
          build_arch: ${{ matrix.arch }}
          cuda_version: ${{ matrix.cuda_version }}
      - name: Upload build artifact
        uses: actions/upload-artifact@v4
        with:
          name: shared_library_cuda_${{ matrix.os }}_${{ matrix.arch }}_${{ matrix.cuda_version }}
          path: output/*
          retention-days: 7

  build-xpu:
    strategy:
      matrix:
        os: [ubuntu-22.04, windows-2025]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - name: Build C++ (Linux)
        if: runner.os == 'Linux'
        run: bash .github/scripts/build-xpu.sh
        env:
          build_os: ${{ matrix.os }}
      - name: Build C++ (Windows)
        if: runner.os == 'Windows'
        run: .github/scripts/build-xpu-windows.bat
        shell: cmd
        env:
          build_os: ${{ matrix.os }}
      - name: Upload build artifact
        uses: actions/upload-artifact@v4
        with:
          name: shared_library_xpu_${{ matrix.os }}_x86_64
          path: output/*
          retention-days: 7

  build-rocm:
    strategy:
      matrix:
        os: [ubuntu-22.04]
        arch: [x86_64]
        rocm_version: ["6.2.4", "6.3.4", "6.4.4", "7.0.2", "7.1", "7.2"]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - name: Clean up disk space
        run: |
          echo "Disk space before cleanup:"
          df -h

          # These are the biggest disk space hogs.
          sudo rm -rf \
            /opt/hostedtoolcache/CodeQL \
            /usr/lib/dotnet \
            /usr/lib/jvm \
            /usr/local/.ghcup \
            /usr/local/lib/android \
            /usr/share/swift

          echo "Disk space after cleanup:"
          df -h
      - name: Build C++
        run: bash .github/scripts/build-rocm.sh
        env:
          build_os: ${{ matrix.os }}
          build_arch: ${{ matrix.arch }}
          rocm_version: ${{ matrix.rocm_version }}
      - name: Upload build artifact
        uses: actions/upload-artifact@v4
        with:
          name: shared_library_rocm_${{ matrix.os }}_${{ matrix.arch }}_${{ matrix.rocm_version }}
          path: output/*
          retention-days: 7

  build-wheels:
    env:
      # Skip rebuilding the CPU library when building the wheels.
      BNB_SKIP_CMAKE: 1
    needs:
      - build-cpu
      - build-cuda
      - build-rocm
      - build-xpu
    strategy:
      matrix:
        os: [ubuntu-22.04, ubuntu-22.04-arm, windows-2025, macos-15]
        include:
          - os: ubuntu-22.04
            arch: x86_64
          - os: ubuntu-22.04-arm
            arch: aarch64
          - os: windows-2025
            arch: x86_64
          - os: macos-15
            arch: arm64
        # The specific Python version is irrelevant in this context as we are only packaging non-C extension
        # code. This ensures compatibility across Python versions, as compatibility is
        # dictated by the packaged code itself, not the Python version used for packaging.
        python-version: ["3.10"]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - name: Download build artifacts
        uses: actions/download-artifact@v4
        with:
          merge-multiple: true
          pattern: "shared_library*_${{ matrix.os }}_${{ matrix.arch }}*"
          path: output/
      - name: Copy correct platform shared library
        shell: bash
        run: |
          ls -lR output/
          cp output/${{ matrix.os }}/${{ matrix.arch }}/* bitsandbytes/
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: pip
      - run: pip install build wheel
      - run: python -m build .
      - name: Determine and Set Platform Tag, then Tag Wheel
        shell: bash
        run: |
          PLATFORM_TAG=$(python .github/scripts/set_platform_tag.py "${{ matrix.arch }}")
          echo "PLATFORM_TAG=$PLATFORM_TAG"
          wheel tags --remove --abi-tag=none --python-tag=py3 --platform-tag=$PLATFORM_TAG dist/bitsandbytes-*.whl
      - name: Upload build artifact
        uses: actions/upload-artifact@v4
        with:
          name: bdist_wheel_${{ matrix.os }}_${{ matrix.arch }}
          path: dist/bitsandbytes-*.whl
          retention-days: 7

  upload-pre-release-wheels:
    name: Create release and upload artifacts
    runs-on: ubuntu-latest
    if: github.ref_name == 'main'
    permissions:
      contents: write
    needs:
      - build-wheels
    steps:
      - name: Download and rename artifacts
        uses: actions/download-artifact@v4
        with:
          path: tmp/
          pattern: "bdist_wheel_*"
          merge-multiple: true

      - name: Inspect tmp directory after downloading artifacts

        run: |
          ls -alFR tmp/
          WHEEL_COUNT=$(find tmp/ -type f -name "*.whl" | wc -l)
          echo "Found $WHEEL_COUNT wheel files"
          if [ "$WHEEL_COUNT" -eq 0 ]; then
            echo "::error::No wheel files found in tmp directory! Cannot proceed with release."
            exit 1
          fi

      - name: Move and rename wheel files with pattern replacement
        run: |
          mkdir -p wheels/

          # The whole point of the continuous release is to have a stable download link and the only way to have a PEP 440–compliant wheel name
          # is to use a stable placeholder version. Otherwise, pip won't let you install the wheel. The cool thing is that we can now install the
          # wheel directly from the GH pre-release which gets updated continuously, e.g.
          # `pip install https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl`
          STABLE_PLACEHOLDER_VERSION="1.33.7.preview"

          find tmp/ -type f -name '*.whl' -print0 | while IFS= read -r -d '' wheel; do
            wheel_filename=$(basename "$wheel")

            # Strip off the original version
            rest=${wheel_filename#bitsandbytes-*-}
            new_name="bitsandbytes-${STABLE_PLACEHOLDER_VERSION}-${rest}"

            echo "Renaming $wheel_filename → $new_name"
            mv "$wheel" "wheels/${new_name}"
          done

      - name: Inspect wheels directory after renaming files
        run: ls -alFR wheels/

      - uses: actions/checkout@v4
        with:
          path: repo

      - name: Delete old pre-release (if exists)
        run: |
          cd repo && gh release delete continuous-release_main --cleanup-tag -y
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Ensure tag exists
        run: |
          cd repo
          git tag -f continuous-release_main
          git push -f origin continuous-release_main
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Generate pip install commands for release body
        run: |
          cat > body.md << 'ENDOFMARKDOWN'
          ## Latest `main` pre-release wheel

          This pre-release contains the latest development wheels for all supported platforms, rebuilt automatically on every commit to the `main` branch.

          **How to install:**
          Pick the correct command for your platform and run it in your terminal:

          ENDOFMARKDOWN

          for whl in wheels/*.whl; do
            fname=$(basename "$whl")
            url="https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/$fname"

            if [[ "$fname" == *"manylinux_2_24_x86_64"* ]]; then
              echo "### Linux (x86_64)" >> body.md
            elif [[ "$fname" == *"manylinux_2_24_aarch64"* ]]; then
              echo "### Linux (aarch64)" >> body.md
            elif [[ "$fname" == *"win_amd64"* ]]; then
              echo "### Windows (x86_64)" >> body.md
            elif [[ "$fname" == *"macosx"* ]]; then
              echo "### macOS 14+ (arm64)" >> body.md
            else
              echo "### Other platform" >> body.md
            fi

            echo "\`\`\`sh" >> body.md
            echo "pip install --force-reinstall $url" >> body.md
            echo "\`\`\`" >> body.md
            echo "" >> body.md
          done

          cat >> body.md << 'ENDOFMARKDOWN'
          > **Note:**
          > These wheels are updated automatically with every commit to `main` and become available as soon as the [python-package.yml](.github/workflows/python-package.yml) workflow finishes.

          The version number in the wheel file name is replaced with 1.33.7-preview to keep the download link stable; this does not affect the version that gets installed:
          ```
          > pip install https://.../bitsandbytes-1.33.7-preview-py3-none-manylinux_2_24_x86_64.whl
          Collecting bitsandbytes==1.33.7rc0
          ...
          Successfully installed bitsandbytes-0.49.0.dev0
          ```
          ENDOFMARKDOWN

          # for debugging:
          cat body.md

      - name: Create new pre-release and upload artifacts
        uses: softprops/action-gh-release@v2.2.1
        with:
          files: wheels/*.whl
          prerelease: true
          name: Latest `main` wheel
          body_path: body.md
          tag_name: continuous-release_main
          make_latest: false
          draft: false

  audit-wheels:
    needs: build-wheels
    strategy:
      matrix:
        os: [ubuntu-22.04, ubuntu-22.04-arm]
        include:
          - os: ubuntu-22.04
            arch: x86_64
          - os: ubuntu-22.04-arm
            arch: aarch64
    runs-on: ${{ matrix.os }}
    env:
      PIP_DISABLE_PIP_VERSION_CHECK: 1
    steps:
      - uses: actions/checkout@v4
      - name: Download wheel
        uses: actions/download-artifact@v4
        with:
          name: bdist_wheel_${{ matrix.os }}_${{ matrix.arch }}
          path: wheels/
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install auditwheel
      - run: python ./.github/scripts/auditwheel_show.py wheels/* | tee $GITHUB_STEP_SUMMARY

  publish-wheels:
    name: Publish wheels to PyPI
    needs: [build-wheels, audit-wheels]
    runs-on: ubuntu-latest
    if: |
      github.repository == 'bitsandbytes-foundation/bitsandbytes'
      && github.event_name == 'push' && startsWith(github.ref, 'refs/tags')
    environment:
      name: release
      url: https://pypi.org/p/bitsandbytes
    permissions:
      id-token: write
    steps:
      - name: Download distribution artifacts
        uses: actions/download-artifact@v4
        with:
          path: dist/
          pattern: "bdist_wheel_*"
          merge-multiple: true

      - name: Publish to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          print-hash: true


================================================
FILE: .github/workflows/stale.yml.disabled
================================================
name: Stale Bot

on:
  schedule:
    - cron: "0 15 * * *"

jobs:
  close_stale_issues:
    name: Close Stale Issues
    if: github.repository == 'TimDettmers/bitsandbytes'
    runs-on: ubuntu-latest
    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    steps:
    - uses: actions/checkout@v3

    - name: Setup Python
      uses: actions/setup-python@v4
      with:
        python-version: 3.8

    - name: Install requirements
      run: |
        pip install PyGithub
    - name: Close stale issues
      run: |
        python scripts/stale.py


================================================
FILE: .github/workflows/test-runner.yml
================================================
name: Test Runner

on:
  workflow_call:
    inputs:
      platform:
        type: string
        required: true
        description: "Platform: linux-x64, linux-aarch64, windows, macos"
      backend:
        type: string
        required: true
        description: "Backend: cpu, cuda"
      torch_version:
        type: string
        required: true
        description: "PyTorch version to install"
      pypi_index:
        type: string
        default: "https://download.pytorch.org/whl/cpu"
        description: "PyPI index URL for torch installation"
      cuda_version:
        type: string
        default: ""
        description: "CUDA version (required for cuda backend)"
      gpu_type:
        type: string
        default: ""
        description: "GPU type for CUDA testing: T4, A10, L40S"
      # cpu_type currently only affects linux x64 CPU testing to select specific CPU architectures
      cpu_type:
        type: string
        default: ""
        description: "CPU architecture for testing: icelake, cascadelake (default: platform default runner)"

env:
  BNB_SKIP_CMAKE: 1

jobs:
  build:
    runs-on: >-
      ${{
        inputs.platform == 'linux-x64' && 'ubuntu-22.04' ||
        inputs.platform == 'linux-aarch64' && 'ubuntu-22.04-arm' ||
        inputs.platform == 'macos' && 'macos-15' ||
        'windows-2025'
      }}
    outputs:
      test_runner: ${{ steps.config.outputs.test_runner }}
      artifact_name: ${{ steps.config.outputs.artifact_name }}
      build_os: ${{ steps.config.outputs.build_os }}
      arch: ${{ steps.config.outputs.arch }}
    steps:
      - name: Configure test runner and paths
        id: config
        shell: bash
        run: |
          # Map platform to OS identifiers, architecture, and test runner
          case "${{ inputs.platform }}" in
            linux-x64)
              BUILD_OS="ubuntu-22.04"
              ARCH="x64"
              if [[ "${{ inputs.backend }}" == "cuda" ]]; then
                case "${{ inputs.gpu_type }}" in
                  T4)
                    TEST_RUNNER="bandb-aws-g4dn-4xlarge-plus-use1-public-80"
                    ;;
                  A10)
                    TEST_RUNNER="bandb-aws-g5-4xlarge-plus-use1-public-80"
                    ;;
                  L40S)
                    TEST_RUNNER="bandb-aws-g6e-4xlarge-plus-use1-public-80"
                    ;;
                  *)
                    echo "::error::Must specify gpu_type (T4, A10, L40S) for linux-x64 cuda backend"
                    exit 1
                    ;;
                esac
              else
                case "${{ inputs.cpu_type }}" in
                  icelake)
                    TEST_RUNNER="banb-aws-general-8-plus-use1-public-80"
                    ;;
                  cascadelake)
                    TEST_RUNNER="bandb-aws-g4dn-4xlarge-plus-use1-public-80"
                    ;;
                  "")
                    TEST_RUNNER="ubuntu-22.04"
                    ;;
                  *)
                    echo "::error::Invalid cpu_type: ${{ inputs.cpu_type }}"
                    exit 1
                    ;;
                esac
              fi
              ;;
            linux-aarch64)
              BUILD_OS="ubuntu-22.04-arm"
              ARCH="aarch64"
              TEST_RUNNER="ubuntu-22.04-arm"
              ;;
            macos)
              BUILD_OS="macos-15"
              ARCH="arm64"
              TEST_RUNNER="macos-15"
              ;;
            windows)
              BUILD_OS="windows-2025"
              ARCH="x64"
              if [[ "${{ inputs.backend }}" == "cuda" ]]; then
                TEST_RUNNER="CUDA-Windows-x64"
              else
                TEST_RUNNER="windows-2025"
              fi
              ;;
            *)
              echo "::error::Unsupported platform: ${{ inputs.platform }}"
              exit 1
              ;;
          esac

          # Create unique artifact name per configuration
          ARTIFACT="lib_${{ inputs.backend }}_${BUILD_OS}_${ARCH}"
          if [[ "${{ inputs.backend }}" == "cuda" ]]; then
            ARTIFACT="${ARTIFACT}_${{ inputs.cuda_version }}_${{ inputs.gpu_type }}"
          else
            ARTIFACT="${ARTIFACT}_${{ inputs.cpu_type }}"
          fi
          ARTIFACT="${ARTIFACT}_torch${{ inputs.torch_version }}_${{ github.run_id }}_${{ github.run_attempt }}"

          echo "test_runner=${TEST_RUNNER}" >> $GITHUB_OUTPUT
          echo "artifact_name=${ARTIFACT}" >> $GITHUB_OUTPUT
          echo "build_os=${BUILD_OS}" >> $GITHUB_OUTPUT
          echo "arch=${ARCH}" >> $GITHUB_OUTPUT

      - uses: actions/checkout@v4

      - name: Set build environment variables
        shell: bash
        run: |
          echo "build_os=${{ steps.config.outputs.build_os }}" >> $GITHUB_ENV
          echo "build_arch=${{ steps.config.outputs.arch }}" >> $GITHUB_ENV

      # Windows + CUDA: Install CUDA Toolkit
      - name: Install CUDA Toolkit
        if: inputs.backend == 'cuda' && inputs.platform == 'windows'
        uses: Jimver/cuda-toolkit@6008063726ffe3309d1b22e413d9e88fed91a2f2 # v0.2.29
        with:
          cuda: ${{ inputs.cuda_version }}
          method: "network"
          sub-packages: '["nvcc","cudart","cublas","thrust","nvrtc_dev","cublas_dev"]'
          use-github-cache: false

      # Windows: Setup MSVC (needed for both CPU and CUDA builds)
      - name: Setup MSVC
        if: inputs.platform == 'windows'
        uses: ilammy/msvc-dev-cmd@v1.13.0

      # Build CPU backend
      - name: Build C++
        if: inputs.backend == 'cpu'
        run: bash .github/scripts/build-cpu.sh

      # Build CUDA backend
      - name: Build C++ / CUDA
        if: inputs.backend == 'cuda'
        run: bash .github/scripts/build-cuda.sh
        env:
          cuda_version: ${{ inputs.cuda_version }}
          cuda_targets: "75;80;89"

      - name: Upload build artifact
        uses: actions/upload-artifact@v4
        with:
          name: ${{ steps.config.outputs.artifact_name }}
          path: output/${{ steps.config.outputs.build_os }}/${{ steps.config.outputs.arch }}/*
          retention-days: 7

  test:
    needs: build
    runs-on: ${{ needs.build.outputs.test_runner }}
    env:
      BNB_TEST_DEVICE: ${{ inputs.backend }}
    steps:
      # CUDA: Show GPU information
      - name: Show GPU Information
        if: inputs.backend == 'cuda'
        run: nvidia-smi

      - uses: actions/checkout@v4

      - name: Download build artifact
        uses: actions/download-artifact@v4
        with:
          name: ${{ needs.build.outputs.artifact_name }}
          path: bitsandbytes/
          merge-multiple: true

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      # Windows: Setup MSVC for torch.compile
      - name: Setup MSVC
        if: inputs.platform == 'windows'
        uses: ilammy/msvc-dev-cmd@v1.13.0

      - name: Install dependencies
        run: |
          pip install torch==${{ inputs.torch_version }} --index-url ${{ inputs.pypi_index }}
          pip install -e ".[test]" -v
          pip install pytest-cov

      # Windows: Downgrade NumPy for torch<2.4.1 compatibility
      # See: https://github.com/pytorch/pytorch/issues/131668
      - name: Downgrade NumPy
        if: inputs.platform == 'windows' && startsWith(inputs.torch_version, '2.3.')
        run: pip install "numpy<2"

      - name: Show installed packages
        run: pip list

      - name: Show environment information
        run: python -m torch.utils.collect_env

      - name: Run tests
        run: pytest --durations=100


================================================
FILE: .github/workflows/tests-nightly.yml
================================================
name: Nightly Tests

on:
  workflow_dispatch:
  schedule:
    # Every day at 02:15 AM UTC
    - cron: "15 2 * * *"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test-cpu:
    name: CPU
    if: github.repository == 'bitsandbytes-foundation/bitsandbytes'
    strategy:
      fail-fast: false
      matrix:
        platform: [linux-x64, linux-aarch64, macos, windows]
        # default runners don't have AVX-512 support, but icelake does
        cpu_type: ["", icelake]
        torch_version: ["2.3.1", "2.9.1", "2.10.0"]

        exclude:
          # aarch64 minimum torch version is 2.5.1
          - platform: linux-aarch64
            torch_version: "2.3.1"
          # icelake only applies to linux-x64
          - platform: linux-aarch64
            cpu_type: icelake
          - platform: macos
            cpu_type: icelake
          - platform: windows
            cpu_type: icelake

        include:
          # Add aarch64 with torch 2.5.1
          - platform: linux-aarch64
            cpu_type: ""
            torch_version: "2.5.1"

    uses: ./.github/workflows/test-runner.yml
    with:
      platform: ${{ matrix.platform }}
      backend: cpu
      torch_version: ${{ matrix.torch_version }}
      pypi_index: "https://download.pytorch.org/whl/cpu"
      cpu_type: ${{ matrix.cpu_type }}

  test-cuda:
    name: CUDA
    if: github.repository == 'bitsandbytes-foundation/bitsandbytes'
    strategy:
      fail-fast: false
      matrix:
        # Linux x64 cross-product
        platform: [linux-x64]
        gpu_type: [T4, A10, L40S]
        cuda_version: ["11.8.0", "12.6.3", "12.8.1", "13.0.2"]

        include:
          # Map CUDA version to torch version and PyPI index
          - cuda_version: "11.8.0"
            torch_version: "2.3.1"
            pypi_index: "https://download.pytorch.org/whl/cu118"
          - cuda_version: "12.6.3"
            torch_version: "2.8.0"
            pypi_index: "https://download.pytorch.org/whl/cu126"
          - cuda_version: "12.8.1"
            torch_version: "2.9.1"
            pypi_index: "https://download.pytorch.org/whl/cu128"
          - cuda_version: "13.0.2"
            torch_version: "2.10.0"
            pypi_index: "https://download.pytorch.org/whl/cu130"

          # Windows CUDA Tests - T4 GPU (CUDA 11.8 only, multiple torch versions)
          - platform: windows
            gpu_type: T4
            cuda_version: "11.8.0"
            torch_version: "2.3.1"
            pypi_index: "https://download.pytorch.org/whl/cu118"
          - platform: windows
            gpu_type: T4
            cuda_version: "11.8.0"
            torch_version: "2.6.0"
            pypi_index: "https://download.pytorch.org/whl/cu118"
          - platform: windows
            gpu_type: T4
            cuda_version: "11.8.0"
            torch_version: "2.7.1"  # Note: this is the last PyTorch release supporting CUDA 11.8.
            pypi_index: "https://download.pytorch.org/whl/cu118"

    uses: ./.github/workflows/test-runner.yml
    with:
      platform: ${{ matrix.platform }}
      backend: cuda
      cuda_version: ${{ matrix.cuda_version }}
      gpu_type: ${{ matrix.gpu_type }}
      torch_version: ${{ matrix.torch_version }}
      pypi_index: ${{ matrix.pypi_index }}


================================================
FILE: .github/workflows/tests-pr.yml
================================================
name: PR Tests

on:
  pull_request:
    types: [opened, synchronize, reopened]
    branches: [main]
    paths:
      - ".github/workflows/test-runner.yml"
      - ".github/workflows/tests-pr.yml"
      - ".github/scripts/build-cpu.sh"
      - ".github/scripts/build-cuda.sh"
      - "bitsandbytes/**"
      - "csrc/**"
      - "include/**"
      - "tests/**"
      - "CMakeLists.txt"
      - "setup.py"
      - "pyproject.toml"

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number }}
  cancel-in-progress: true

jobs:
  test-cpu:
    name: CPU
    if: github.repository == 'bitsandbytes-foundation/bitsandbytes'
    strategy:
      fail-fast: false
      matrix:
        platform: [linux-x64, linux-aarch64, macos]
        # default runners don't have AVX-512 support, but icelake does
        cpu_type: ["", icelake]
        torch_version: ["2.3.1", "2.10.0"]

        exclude:
          # aarch64 minimum torch version is 2.5.1
          - platform: linux-aarch64
            torch_version: "2.3.1"
          # icelake only applies to linux-x64
          - platform: linux-aarch64
            cpu_type: icelake
          - platform: macos
            cpu_type: icelake

        include:
          # Add aarch64 with torch 2.5.1 instead of 2.3.1
          - platform: linux-aarch64
            cpu_type: ""
            torch_version: "2.5.1"

    uses: ./.github/workflows/test-runner.yml
    with:
      platform: ${{ matrix.platform }}
      backend: cpu
      torch_version: ${{ matrix.torch_version }}
      pypi_index: "https://download.pytorch.org/whl/cpu"
      cpu_type: ${{ matrix.cpu_type }}

  test-cuda:
    name: CUDA
    if: github.repository == 'bitsandbytes-foundation/bitsandbytes'
    strategy:
      fail-fast: false
      matrix:
        platform: [linux-x64]
        gpu_type: [T4, A10, L40S]
        cuda_version: ["11.8.0", "12.8.1", "13.0.2"]

        include:
          # Map CUDA version to torch version and PyPI index
          - cuda_version: "11.8.0"
            torch_version: "2.3.1"
            pypi_index: "https://download.pytorch.org/whl/cu118"
          - cuda_version: "12.8.1"
            torch_version: "2.9.1"
            pypi_index: "https://download.pytorch.org/whl/cu128"
          - cuda_version: "13.0.2"
            torch_version: "2.10.0"
            pypi_index: "https://download.pytorch.org/whl/cu130"

          # Windows CUDA test - single configuration
          - platform: windows
            gpu_type: T4
            cuda_version: "11.8.0"
            torch_version: "2.7.1"
            pypi_index: "https://download.pytorch.org/whl/cu118"

    uses: ./.github/workflows/test-runner.yml
    with:
      platform: ${{ matrix.platform }}
      backend: cuda
      cuda_version: ${{ matrix.cuda_version }}
      gpu_type: ${{ matrix.gpu_type }}
      torch_version: ${{ matrix.torch_version }}
      pypi_index: ${{ matrix.pypi_index }}


================================================
FILE: .github/workflows/upload_pr_documentation.yml
================================================
name: Upload PR Documentation

on:
  workflow_run:
    workflows: ["Build PR Documentation"]
    types:
      - completed

permissions:
  contents: read
  pull-requests: write # Allows posting comments on pull requests

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
    with:
      package_name: bitsandbytes
    secrets:
      hf_token: ${{ secrets.HUGGINGFACE_PUSH }}
      comment_bot_token: ${{ secrets.GITHUB_TOKEN }}


================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
*.so
*.dll
*.dylib
*.o
*.obj
*.air
*.metallib

# CMake generated files
CMakeCache.txt
CMakeScripts/
cmake_install.cmake
Makefile
CMakeFiles/
*.sln
*.vcxproj*
*.xcodeproj/
bitsandbytes.dir/
Debug/
Release/
cmake-build-*/

# IDE local files
.vs/
.idea/

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# vim
*.swp

dependencies
cuda_build
output/
cuda-spec.md
cuda-spec-additions.md
agents/*_issues.json


================================================
FILE: .pre-commit-config.yaml
================================================
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.14.3
    hooks:
      - id: ruff
        args:
          - --fix
      - id: ruff-format
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: check-merge-conflict
      - id: check-yaml
      - id: end-of-file-fixer
      - id: fix-byte-order-marker
      - id: trailing-whitespace
      - id: mixed-line-ending
        args:
          - --fix=lf
        exclude: '\.bat$'
  - repo: https://github.com/crate-ci/typos
    rev: v1.26.0
    hooks:
      - id: typos
  - repo: https://github.com/pre-commit/mirrors-clang-format
    rev: v20.1.6
    hooks:
    - id: clang-format
      types_or: [c++, c, cuda]
      files: ^csrc/


================================================
FILE: .vscode/extensions.json
================================================
{
    "recommendations": [
        "ms-python.python",
        "charliermarsh.ruff",
        "twxs.cmake"
    ]
}


================================================
FILE: .vscode/settings.json
================================================
{
    "ruff.fixAll": true,
    "ruff.lint.run": "onType",
    "editor.codeActionsOnSave": {
        "source.fixAll": "always"
    }
}


================================================
FILE: CHANGELOG.md
================================================
### 0.45.1

#### Improvements:

* Compatibility for `triton>=3.2.0`
* Moved package configuration to `pyproject.toml`
* Build system: initial support for NVIDIA Blackwell B100 GPUs, RTX 50 Blackwell series GPUs and Jetson Thor Blackwell.
  * Note: Binaries built for these platforms are not included in this release. They will be included in future releases upon the availability of the upcoming CUDA Toolkit 12.7 and 12.8.

#### Bug Fixes:
* Packaging: wheels will no longer include unit tests. (#1478)

#### Dependencies:
* Sets the minimum PyTorch version to 2.0.0.

### 0.45.0

This is a significant release, bringing support for LLM.int8() to NVIDIA Hopper GPUs such as the H100.

As part of the compatibility enhancements, we've rebuilt much of the LLM.int8() code to simplify future compatibility and maintenance. We no longer use the col32 or architecture-specific tensor layout formats, and backwards compatibility is maintained. We also bring performance improvements targeted at inference scenarios.

#### Performance Improvements
This release includes broad performance improvements for a wide variety of inference scenarios. See [this X thread](https://x.com/Tim_Dettmers/status/1864706051171287069) for a detailed explanation.

#### Breaking Changes
🤗[PEFT](https://github.com/huggingface/peft) users wishing to merge adapters with 8-bit weights will need to upgrade to `peft>=0.14.0`.

#### Packaging Improvements
* The size of our wheel has been reduced by ~43.5% from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396MB to ~224MB.
* Binaries built with CUDA Toolkit 12.6.2 are now included in the PyPI distribution.
* The CUDA 12.5.0 build has been updated to CUDA Toolkit 12.5.1.


#### Deprecations
* A number of public API functions have been marked for deprecation and will emit `FutureWarning` when used. These functions will become unavailable in future releases. This should have minimal impact on most end-users.
* The k-bit quantization features are deprecated in favor of blockwise quantization. For all optimizers, using `block_wise=False` is not recommended and support will be removed in a future release.
* As part of the refactoring process, we've implemented many new 8bit operations. These operations no longer use specialized data layouts.
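
To find out whether your own code still calls a deprecated function, you can record warnings around the call and look for `FutureWarning`. The sketch below uses only the standard library; `legacy_quantize` is a hypothetical stand-in, not a real bitsandbytes function, and `uses_deprecated_api` is an illustrative helper name.

```python
import warnings


def legacy_quantize(x):
    # Hypothetical stand-in for a deprecated public function (not a real bitsandbytes API).
    warnings.warn(
        "legacy_quantize is deprecated and will be removed in a future release.",
        FutureWarning,
        stacklevel=2,
    )
    return x


def uses_deprecated_api(fn, *args, **kwargs):
    """Return True if calling fn emits at least one FutureWarning."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure no warning is suppressed by an earlier filter.
        warnings.simplefilter("always")
        fn(*args, **kwargs)
    return any(issubclass(w.category, FutureWarning) for w in caught)


if __name__ == "__main__":
    print(uses_deprecated_api(legacy_quantize, 1.0))
    print(uses_deprecated_api(abs, -1.0))
```

In a test suite, running with `python -W error::FutureWarning` is another option: it turns every such warning into an exception so that deprecated calls fail loudly.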

#### Full Changelog

* refine docs for multi-backend alpha release by @Titus-von-Koeller in #1380
* README: Replace special Unicode text symbols with regular characters by @akx in #1385
* Update CI tools & fix typos by @akx in #1386
* Fix invalid escape sequence warning in Python 3.12 by @oshiteku in #1420
* [Build] Add CUDA 12.6.2 build; update 12.5.0 to 12.5.1 by @matthewdouglas in #1431
* LLM.int8() Refactoring: Part 1 by @matthewdouglas in #1401

### 0.44.1

#### Bug fixes:
* Fix optimizer support for Python <= 3.9 by @matthewdouglas in #1379

### 0.44.0

#### New: AdEMAMix Optimizer
The [AdEMAMix](https://hf.co/papers/2409.03137) optimizer is a modification to AdamW which proposes tracking two EMAs to better leverage past gradients. This allows for faster convergence with less training data and improved resistance to forgetting.

We've implemented 8bit and paged variations: `AdEMAMix`, `AdEMAMix8bit`, `PagedAdEMAMix`, and `PagedAdEMAMix8bit`. These can be used with a similar API to existing optimizers.
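
As a rough illustration of the two-EMA idea, here is a minimal scalar sketch of a single update step as described in the AdEMAMix paper. It is not the bitsandbytes implementation: the function name, the `state` dictionary, and the default hyperparameters are illustrative assumptions, and the warmup schedulers for `alpha` and `beta3` are omitted.

```python
import math


def ademamix_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One AdEMAMix-style update for a single scalar parameter (illustrative only)."""
    state["t"] += 1
    t = state["t"]
    # Fast EMA of the gradient, as in Adam.
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad
    # Slow EMA of the gradient; this is the extra term that retains older gradients.
    state["m2"] = beta3 * state["m2"] + (1 - beta3) * grad
    # EMA of the squared gradient.
    state["nu"] = beta2 * state["nu"] + (1 - beta2) * grad * grad
    # Bias correction is applied to the fast EMA and the second moment only.
    m1_hat = state["m1"] / (1 - beta1 ** t)
    nu_hat = state["nu"] / (1 - beta2 ** t)
    update = (m1_hat + alpha * state["m2"]) / (math.sqrt(nu_hat) + eps)
    # Decoupled weight decay, as in AdamW.
    return theta - lr * (update + weight_decay * theta)


if __name__ == "__main__":
    state = {"t": 0, "m1": 0.0, "m2": 0.0, "nu": 0.0}
    theta = 1.0
    for _ in range(5):
        theta = ademamix_step(theta, 2.0 * theta, state)  # gradient of theta**2
    print(theta)
```

The slow EMA starts near zero and contributes little during the first steps, which is why the paper pairs it with warmup schedules in practice.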

#### Improvements:
* **8-bit Optimizers**: The block size for all 8-bit optimizers has been reduced from 2048 to 256 in this release. This departs from the original implementation proposed in [the paper](https://hf.co/papers/2110.02861) and improves accuracy.
* **CUDA Graphs support**: A fix to enable [CUDA Graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/) capture of kernel functions was made in #1330. This allows for performance improvements with inference frameworks like vLLM. Thanks @jeejeelee!
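
To see why a smaller block size can improve accuracy, consider a simplified blockwise quantizer in which every block is scaled by its own absolute maximum. The sketch below assumes a plain linear 8-bit code, which is a simplification and not the library's actual kernel or quantization map; all function names are illustrative. A single outlier only degrades the precision of the block it sits in, so smaller blocks limit the damage.

```python
def quantize_blockwise(values, block_size):
    """Quantize to integer codes in [-127, 127], one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid division by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size):
    """Map integer codes back to floats using each block's scale."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


def mean_abs_error(values, block_size):
    """Average round-trip error for a given block size."""
    codes, scales = quantize_blockwise(values, block_size)
    restored = dequantize_blockwise(codes, scales, block_size)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


if __name__ == "__main__":
    # Small values plus one large outlier at index 0.
    data = [0.01 * ((i % 7) - 3) for i in range(4096)]
    data[0] = 100.0
    print(mean_abs_error(data, 2048))
    print(mean_abs_error(data, 256))
```

With a block size of 2048, the outlier forces a coarse scale onto 2047 neighbours; with 256, only 255 neighbours are affected, at the cost of storing more scale values.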

#### Full Changelog:
* Embedding4bit and Embedding8bit implementation by @galqiwi in #1292
* Bugfix: Load correct nocublaslt library variant when BNB_CUDA_VERSION override is set by @matthewdouglas in #1318
* Enable certain CUDA kernels to accept specified cuda stream by @jeejeelee in #1330
* Initial support for ppc64le by @mgiessing in #1316
* Cuda source cleanup , refactor and fixes by @abhilash1910 in #1328
* Update for VS2022 17.11 compatibility with CUDA < 12.4 by @matthewdouglas in #1341
* Bump the minor-patch group with 3 updates by @dependabot in #1362
* Update matplotlib requirement from ~=3.9.1 to ~=3.9.2 in the major group by @dependabot in #1361
* docs: add internal reference to multi-backend guide by @Titus-von-Koeller in #1352
* Add move_to_device kwarg to the optimizer's load_state_dict by @koute in #1344
* Add AdEMAMix optimizer by @matthewdouglas in #1360
* Change 8bit optimizer blocksize 2048->256; additional bf16 support by @matthewdouglas in #1365

### 0.43.3

#### Improvements:

- FSDP: Enable loading prequantized weights with bf16/fp16/fp32 quant_storage
    - Background: This update, linked to [Transformers PR #32276](https://github.com/huggingface/transformers/pull/32276), allows loading prequantized weights with alternative storage formats. Metadata is tracked similarly to `Params4bit.__new__` post PR #970. It supports models exported with non-default `quant_storage`, such as [this NF4 model with BF16 storage](https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-BNB-NF4-BF16).
    - Special thanks to @winglian and @matthewdouglas for enabling FSDP+QLoRA finetuning of Llama 3.1 405B on a single 8xH100 or 8xA100 node with as little as 256GB system RAM.


### 0.43.2

This release is quite significant as the QLoRA bug fix has big implications for higher `seqlen` and batch sizes.

For each sequence (i.e. batch size increase of one) we expect memory savings of:
- 405B: 39GB for `seqlen=1024`, and 4888GB for `seqlen=128,000`
- 70B: 10.1GB for `seqlen=1024`, and 1258GB for `seqlen=128,000`

Activations are not needed for frozen parameters, yet memory for them was still being allocated because of the now-fixed bug.
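
As a rough consistency check on the figures above, the savings at the longer sequence length can be estimated from the `seqlen=1024` numbers. This sketch assumes that the savings grow linearly with sequence length and that the longer sequence is 128,000 tokens; both are assumptions, and `scale_savings` is an illustrative helper.

```python
def scale_savings(savings_gb, seqlen_from, seqlen_to):
    """Linearly rescale a per-sequence memory saving to another sequence length."""
    return savings_gb * seqlen_to / seqlen_from


# Figures quoted for seqlen=1024, rescaled to an assumed 128,000 tokens.
estimate_405b = scale_savings(39.0, 1024, 128_000)  # about 4875 GB
estimate_70b = scale_savings(10.1, 1024, 128_000)   # about 1262 GB

if __name__ == "__main__":
    print(estimate_405b, estimate_70b)
```

Both estimates land within about one percent of the quoted 4888GB and 1258GB, which is in line with per-token activation memory being the dominant term.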

#### Improvements:

- docs: FSDP+QLoRA and CPU install guide (#1211 #1227, thanks @stevhliu)
- Add CUDA 12.5 and update 12.4 builds (#1284)

#### Bug Fixes

- 4bit getstate and 8bit deepcopy (#1230 #1231, thanks @BenjaminBossan)
- missing optimizers in `str2optimizer32bit` (#1222, thanks @EtienneDosSantos)
- CUDA 12.5 build issue (#1273, thanks @HennerM)
- fix for min_8bit_size functionality in Optimizer base classes (#1286, thanks @Edenzzzz)
- QLoRA mem bug (#1270, thanks @Ther-nullptr)
- tests for cpu only platforms (#1259, thanks @galqiwi)
- restoration of quant_storage for CPU offloading (#1279)
- optim update error with non-contiguous grads/params (deepspeed) (#1187)

### 0.43.1

#### Improvements:

- Improved the serialization format for 8-bit weights; this change is fully backwards compatible. (#1164, thanks to @younesbelkada for the contributions and @akx for the review).
- Added CUDA 12.4 support to the Linux x86-64 build workflow, expanding the library's compatibility with the latest CUDA versions. (#1171, kudos to @matthewdouglas for this addition).
- Docs enhancement: Improved the instructions for installing the library from source. (#1149, special thanks to @stevhliu for the enhancements).

#### Bug Fixes

- Fix 4bit quantization with blocksize = 4096, where an illegal memory access was encountered. (#1160, thanks @matthewdouglas for fixing and @YLGH for reporting)

#### Internal Improvements:

- Tests: improve memory usage (#1147, thanks @matthewdouglas)
- Add CUDA 12.4 to docs/install helper (#1136, thanks @matthewdouglas)
- Minor type/doc fixes (#1128, thanks @akx)
- Reformat Python code with Ruff (#1081, thanks @akx)
- Rework of CUDA/native-library setup and diagnostics (#1041, thanks @akx)

### 0.43.0

#### Improvements and New Features:

- QLoRA + FSDP official support is now live! https://github.com/TimDettmers/bitsandbytes/pull/970 by @warner-benjamin and team - with FSDP you can train very large models (70b scale) on multiple 24GB consumer-type GPUs. See https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html for more details.
- Introduced improvements to the CI process for enhanced performance and efficiency during builds, specifically enabling more effective cross-compilation on Linux platforms. This was accomplished by deprecating Make and migrating to CMake, as well as implementing new corresponding workflows. Huge thanks go to @wkpark, @rickardp, @matthewdouglas and @younesbelkada; #1055, #1050, #1111.
- Windows should be officially supported in bitsandbytes if you install the library from source. See: https://huggingface.co/docs/bitsandbytes/main/en/index for more details
- Updated installation instructions to provide more comprehensive guidance for users. This includes clearer explanations and additional tips for various setup scenarios, making the library more accessible to a broader audience (@rickardp, #1047).
- Enhanced the library's compatibility and setup process, including fixes for CPU-only installations and improvements in CUDA setup error messaging. This effort aims to streamline the installation process and improve user experience across different platforms and setups (@wkpark, @akx, #1038, #996, #1012).
- Set up new documentation at https://huggingface.co/docs/bitsandbytes/main with extensive new sections and content to help users better understand and utilize the library. Especially notable are the new API docs. (big thanks to @stevhliu and @mishig25 from HuggingFace #1012). The API docs were also addressed in #1075.

#### Bug Fixes:

- Addressed a race condition in kEstimateQuantiles, enhancing the reliability of quantile estimation in concurrent environments (@pnunna93, #1061).
- Fixed various minor issues, including typos in code comments and documentation, to improve code clarity and prevent potential confusion (@Brian Vaughan, #1063).

#### Backwards Compatibility

- After upgrading from `v0.42` to `v0.43`, when using 4bit quantization, models may generate slightly different outputs (approximately up to the 2nd decimal place) due to a fix in the code. For anyone interested in the details, [see this comment](https://github.com/TimDettmers/bitsandbytes/discussions/1094#discussioncomment-8984069).

#### Internal and Build System Enhancements:

- Implemented several enhancements to the internal and build systems, including adjustments to the CI workflows, portability improvements, and build artifact management. These changes contribute to a more robust and flexible development process, ensuring the library's ongoing quality and maintainability (@rickardp, @akx, @wkpark, @matthewdouglas; #949, #1053, #1045, #1037).

#### Contributors:

This release is made possible thanks to the many active contributors that submitted PRs and many others who contributed to discussions, reviews, and testing. Your efforts greatly enhance the library's quality and user experience. It's truly inspiring to work with such a dedicated and competent group of volunteers and professionals!

We give a special thanks to @TimDettmers for managing to find a little bit of time for valuable consultations on critical topics, despite preparing for and touring the states applying for professor positions. We wish him the utmost success!

We also extend our gratitude to the broader community for your continued support, feedback, and engagement, which play a crucial role in driving the library's development forward.

### 0.42.0

Features:

- 4-bit serialization now supported. This enables 4-bit load/store. Thank you @poedator #753
- the bitsandbytes library now has a version attribute: `bitsandbytes.__version__` @rasbt #710
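
A minimal sketch of how the version information might be used to gate a feature. The helper names `installed_version` and `version_tuple` are hypothetical and not part of bitsandbytes. The lookup goes through `importlib.metadata` so the snippet also runs where the library is not installed; once the library is imported, `bitsandbytes.__version__` should carry the same string.

```python
import re
from importlib import metadata


def installed_version(package="bitsandbytes"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '0.42.0' or '0.43.0.dev0' into a comparable (major, minor, patch) tuple."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)  # pad short versions such as '0.41'
    return tuple(numbers)


found = installed_version()
if found is None:
    print("bitsandbytes is not installed")
elif version_tuple(found) >= (0, 42, 0):
    print(f"bitsandbytes {found}: 4-bit serialization should be available")
else:
    print(f"bitsandbytes {found}: predates 4-bit serialization")
```

Comparing tuples rather than raw strings avoids the trap where `"0.9.0" > "0.42.0"` holds lexicographically.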

Bug fixes:

- Fixed bugs in dynamic exponent data type creation. Thank you @RossM, @KohakuBlueleaf, @ArrowM #659 #227 #262 #152
- Fixed an issue where 4-bit serialization would fail for layers without double quantization #868. Thank you, @poedator
- Fixed an issue where calling .to() or .cuda() on a 4-bit layer twice would result in an error #867. Thank you, @jph00
- Fixed a bug where a missing access permission in a path searched for CUDA would lead to an error @osma #677
- Fixed a bug where the GOOGLE_VM_CONFIG_LOCK_FILE variable could cause errors in colab environments @akrentsel @xaptronic #715 #883 #622
- Fixed a bug where kgetColRowStats (LLM.int8()) would fail for certain dimensions @LucQueen #905
- Fixed a bug where the adjusted regular Embedding layer was not available via bnb.nn.Embedding @neel04 #563
- Added missing scipy requirement @dulalbert #525

### 0.41.3

Bug fixes:

- Fixed an issue where 4-bit serialization would fail for layers without double quantization #868. Thank you, @poedator
- Fixed an issue where calling .to() or .cuda() on a 4-bit layer twice would result in an error #867. Thank you, @jph00

### 0.41.2

Feature:

- 4-bit serialization now supported. This enables 4-bit load/store. Thank you @poedator #753

### 0.41.1

Bug fixes:

- Fixed bugs in dynamic exponent data type creation. Thank you @RossM, @KohakuBlueleaf, @ArrowM #659 #227 #262 #152

### 0.41.0

Features:

- Added precompiled CUDA 11.8 binaries to support H100 GPUs without compilation #571
- CUDA SETUP no longer looks for libcuda and libcudart and instead relies on the PyTorch CUDA libraries. To manually override this behavior see: how_to_use_nonpytorch_cuda.md. Thank you @rapsealk

Bug fixes:

- Fixed a bug where the default type of absmax was undefined, which led to errors if the default type was different from torch.float32. #553
- Fixed a missing scipy dependency in requirements.txt. #544
- Fixed a bug where a view operation could cause an error in 8-bit layers.
- Fixed a bug where CPU bitsandbytes would fail during the import. #593 Thank you @bilelomrani
- Fixed a bug where a non-existent LD_LIBRARY_PATH variable led to a failure in python -m bitsandbytes #588
- Removed outdated get_cuda_lib_handle calls that led to errors. #595 Thank you @ihsanturk
- Fixed a bug where read-permission was assumed for a file. #497
- Fixed a bug where prefetchAsync led to errors on GPUs that support unified memory but not prefetching (Maxwell, SM52). #470 #451 #453 #477 Thank you @jllllll and @stoperro

Documentation:

- Improved documentation for GPUs that do not support 8-bit matmul. #529
- Added description and pointers for the NF4 data type. #543

User experience:

- Improved handling of default compute_dtype for Linear4bit Layers, so that compute_dtype = input_dtype if the input data type is stable enough (float32, bfloat16, but not float16).
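
The rule above can be sketched as a small selection function. This is an illustration of the described behavior only, not the library's code: the function name `default_compute_dtype`, the use of dtype names as plain strings, the assumption that an explicitly configured value takes precedence, and the `None` return for the float16 case are all assumptions made for the example.

```python
# Input dtypes the changelog entry names as stable enough to compute in directly.
STABLE_INPUT_DTYPES = {"float32", "bfloat16"}


def default_compute_dtype(input_dtype, configured=None):
    """Pick a compute dtype for a 4-bit linear layer (illustrative only).

    Assumption: an explicitly configured value always wins. Otherwise the
    input dtype is reused when it is float32 or bfloat16. float16 is not
    treated as stable, so None is returned to signal "keep the layer's
    existing default".
    """
    if configured is not None:
        return configured
    if input_dtype in STABLE_INPUT_DTYPES:
        return input_dtype
    return None


for dtype in ("float32", "bfloat16", "float16"):
    print(dtype, "->", default_compute_dtype(dtype))
```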

Performance:

- improved 4-bit inference performance for A100 GPUs. This degraded performance for A40/RTX3090 and RTX 4090 GPUs slightly.

### 0.40.2

Bug fixes:

- Fixed a bug where a non-existent LD_LIBRARY_PATH variable led to a failure in python -m bitsandbytes #588
- Removed outdated get_cuda_lib_handle calls that led to errors. #595 Thank you @ihsanturk
- Fixed a bug where read-permission was assumed for a file. #497
- Fixed a bug where prefetchAsync led to errors on GPUs that support unified memory but not prefetching (Maxwell, SM52). #470 #451 #453 #477 Thank you @jllllll and @stoperro

### 0.40.1

Features:

- Added precompiled CUDA 11.8 binaries to support H100 GPUs without compilation #571
- CUDA SETUP no longer looks for libcuda and libcudart and instead relies on the PyTorch CUDA libraries. To manually override this behavior see: how_to_use_nonpytorch_cuda.md. Thank you @rapsealk

Bug fixes:

- Fixed a bug where the default type of absmax was undefined, which led to errors if the default type was different from torch.float32. #553
- Fixed a missing scipy dependency in requirements.txt. #544
- Fixed a bug where a view operation could cause an error in 8-bit layers.
- Fixed a bug where CPU bitsandbytes would fail during the import. #593 Thank you @bilelomrani

Documentation:

- Improved documentation for GPUs that do not support 8-bit matmul. #529
- Added description and pointers for the NF4 data type. #543

### 0.40.0

Features:

- Added 4-bit inference kernels for batch size=1. Currently supported are the NF4 and FP4 data types.
- Added support for quantizations of bfloat16 input data.

Bug fixes:

- Added `device` variable for bitsandbytes layers to be compatible with PyTorch layers.

Deprecated:

- Binaries for CUDA 11.2, 11.6 no longer ship with `pip install bitsandbytes` and need to be compiled from source.

### 0.39.0

Features:

- 4-bit matrix multiplication for Float4 and NormalFloat4 data types.
- Added 4-bit quantization routines
- Doubled quantization routines for 4-bit quantization
- Paged optimizers for Adam and Lion.
- bfloat16 gradient / weight support for Adam and Lion with 8 or 32-bit states.

Bug fixes:

- Fixed a bug where 8-bit models consumed twice as much memory as expected after serialization

Deprecated:

- Kepler binaries (GTX 700s and Tesla K40/K80) are no longer provided via pip and need to be compiled from source. Kepler support might be fully removed in the future.

### 0.38.1

Features:

- Added Int8 SwitchBack layers
- Added Fake FP8 layers for research purposes (available under `bnb.research.nn. ...`)

### 0.38.0

#### 8-bit Lion, Load/Store 8-bit Models directly from/to HF Hub

Features:

- Support for 32 and 8-bit Lion has been added. Thank you @lucidrains
- Support for serialization of Linear8bitLt layers (LLM.int8()). This allows storing and loading 8-bit weights directly from the HuggingFace Hub. Thank you @myrab
- New bug report feature: `python -m bitsandbytes` now gives extensive details for debugging CUDA setup failures.

Bug fixes:

- Fixed a bug where some bitsandbytes methods failed in a model-parallel setup on multiple GPUs. Thank you @tonylins
- Fixed a bug where cudart.so libraries could not be found in newer PyTorch releases.

Improvements:

- Improved the CUDA Setup procedure by doing a more extensive search for CUDA libraries

Deprecated:

- Devices with compute capability 3.0 (GTX 700s, K10) and 3.2 (Tegra K1, Jetson TK1) are now deprecated and support will be removed in 0.39.0.
- Support for CUDA 10.0 and 10.2 will be removed in bitsandbytes 0.39.0

### 0.37.0

#### Int8 Matmul + backward support for all GPUs

Features:

- Int8 MatmulLt now supports backward through inversion of the ColTuring/ColAmpere format. Slow, but memory efficient. Big thanks to @borzunov
- Int8 now supported on all GPUs. On devices with compute capability \< 7.5, the Int weights are cast to 16/32-bit for the matrix multiplication. Contributed by @borzunov
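
The capability threshold described above can be sketched as follows. The function name `int8_matmul_path` and the returned labels are hypothetical and only mirror the wording of this entry; the library's actual dispatch logic is not shown here.

```python
def int8_matmul_path(major, minor):
    """Describe which Int8 matmul path a device would take (illustrative only).

    The compute capability is compared as a (major, minor) tuple so that,
    for example, 7.10 would not be mistaken for 7.1 as it could be with floats.
    """
    if (major, minor) >= (7, 5):
        return "native int8 matmul"
    return "cast int8 weights to 16/32-bit, then matmul"


# A few example devices: compute capability 6.1, 7.0, 7.5 and 8.0.
for capability in [(6, 1), (7, 0), (7, 5), (8, 0)]:
    print(capability, "->", int8_matmul_path(*capability))
```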

Improvements:

- Improved logging for the CUDA detection mechanism.

### 0.36.0

#### Improvements, Ada/Hopper support, fake k-bit quantization.

Features:

- CUDA 11.8 and 12.0 support added
- support for Ada and Hopper GPUs added (compute capability 8.9 and 9.0)
- support for fake k-bit block-wise quantization for Int, Float, quantile quantization, and dynamic exponent data types added
- Added CUDA instruction generator to fix some installations.
- Added additional block sizes for quantization {64, 128, 256, 512, 1024}
- Added SRAM Quantile algorithm to quickly estimate less than 256 quantiles
- Added option to suppress the bitsandbytes welcome message (@Cyberes)

Regression:

- Compute capability 3.0 removed: the GTX 600 and 700 series are no longer supported (except GTX 780 and GTX 780 Ti)

Bug fixes:

- fixed a bug where too long directory names would crash the CUDA SETUP #35 (@tomaarsen)
- fixed a bug where CPU installations on Colab would run into an error  #34 (@tomaarsen)
- fixed an issue where the default CUDA version with fast-DreamBooth was not supported #52
- fixed a bug where the CUDA setup failed due to a wrong function call.
- fixed a bug in the CUDA Setup which led to an incomprehensible error if no GPU was detected.
- fixed a bug where the CUDA Setup failed when the CUDA runtime was found, but not the CUDA library.
- fixed a bug where not finding the cuda runtime led to an incomprehensible error.
- fixed a bug where with missing CUDA the default was an error instead of loading the CPU library
- fixed a bug where the CC version of the GPU was not detected appropriately (@BlackHC)
- fixed a bug in CPU quantization which led to errors when the input buffer exceeded 2^31 elements

Improvements:

- multiple improvements in formatting, removal of unused imports, and slight performance improvements (@tomaarsen)
- StableEmbedding layer now has device and dtype parameters to make it 1:1 replaceable with regular Embedding layers (@lostmsu)
- runtime performance of block-wise quantization slightly improved
- added error message for the case where multiple libcudart.so are installed and bitsandbytes picks the wrong one

### 0.35.4

Bug fixes:

- Fixed a bug where the CUDA Setup failed when the CUDA runtime was found, but not the CUDA library.
- Fixed a bug where not finding the cuda runtime led to an incomprehensible error.

### 0.35.3

Bug fixes:

- Fixed a bug in the CUDA Setup which led to an incomprehensible error if no GPU was detected.

### 0.35.2

Bug fixes:

- Fixed a bug where the CUDA setup failed due to a wrong function call.

### 0.35.1

Features:

- Added CUDA instruction generator to fix some installations.

Bug fixes:

- Fixed a problem where warning messages would be displayed even though everything worked correctly.

### 0.35.0

#### CUDA 11.8 support and bug fixes

Features:

- CUDA 11.8 support added and binaries added to the PyPI release.

Bug fixes:

- fixed a bug where too long directory names would crash the CUDA SETUP #35 (thank you @tomaarsen)
- fixed a bug where CPU installations on Colab would run into an error  #34 (thank you @tomaarsen)
- fixed an issue where the default CUDA version with fast-DreamBooth was not supported #52

### 0.34.0

#### Bug fixes and memory efficient backprop

Features:

- Linear8bitLt layer now supports `memory_efficient_backward=True` which enables backprop of gradients through frozen weights.

Bug fixes:

- fixed an issue where too many threads were created in blockwise quantization on the CPU for large tensors

### 0.33.0

#### Various bug fixes

Features:

- CPU quantization now supports a variable `blocksize` to enhance quantization speed or precision.

Bug fixes:

- fixed an issue in CPU quantization where tensors with more than 2^31 elements would fail 19a7adca7a6c9bf7061a384d7e9d9b13676a1a88
- fixed a bug where cpu binaries would fail if no GPU would be detected eab4d8232d558f2e6bd7f7cc3d00e2e6e94f4e80
- fixed an issue where cpu binaries cause additional stdout messages 92a3363096e10ad6a5c4e944af898bd1186d806a
- fixed an import of bnb.utils 2e630b55f51d454f3bd723dffda68a07ef93190c

We thank @mryab, @mbrukman, @chessgecko, @dbaranchuk for pull requests with bug fixes and new features.

### 0.32.0

#### 8-bit Inference Performance Enhancements

We added performance enhancements for small models. This makes small models about 2x faster for LLM.int8() inference.

Features:

- Int32 dequantization now supports fused biases.
- Linear8bitLt now uses a fused bias implementation.
- Change `.data.storage().data_ptr()` to `.data.data_ptr()` to enhance inference performance.

Bug fixes:

- Now throws an error if LLM.int8() is used on a GPU that is not supported.
- Enhances error messaging if CUDA SETUP fails.

### 0.31.0

#### 8-bit Inference and Packaging Update

Features:

- added direct outlier extraction. This enables outlier extraction without fp16 weights without performance degradation.
- Added automatic CUDA SETUP procedure and packaging all binaries into a single bitsandbytes package.

### 0.30.0

#### 8-bit Inference Update

Features:

- Added 8-bit matrix multiplication from cuBLAS and cuBLASLt as well as multiple GEMM kernels (GEMM, GEMMEx, GEMMLt)
- Added 8-bit Linear layers with 8-bit Params that perform memory efficient inference with an option for 8-bit mixed precision matrix decomposition for inference without performance degradation
- Added quantization methods for "fake" quantization, optimized kernels for vector-wise quantization and equalization, and optimized cuBLASLt transformations
- CPU only build now available (Thank you, @mryab)

Deprecated:

- Pre-compiled release for CUDA 9.2, 10.0, 10.2 no longer available

### 0.26.0:

Features:

- Added Adagrad (without grad clipping) as 32-bit and 8-bit block-wise optimizer.
- Added AdamW (copy of Adam with weight decay init 1e-2). #10
- Introduced ModuleConfig overrides which can seamlessly be used at initialization time of a module.
- Added `bnb.nn.Embedding` layer which runs at 32-bit but without the layernorm. This works well if you need to fine-tune pretrained models that do not have an embedding layer norm. #19

Bug fixes:

- Fixed a bug where weight decay was incorrectly applied to 32-bit Adam. #13
- Fixed an unsafe use of eval. #8
- Fixed a bug where the StableEmbedding layer 32-bit optimizer override would not work without registering the whole model first (`bnb.optim.GlobalOptimManager.get_instance().register_parameters(model.parameters())`).  #13 #15

Docs:

- Added instructions on how to solve "\_\_fatbinwrap\_" errors.

### 0.0.25:

Features:

- Added `skip_zeros` for block-wise and 32-bit optimizers. This ensures correct updates for sparse gradients and sparse models.
- Added support for Kepler GPUs. (#4)
- Added Analysis Adam to track 8-bit vs 32-bit quantization errors over time.
- Make compilation more user friendly.

Bug fixes:

- fixed "undefined symbol: \_\_fatbinwrap_38" error for P100 GPUs on CUDA 10.1 (#5)

Docs:

- Added docs with instructions to compile from source.

### 0.0.24:

- Fixed a bug where a float/half conversion led to a compilation error for CUDA 11.1 on Turing GPUs.
- removed Apex dependency for bnb LAMB

### 0.0.23:

Bugs:

- Unified quantization API: each quantization function now returns `Q, S` where `Q` is the quantized tensor and `S` the quantization state which may hold absolute max values, a quantization map or more. For dequantization all functions now accept the inputs `Q, S` so that `Q` is dequantized with the quantization state `S`.
- Fixed an issue where the CUDA 11.1 binary was not compiled with the right headers
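
To illustrate the `Q, S` convention described above, here is a small pure-Python sketch of block-wise absmax quantization. It is not the library's implementation: the function names, the dictionary layout of `S`, the tiny default block size, and the linear int8 code mapping are assumptions chosen to keep the example short.

```python
def quantize_blockwise(values, blocksize=4):
    """Quantize a flat list of floats to int8 codes, one absmax scale per block.

    Returns (Q, S): Q is the list of integer codes and S is the quantization
    state needed to undo the mapping.
    """
    Q, absmax = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        absmax.append(scale)
        Q.extend(round(v / scale * 127) for v in block)
    S = {"absmax": absmax, "blocksize": blocksize}
    return Q, S


def dequantize_blockwise(Q, S):
    """Reconstruct approximate floats from the codes Q and the state S."""
    blocksize = S["blocksize"]
    return [q / 127 * S["absmax"][i // blocksize] for i, q in enumerate(Q)]


values = [0.1, -0.5, 0.25, 1.0, 20.0, -10.0, 5.0, 0.0]
Q, S = quantize_blockwise(values)
restored = dequantize_blockwise(Q, S)
print(S["absmax"])
print(max(abs(a - b) for a, b in zip(values, restored)))
```

Because each block carries its own scale, the large values in the second block do not reduce the precision available to the small values in the first block.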

API changes:

- Block-wise quantization for optimizers now enabled by default

Features:

- Block-wise quantization routines now support CPU Tensors.

### 0.0.22:

- Fixed an error where a `reset_parameters()` call on the `StableEmbedding` would lead to an error in older PyTorch versions (from 1.7.0).

### 0.0.21

- Ampere, RTX 30 series GPUs now compatible with the library.


================================================
FILE: CLAUDE.md
================================================
# MANDATORY: Use git worktrees for all branch work

NEVER work on a fix or feature branch inside the main `~/git/bitsandbytes` checkout. Always create a worktree first:

```bash
cd ~/git/bitsandbytes
git worktree add ~/git/bnb-fix-<NUMBER> -b fix/issue-<NUMBER>
cd ~/git/bnb-fix-<NUMBER>
```

This keeps the main checkout clean and allows parallel sessions. If you are already inside a worktree directory, you do not need to create another one.

**Before creating a worktree**, check the worktree registry for existing ones — see the Git Worktrees section in `~/.claude/CLAUDE.md`. Bitsandbytes-specific naming conventions: `agents/worktree_guide.md`. General worktree guide: `~/git/lab_tools/worktree_guide.md`.

# MANDATORY: Check for existing PRs before starting work

Before working on any issue, check whether a PR already exists:

```bash
gh pr list --search "issue-number OR keyword" --state open
```

If a PR exists, review and build on it instead of starting from scratch. Do not create duplicate work.

# MANDATORY: Run linting before every pull request

Before pushing a PR branch, you MUST run the full pre-commit suite. CI will reject PRs that fail any check:

```bash
pre-commit run --all-files
```

This runs ruff, ruff format, typos, trailing-whitespace, clang-format, and all other CI lint hooks. Review and commit any changes it makes. Do NOT run only `ruff check` and `ruff format` — those are just 2 of 10 hooks. Full details: `agents/linting_guide.md`

# Testing: only run relevant tests

Do NOT run the full test suite — it takes 10+ minutes. Instead, run only the tests that cover the code you changed:

```bash
pytest tests/test_relevant_file.py -v --tb=short -k "relevant_test_name"
```

The full suite will be run separately. Best practices and known issues: `agents/testing_guide.md`

# Agent Dispatch (the "Dispatcher" role)

To triage open GitHub issues, generate prompt files, and launch parallel worker agents, read `agents/dispatch_guide.md`. If told "you're the Dispatcher" or "please read the Dispatch Guide," that's what this refers to. The dispatch workflow uses the GitHub issue tools in `agents/` — see `agents/github_tools_guide.md` for the bitsandbytes-specific reference.

# Issue maintenance and triage

To identify and close stale, duplicate, or resolved issues: `agents/issue_maintenance_guide.md`. Common closeable patterns (old CUDA setup, Windows pre-support, third-party app issues, etc.) are cataloged in `agents/issue_patterns.md`.

# Pull request review

When tasked with reviewing a pull request, you MUST read these guides before starting the review:

1. `agents/pr_review_guide.md` — The complete review workflow (classification, checklists, verdict format, and posting instructions). This is the primary guide; follow its steps sequentially.
2. `agents/architecture_guide.md` — Codebase architecture and patterns
3. `agents/code_standards.md` — Code quality expectations
4. `agents/api_surface.md` — Public API catalog (for detecting breaking changes)
5. `agents/downstream_integrations.md` — How Transformers, PEFT, Accelerate, TGI, and vLLM depend on bitsandbytes (for assessing downstream impact)
6. `agents/security_guide.md` — Trust model and security checklist (especially for external contributor PRs)

For CUDA kernel changes, also read `agents/kbit_gemm_context.md`. The PR review guide references all of these at the appropriate steps.


================================================
FILE: CMakeLists.txt
================================================
# This CMake config hopefully makes it easier to compile.
# Ensure the CUDA Toolkit is available on your path. Then run:
#   For  GCC: `cmake -B build . && cmake --build build`
#   For MSVC: `cmake -B build . && cmake --build build --config Release`
# You can also use the following options and variables
#  - COMPUTE_BACKEND: Set to `cpu`, `cuda`, `hip`, `mps`, or `xpu` to select the backend
#  - CUDA_VERSION: The expected CUDA version, for sanity checking. The actual version
#                  is whatever CMake finds on your path.
#  - COMPUTE_CAPABILITY: Which GPU Arch/Compute codes to provide to NVCC.
#                        Separate by semicolons, e.g. `-DCOMPUTE_CAPABILITY="89;90;100;120"`
#                        Check your compute capability here: https://developer.nvidia.com/cuda-gpus
#  - PTXAS_VERBOSE: Pass the `-v` option to the PTX Assembler
#  - ROCM_VERSION: Override the ROCm version shortcode used in the output library name.
#                  Useful when PyTorch was built against a different ROCm version than the
#                  system install. For example, `-DROCM_VERSION=70` produces
#                  libbitsandbytes_rocm70.so even if the system has ROCm 7.2.
cmake_minimum_required(VERSION 3.22.1)

# On Windows with HIP backend, auto-detect compilers from ROCM_PATH before project()
if(WIN32 AND COMPUTE_BACKEND STREQUAL "hip")
    if(DEFINED ENV{ROCM_PATH})
        set(ROCM_PATH $ENV{ROCM_PATH})
    endif()
    if(ROCM_PATH AND NOT DEFINED CMAKE_CXX_COMPILER)
        set(CMAKE_CXX_COMPILER "${ROCM_PATH}/lib/llvm/bin/clang++.exe")
    endif()
    if(ROCM_PATH AND NOT DEFINED CMAKE_HIP_COMPILER)
        set(CMAKE_HIP_COMPILER "${ROCM_PATH}/lib/llvm/bin/clang++.exe")
    endif()
    # On Windows, the HIP compiler needs explicit paths to find device libraries.
    if(ROCM_PATH)
        find_path(ROCM_DEVICE_LIB_PATH
            NAMES oclc_abi_version_400.bc ocml.bc
            PATHS "${ROCM_PATH}/amdgcn/bitcode"
                  "${ROCM_PATH}/lib/llvm/amdgcn/bitcode"
            NO_DEFAULT_PATH
        )
        set(CMAKE_HIP_FLAGS "--rocm-path=${ROCM_PATH}")
        if(ROCM_DEVICE_LIB_PATH)
            set(CMAKE_HIP_FLAGS "${CMAKE_HIP_FLAGS} --rocm-device-lib-path=${ROCM_DEVICE_LIB_PATH}")
        endif()
    endif()
endif()

project(bitsandbytes LANGUAGES CXX)

# If run without specifying a build type, default to the Release configuration.
#    Release optimizes the generated binaries for performance and also adds the `-DNDEBUG` flag,
#    which turns off a bunch of asserts that seem to link to new symbols in libstdc++,
#    worsening our manylinux compliance.
if(NOT CMAKE_BUILD_TYPE)
    set(CMAKE_BUILD_TYPE Release)
endif()

# Define included source files
set(CPP_FILES csrc/cpu_ops.cpp csrc/pythonInterface.cpp)
set(GPU_FILES csrc/ops.cu csrc/kernels.cu)
set(MPS_FILES csrc/mps_ops.mm)
set(METAL_FILES csrc/mps_kernels.metal)
set(XPU_FILES csrc/xpu_ops.cpp csrc/xpu_kernels.cpp)
# C++ sources are always included
list(APPEND SRC_FILES ${CPP_FILES})

set(COMPUTE_BACKEND "cpu" CACHE STRING "The compute backend to use (cpu, cuda, hip, mps, xpu)")
set_property(CACHE COMPUTE_BACKEND PROPERTY STRINGS cpu cuda hip mps xpu)
option(PTXAS_VERBOSE "Pass through -v flag to PTX Assembler" OFF)

if(APPLE)
  set(CMAKE_OSX_DEPLOYMENT_TARGET 14.0)
endif()

set(BNB_OUTPUT_NAME "bitsandbytes")

message(STATUS "Configuring ${PROJECT_NAME} (Backend: ${COMPUTE_BACKEND})")

if(${COMPUTE_BACKEND} STREQUAL "cuda")
    if(APPLE)
        message(FATAL_ERROR "CUDA is not supported on macOS" )
    endif()
    set(BUILD_CUDA ON)
    set(BUILD_HIP OFF)
    set(BUILD_MPS OFF)
elseif(${COMPUTE_BACKEND} STREQUAL "hip")
    if(APPLE)
        message(FATAL_ERROR "HIP is not supported on macOS" )
    endif()
    set(BUILD_CUDA OFF)
    set(BUILD_HIP ON)
    set(BUILD_MPS OFF)
elseif(${COMPUTE_BACKEND} STREQUAL "mps")
    if(NOT APPLE)
        message(FATAL_ERROR "MPS is only supported on macOS" )
    endif()
    set(BUILD_CUDA OFF)
    set(BUILD_HIP OFF)
    set(BUILD_MPS ON)
elseif(${COMPUTE_BACKEND} STREQUAL "xpu")
    if(APPLE)
        message(FATAL_ERROR "XPU is not supported on macOS" )
    endif()
    set(BUILD_CUDA OFF)
    set(BUILD_HIP OFF)
    set(BUILD_MPS OFF)
    set(BUILD_XPU ON)
else()
    set(BUILD_CUDA OFF)
    set(BUILD_HIP OFF)
    set(BUILD_MPS OFF)
    set(BUILD_XPU OFF)
    set(BUILD_CPU ON)
endif()


if (BUILD_CPU)
    set(CMAKE_CXX_STANDARD 17)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)
    string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" HOST_ARCH)
    find_package(OpenMP)
endif()

if(BUILD_CUDA)
    # NVCC normally will only work with MSVC up to 1939. VS2022 17.10+ starts using versions 1940+.
    # Workaround: use --allow-unsupported-compiler
    # This needs to be added *before* we try to enable the CUDA language so CMake's compiler check passes.
    if(MSVC AND MSVC_VERSION VERSION_GREATER_EQUAL 1940)
        string(APPEND CMAKE_CUDA_FLAGS " --allow-unsupported-compiler")

        # This is needed to build with VS2022 17.11+ and CUDA < 12.4.
        if (MSVC_VERSION VERSION_GREATER_EQUAL 1941)
            string(APPEND CMAKE_CUDA_FLAGS " -D_ALLOW_COMPILER_AND_STL_VERSION_MISMATCH")
        endif()
    endif()

    enable_language(CUDA) # This will fail if CUDA is not found
    find_package(CUDAToolkit REQUIRED)

    # Convert the CUDA version from X.Y.z to XY. There's probably a shorter way of doing this
    string(REGEX MATCH "^[0-9]+\\.[0-9]+" _CUDA_VERSION_FIRST_TWO "${CMAKE_CUDA_COMPILER_VERSION}")
    string(REPLACE "." "" CUDA_VERSION_SHORT "${_CUDA_VERSION_FIRST_TWO}")

    # Expose a cache variable that the user can set to ensure the correct version of CUDA is found
    set(CUDA_VERSION "${CUDA_VERSION_SHORT}" CACHE STRING "Expected CUDA Version Shortcode")

    message(STATUS "CUDA Version: ${CUDA_VERSION_SHORT} (${CMAKE_CUDA_COMPILER_VERSION})")
    message(STATUS "CUDA Compiler: ${CMAKE_CUDA_COMPILER}")

    # It should match the discovered version
    if(NOT CUDA_VERSION STREQUAL "${CUDA_VERSION_SHORT}")
        message(FATAL_ERROR "You've specified CUDA version ${CUDA_VERSION} however the CUDA compiler found is ${CUDA_VERSION_SHORT}."
            " Ensure the desired CUDA compiler is the first one available on your PATH."
        )
    endif()

    if(CMAKE_CUDA_COMPILER_VERSION VERSION_LESS "11.8")
        message(FATAL_ERROR "CUDA Version < 11.8 is not supported")
    elseif(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "14.0")
        message(FATAL_ERROR "CUDA Version > 13 is not supported")
    endif()

    # CMake < 3.23.0 does not define CMAKE_CUDA_ARCHITECTURES_ALL.
    if(CMAKE_VERSION VERSION_LESS "3.23.0")
        message(STATUS "CMake < 3.23.0; determining CUDA architectures supported...")

        if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "13.0")
            # Starting in CUDA 13.0, Thor Blackwell is renamed to SM110.
            # Support for architectures older than Turing (SM75) is removed.
            list(APPEND CMAKE_CUDA_ARCHITECTURES_ALL 75 80 86 87 88 89 90 100 103 110 120 121)
            list(APPEND CMAKE_CUDA_ARCHITECTURES_ALL_MAJOR 80 90 100 110 120)
        else()
            # 11.8-12.9 supports these at a minimum.
            set(CMAKE_CUDA_ARCHITECTURES_ALL 50 52 53 60 61 62 70 72 75 80 86 87 89 90)
            set(CMAKE_CUDA_ARCHITECTURES_ALL_MAJOR 50 60 70 80 90)

            # CUDA 12.8 adds support for Blackwell.
            if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.8")
                list(APPEND CMAKE_CUDA_ARCHITECTURES_ALL 100 101 120 121)
                list(APPEND CMAKE_CUDA_ARCHITECTURES_ALL_MAJOR 100 120)
            endif()

            # CUDA 12.9 adds SM103 (Blackwell B300).
            if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.9")
                list(APPEND CMAKE_CUDA_ARCHITECTURES_ALL 103)
            endif()
        endif()
    endif()

    string(APPEND CMAKE_CUDA_FLAGS " --use_fast_math")

    # It's safe for us to enable more aggressive compression for 13.0+
    if (CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "13.0")
        string(APPEND CMAKE_CUDA_FLAGS " --compress-mode=size")
    endif()

    if(PTXAS_VERBOSE)
        string(APPEND CMAKE_CUDA_FLAGS " -Xptxas=-v")
    endif()

    foreach(capability ${CMAKE_CUDA_ARCHITECTURES_ALL})
        # Most of the items here are like: `xx-real`, so we just extract the `xx` portion
        string(REGEX MATCH "[0-9]+" capability_id "${capability}")
        if(capability_id GREATER 0)
            list(APPEND POSSIBLE_CAPABILITIES ${capability_id})
        endif()
    endforeach()

    # This can be changed via -D argument to CMake
    # By default all possible capabilities are compiled
    set(COMPUTE_CAPABILITY "${POSSIBLE_CAPABILITIES}" CACHE STRING "Compute Capabilities Targeted")

    message(STATUS "CUDA Capabilities Available: ${POSSIBLE_CAPABILITIES}")
    message(STATUS "CUDA Capabilities  Selected: ${COMPUTE_CAPABILITY}")

    # Use the "real" option to build native cubin for all selections.
    # Ensure we build the PTX for the latest version.
    # This behavior of adding a PTX (virtual) target for the highest architecture
    # is similar to how the "all" and "all-major" options would behave in CMake >= 3.23.
    # TODO: Consider bumping CMake requirement and using CMAKE_CUDA_ARCHITECTURES=[all | native] by default
    list(REMOVE_DUPLICATES COMPUTE_CAPABILITY)
    list(SORT COMPUTE_CAPABILITY COMPARE NATURAL)
    list(POP_BACK COMPUTE_CAPABILITY _LATEST_CAPABILITY)
    list(TRANSFORM COMPUTE_CAPABILITY APPEND "-real" OUTPUT_VARIABLE CMAKE_CUDA_ARCHITECTURES)
    list(APPEND CMAKE_CUDA_ARCHITECTURES ${_LATEST_CAPABILITY})

    message(STATUS "CUDA Targets: ${CMAKE_CUDA_ARCHITECTURES}")
    message(STATUS "CUDA NVCC Flags: ${CMAKE_CUDA_FLAGS}")

    list(APPEND SRC_FILES ${GPU_FILES})

    string(APPEND BNB_OUTPUT_NAME "_cuda${CUDA_VERSION_SHORT}")
    add_compile_definitions(BUILD_CUDA)
elseif(BUILD_HIP)
    # Set target architectures before enable_language(HIP), which would otherwise
    # auto-detect a single GPU and override the defaults.
    if(DEFINED BNB_ROCM_ARCH)
      set(CMAKE_HIP_ARCHITECTURES ${BNB_ROCM_ARCH})
    elseif(AMDGPU_TARGETS AND NOT CMAKE_HIP_ARCHITECTURES)
      set(CMAKE_HIP_ARCHITECTURES ${AMDGPU_TARGETS})
    elseif(NOT CMAKE_HIP_ARCHITECTURES)
      set(CMAKE_HIP_ARCHITECTURES "gfx90a;gfx942;gfx1100;gfx1101;gfx1102;gfx1103;gfx1150;gfx1151;gfx1152;gfx1153;gfx1200;gfx1201")
    endif()

    enable_language(HIP)
    message(STATUS "HIP Compiler: ${CMAKE_HIP_COMPILER}")
    message(STATUS "HIP Targets: ${CMAKE_HIP_ARCHITECTURES}")

    list(APPEND SRC_FILES ${GPU_FILES})

    string(APPEND BNB_OUTPUT_NAME "_rocm")

    # get hip version
    execute_process(COMMAND hipconfig --version OUTPUT_VARIABLE HIP_CONFIG_VERSION)
    string(REGEX MATCH "[0-9]+\\.[0-9]+" HIP_VERSION "${HIP_CONFIG_VERSION}")
    string(REPLACE "." "" HIP_VERSION_SHORT "${HIP_VERSION}")

    # Expose a cache variable that the user can set to override the ROCm version in the library name
    set(ROCM_VERSION "${HIP_VERSION_SHORT}" CACHE STRING "Expected ROCm Version Shortcode")

    message(STATUS "ROCm Version: ${HIP_VERSION_SHORT} (from hipconfig)")
    if(NOT ROCM_VERSION STREQUAL "${HIP_VERSION_SHORT}")
        message(WARNING "Overriding ROCm version in library name: ${HIP_VERSION_SHORT} -> ${ROCM_VERSION}")
    endif()

    string(APPEND BNB_OUTPUT_NAME "${ROCM_VERSION}")
    add_compile_definitions(__HIP_PLATFORM_AMD__)
    add_compile_definitions(__HIP_PLATFORM_HCC__)
    add_compile_definitions(BUILD_HIP)
elseif(BUILD_MPS)
    if(NOT APPLE)
        message(FATAL_ERROR "MPS is only supported on macOS" )
    endif()

    enable_language(OBJCXX)

    list(APPEND SRC_FILES ${MPS_FILES})

    string(APPEND BNB_OUTPUT_NAME "_mps")
    add_compile_definitions(BUILD_MPS)
    file(MAKE_DIRECTORY "build")
    add_custom_command(OUTPUT "bitsandbytes/bitsandbytes.metallib"
                COMMAND xcrun metal -c -o "build/bitsandbytes.air" ${METAL_FILES}
                COMMAND xcrun metallib "build/bitsandbytes.air" -o "bitsandbytes/bitsandbytes.metallib"
                DEPENDS "${METAL_FILES}"
                COMMENT "Compiling Metal kernels"
                VERBATIM)
    add_custom_target(metallib DEPENDS "bitsandbytes/bitsandbytes.metallib")
elseif(BUILD_XPU)
    list(APPEND SRC_FILES ${XPU_FILES})
    string(APPEND BNB_OUTPUT_NAME "_xpu")
    add_compile_definitions(BUILD_XPU)
    set(CMAKE_C_COMPILER icx)
    set(CMAKE_CXX_COMPILER icpx)
    if(WIN32)
        set(CMAKE_CXX_COMPILER icx)
    endif()
else()
    string(APPEND BNB_OUTPUT_NAME "_cpu")
    set(GPU_SOURCES)
endif()


if(WIN32)
    # Export all symbols
    set(CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS ON)
    # Prevent Windows SDK min/max macros from conflicting with std::min/std::max
    add_compile_definitions(NOMINMAX)
endif()

if(MSVC)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX2 /fp:fast")
endif()

set_source_files_properties(${CPP_FILES} PROPERTIES LANGUAGE CXX)
add_library(bitsandbytes SHARED ${SRC_FILES})
target_compile_features(bitsandbytes PUBLIC cxx_std_17)
target_include_directories(bitsandbytes PUBLIC csrc)

if (BUILD_CPU)
    if (OpenMP_CXX_FOUND)
        target_link_libraries(bitsandbytes PRIVATE OpenMP::OpenMP_CXX)
        add_definitions(-DHAS_OPENMP)
    endif()

    if ((HOST_ARCH MATCHES "x86_64|amd64") AND (NOT MSVC))
        include(CheckCXXCompilerFlag)
        check_cxx_compiler_flag(-mavx512f HAS_AVX512F_FLAG)
        check_cxx_compiler_flag(-mavx512bf16 HAS_AVX512BF16_FLAG)
        if (HAS_AVX512F_FLAG)
            target_compile_options(bitsandbytes PRIVATE -mavx512f)
            target_compile_options(bitsandbytes PRIVATE -mavx512dq)
            target_compile_options(bitsandbytes PRIVATE -mavx512bw)
            target_compile_options(bitsandbytes PRIVATE -mavx512vl)
        endif()
        if (HAS_AVX512BF16_FLAG)
            target_compile_options(bitsandbytes PRIVATE -mavx512bf16)
        endif()
        target_compile_options(
            bitsandbytes PRIVATE
            -mprefer-vector-width=256
            -mfma
            -mavx2
            -mlzcnt
            -mbmi
            -mbmi2
        )
    endif()
endif()


if(BUILD_CUDA)
    target_include_directories(bitsandbytes PUBLIC ${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES})
    target_link_libraries(bitsandbytes PUBLIC CUDA::cudart CUDA::cublas CUDA::cublasLt)
    set_target_properties(bitsandbytes
        PROPERTIES
            CUDA_SEPARABLE_COMPILATION ON
    )
endif()
if(BUILD_HIP)
    # Determine ROCM_PATH from environment variable, fallback to /opt/rocm on Linux
    if(DEFINED ENV{ROCM_PATH})
      set(ROCM_PATH $ENV{ROCM_PATH})
    else()
      set(ROCM_PATH /opt/rocm)
    endif()
    list(APPEND CMAKE_PREFIX_PATH ${ROCM_PATH})
    macro(find_package_and_print_version PACKAGE_NAME)
      find_package("${PACKAGE_NAME}" ${ARGN})
      message("${PACKAGE_NAME} VERSION: ${${PACKAGE_NAME}_VERSION}")
    endmacro()
    find_package_and_print_version(hipblas REQUIRED)
    find_package_and_print_version(hiprand REQUIRED)

    ## hacky way of excluding hip::amdhip64 (with it linked many tests unexpectedly fail e.g. adam8bit because of inaccuracies)
    ## On Windows, we need to link amdhip64 explicitly
    if(NOT WIN32)
        set_target_properties(hip::host PROPERTIES INTERFACE_LINK_LIBRARIES "")
        set_target_properties(hip-lang::host PROPERTIES INTERFACE_LINK_LIBRARIES "")
        set(CMAKE_HIP_IMPLICIT_LINK_LIBRARIES "")
    endif()

    target_include_directories(bitsandbytes PRIVATE ${CMAKE_SOURCE_DIR} ${CMAKE_SOURCE_DIR}/include ${ROCM_PATH}/include /include)
    target_link_directories(bitsandbytes PRIVATE ${ROCM_PATH}/lib /lib)
    target_link_libraries(bitsandbytes PUBLIC roc::hipblas hip::hiprand)

    # On Windows, rocblas is not pulled in transitively by roc::hipblas
    # and is needed because ops_hip.cuh uses rocblas_handle directly.
    if(WIN32)
        target_link_libraries(bitsandbytes PUBLIC rocblas)
    endif()

    target_compile_definitions(bitsandbytes PUBLIC BNB_USE_HIP)
    set_source_files_properties(${GPU_FILES} PROPERTIES LANGUAGE HIP)
    set_target_properties(bitsandbytes PROPERTIES LINKER_LANGUAGE CXX)

    if(HIP_VERSION VERSION_LESS "6.1")
        target_compile_definitions(bitsandbytes PUBLIC NO_HIPBLASLT)
    else()
        find_package(hipblaslt)
        target_link_libraries(bitsandbytes PUBLIC roc::hipblaslt)
    endif()
endif()
if(BUILD_MPS)
    add_dependencies(bitsandbytes metallib)
    target_link_libraries(bitsandbytes objc "-framework Foundation" "-framework Metal" "-framework MetalPerformanceShaders" "-framework MetalPerformanceShadersGraph")
endif()
if(BUILD_XPU)
    set(SYCL_LINK_FLAGS "-fsycl;--offload-compress;-fsycl-targets=spir64_gen,spir64;-Xs;-device pvc,xe-lpg,ats-m150 -options ' -cl-intel-enable-auto-large-GRF-mode -cl-poison-unsupported-fp64-kernels -cl-intel-greater-than-4GB-buffer-required'")
    set(SYCL_COMPILE_FLAGS "-fsycl;-fhonor-nans;-fhonor-infinities;-fno-associative-math;-fno-approx-func;-fno-sycl-instrument-device-code;--offload-compress;-fsycl-targets=spir64_gen,spir64;")

    set_property(TARGET bitsandbytes PROPERTY CXX_STANDARD 20)
    target_compile_options(bitsandbytes PRIVATE ${SYCL_COMPILE_FLAGS})
    target_link_options(bitsandbytes PRIVATE ${SYCL_LINK_FLAGS})

endif()

if(WIN32)
    set_target_properties(bitsandbytes PROPERTIES PREFIX "lib")
endif()
set_target_properties(bitsandbytes PROPERTIES OUTPUT_NAME ${BNB_OUTPUT_NAME})
if(MSVC)
    set_target_properties(bitsandbytes PROPERTIES LIBRARY_OUTPUT_DIRECTORY_RELEASE "${PROJECT_SOURCE_DIR}/bitsandbytes")
    set_target_properties(bitsandbytes PROPERTIES LIBRARY_OUTPUT_DIRECTORY_DEBUG "${PROJECT_SOURCE_DIR}/bitsandbytes")
    set_target_properties(bitsandbytes PROPERTIES RUNTIME_OUTPUT_DIRECTORY_RELEASE "${PROJECT_SOURCE_DIR}/bitsandbytes")
    set_target_properties(bitsandbytes PROPERTIES RUNTIME_OUTPUT_DIRECTORY_DEBUG "${PROJECT_SOURCE_DIR}/bitsandbytes")
endif()

set_target_properties(bitsandbytes PROPERTIES LIBRARY_OUTPUT_DIRECTORY "${PROJECT_SOURCE_DIR}/bitsandbytes")


================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

This Code of Conduct also applies outside the project spaces when there is a
reasonable belief that an individual's behavior may have a negative impact on
the project or its community.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <opensource-conduct@fb.com>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq


================================================
FILE: COMPILE_H100_L40.md
================================================
# Compiling bitsandbytes for H100 and L40 GPUs

This guide shows how to compile bitsandbytes from source specifically optimized for NVIDIA H100 and L40 GPUs.

## Prerequisites

- CMake >= 3.22.1
- Python >= 3.10
- GCC (version 9+ recommended)
- CUDA Toolkit (11.8+)
- PyTorch with CUDA support

Verify your system:
```bash
cmake --version
python3 --version
gcc --version
nvcc --version
```

## GPU Compute Capabilities

- **L40**: Compute Capability 8.9 (sm_89)
- **H100**: Compute Capability 9.0 (sm_90)

## Compilation Steps

### 1. Clean any previous build configuration

```bash
cd /path/to/bitsandbytes
rm -rf CMakeCache.txt CMakeFiles/ build/
```

### 2. Configure CMake for H100 and L40

```bash
cmake -DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY="89;90" -S .
```

This configures the build to target only compute capabilities 89 (L40) and 90 (H100), significantly reducing compilation time compared to building for all architectures.

### 3. Compile the library

```bash
make -j$(nproc)
```

This will create `bitsandbytes/libbitsandbytes_cuda<VERSION>.so` where `<VERSION>` matches your CUDA Toolkit version (e.g., `cuda124` for CUDA 12.4).
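
The naming pattern can be sketched in a few lines of Python. The helper name `cuda_library_name` is hypothetical and only mirrors the convention stated above (`cuda124` for CUDA 12.4); it is not part of bitsandbytes, and it assumes the Linux `.so` suffix.

```python
def cuda_library_name(cuda_version: str) -> str:
    """Return the expected library file name for a CUDA Toolkit version.

    Hypothetical helper: expects a "major.minor" string such as "12.4".
    Any patch component ("12.4.1") is ignored.
    """
    major, minor = cuda_version.split(".")[:2]
    return f"libbitsandbytes_cuda{int(major)}{int(minor)}.so"


print(cuda_library_name("12.4"))  # libbitsandbytes_cuda124.so
```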

### 4. Install the package

```bash
pip install -e .
```

Use the `-e` flag for an editable (development) install, or omit it for a regular installation.

### 5. Handle PyTorch CUDA version mismatch (if needed)

If your PyTorch was compiled with a different CUDA version than your Toolkit, you may need to create a symlink:

```bash
# Example: PyTorch uses CUDA 12.8, but you compiled with CUDA 12.4
ln -sf libbitsandbytes_cuda124.so bitsandbytes/libbitsandbytes_cuda128.so
```

Alternatively, set the environment variable:
```bash
export BNB_CUDA_VERSION=124  # Use your compiled CUDA version
```

### 6. Verify installation

```bash
python3 -c "import bitsandbytes as bnb; print(f'bitsandbytes version: {bnb.__version__}'); print('Success!')"
```

## Expected Output

After compilation, you should see:
- Binary file: `bitsandbytes/libbitsandbytes_cuda<VERSION>.so` (approximately 7MB when targeting only sm_89 and sm_90)
- Successful import in Python with no errors

## Compilation Time

Building for only H100/L40 (2 architectures) takes approximately **1-2 minutes** compared to **5+ minutes** when building for all 14+ compute capabilities.

## Troubleshooting

### Warning messages during compilation
Warnings like "variable declared but never referenced" are harmless and can be ignored.

### Wrong CUDA binary error
If you see `Configured CUDA binary not found`, check the following:
1. The compiled `.so` file exists in the `bitsandbytes/` directory.
2. The CUDA version in the file name matches the one PyTorch reports. If it does not, create a symlink as shown in step 5.
3. Alternatively, set the `BNB_CUDA_VERSION` environment variable to override the version.
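
The first check can be scripted. The sketch below lists which CUDA shortcodes have a compiled library in a directory; the function name `compiled_cuda_versions` is an assumption made for illustration, and the file-name pattern is taken from this guide rather than from the bitsandbytes loader.

```python
import re
from pathlib import Path

# File-name pattern used in this guide, e.g. libbitsandbytes_cuda124.so
_LIB_PATTERN = re.compile(r"^libbitsandbytes_cuda(\d+)\.so$")


def compiled_cuda_versions(package_dir):
    """Return the sorted CUDA shortcodes of compiled libraries in package_dir."""
    versions = []
    for path in Path(package_dir).iterdir():
        match = _LIB_PATTERN.match(path.name)
        if match:
            versions.append(match.group(1))
    return sorted(versions)


# Only runs the check when executed from the repository root.
package_dir = Path("bitsandbytes")
if package_dir.is_dir():
    print(compiled_cuda_versions(package_dir))
```

If the shortcode for PyTorch's CUDA version is missing from the printed list, the symlink or `BNB_CUDA_VERSION` workaround from step 5 applies.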

### CUDA version check
```bash
# Check your CUDA Toolkit version
nvcc --version

# Check PyTorch CUDA version
python3 -c "import torch; print(torch.version.cuda)"
```

## Notes

- The compiled library will **only work on GPUs with compute capability 8.9 or 9.0** (L40 and H100)
- For other GPUs, you'll need to recompile with appropriate compute capabilities
- The `-DCOMPUTE_CAPABILITY` flag accepts a semicolon-separated list: e.g., `"75;80;89;90"` for T4, A100, L40, and H100
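
As a small illustration, the flag value can be assembled from GPU names. The mapping below contains only the four GPUs named in this guide, and `compute_capability_flag` is a hypothetical helper, not something the build system provides.

```python
# Compute capabilities for the GPUs mentioned in this guide.
GPU_COMPUTE_CAPABILITY = {"T4": "75", "A100": "80", "L40": "89", "H100": "90"}


def compute_capability_flag(gpus):
    """Build the CMake argument for a list of GPU names (duplicates removed)."""
    capabilities = sorted({GPU_COMPUTE_CAPABILITY[gpu] for gpu in gpus}, key=int)
    return '-DCOMPUTE_CAPABILITY="' + ";".join(capabilities) + '"'


print(compute_capability_flag(["H100", "L40"]))  # -DCOMPUTE_CAPABILITY="89;90"
```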


================================================
FILE: CONTRIBUTING.md
================================================
# Contributing to bitsandbytes
We want to make contributing to this project as easy and transparent as
possible.

## Pull Requests
We actively welcome your pull requests.

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints. To do so, install the [pre-commit hooks as documented here](https://huggingface.co/docs/bitsandbytes/main/en/contributing#setup-pre-commit-hooks).

## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

## License
By contributing to bitsandbytes, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) Facebook, Inc. and its affiliates.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: MANIFEST.in
================================================
include CMakeLists.txt
graft csrc
graft include


================================================
FILE: NOTICE.md
================================================
The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms: PyTorch is licensed under the BSD license.


================================================
FILE: README.md
================================================
<p align="center"><img src="https://avatars.githubusercontent.com/u/175231607?s=200&v=4" alt=""></p>
<h1 align="center">bitsandbytes</h1>
<p align="center">
    <a href="https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/github/license/bitsandbytes-foundation/bitsandbytes.svg?color=blue"></a>
    <a href="https://pepy.tech/project/bitsandbytes"><img alt="Downloads" src="https://static.pepy.tech/badge/bitsandbytes/month"></a>
    <a href="https://github.com/bitsandbytes-foundation/bitsandbytes/actions/workflows/tests-nightly.yml"><img alt="Nightly Unit Tests" src="https://img.shields.io/github/actions/workflow/status/bitsandbytes-foundation/bitsandbytes/tests-nightly.yml?logo=github&label=Nightly%20Tests"></a>
    <a href="https://github.com/bitsandbytes-foundation/bitsandbytes/releases"><img alt="GitHub Release" src="https://img.shields.io/github/v/release/bitsandbytes-foundation/bitsandbytes"></a>
    <a href="https://pypi.org/project/bitsandbytes/"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/bitsandbytes"></a>
</p>

`bitsandbytes` enables accessible large language models via k-bit quantization for PyTorch. We provide three main features for dramatically reducing memory consumption for inference and training:

* 8-bit optimizers use block-wise quantization to maintain 32-bit performance at a small fraction of the memory cost.
* LLM.int8() or 8-bit quantization enables large language model inference with only half the required memory and without any performance degradation. This method uses vector-wise quantization to quantize most features to 8 bits and treats outliers separately with 16-bit matrix multiplication.
* QLoRA or 4-bit quantization enables large language model training with several memory-saving techniques that don't compromise performance. This method quantizes a model to 4-bits and inserts a small set of trainable low-rank adaptation (LoRA) weights to allow training.

The library includes quantization primitives for 8-bit and 4-bit operations through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit`, and 8-bit optimizers through the `bitsandbytes.optim` module.
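
To give a feel for block-wise quantization, here is a minimal pure-Python sketch. It is a simplification under stated assumptions: it uses linear signed 8-bit codes and a tiny block size, and the names `quantize_blockwise` and `dequantize_blockwise` here are illustrative stand-ins. The library's kernels may use a different code book and block size.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize floats to signed 8-bit codes with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Invert quantize_blockwise up to rounding error."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


# The second block holds much larger values than the first.
values = [0.01, -0.02, 0.03, 0.04, 5.0, -10.0, 2.5, 7.5]
codes, scales = quantize_blockwise(values)
restored = dequantize_blockwise(codes, scales)
print(scales)
print(restored)
```

Because each block carries its own scale, the large values in the second block do not reduce the resolution available to the small values in the first block.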

## System Requirements
bitsandbytes has the following minimum requirements for all platforms:

* Python 3.10+
* [PyTorch](https://pytorch.org/get-started/locally/) 2.3+
  * _Note: While we aim to provide wide backwards compatibility, we recommend using the latest version of PyTorch for the best experience._

#### Accelerator support:

<small>Note: this table reflects the status of the current development branch. For the latest stable release, see the
[document in the 0.49.2 tag](https://github.com/bitsandbytes-foundation/bitsandbytes/blob/0.49.2/README.md#accelerator-support).
</small>

##### Legend:
🚧 = In Development,
〰️ = Partially Supported,
✅ = Supported,
🐢 = Slow Implementation Supported,
❌ = Not Supported

<table>
  <thead>
    <tr>
      <th>Platform</th>
      <th>Accelerator</th>
      <th>Hardware Requirements</th>
      <th>LLM.int8()</th>
      <th>QLoRA 4-bit</th>
      <th>8-bit Optimizers</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td colspan="6">🐧 <strong>Linux, glibc >= 2.24</strong></td>
    </tr>
    <tr>
      <td align="right">x86-64</td>
      <td>◻️ CPU</td>
      <td>Minimum: AVX2<br>Optimized: AVX512F, AVX512BF16</td>
      <td>✅</td>
      <td>✅</td>
      <td>❌</td>
    </tr>
    <tr>
      <td></td>
      <td>🟩 NVIDIA GPU <br><code>cuda</code></td>
      <td>SM60+ minimum<br>SM75+ recommended</td>
      <td>✅</td>
      <td>✅</td>
      <td>✅</td>
    </tr>
    <tr>
      <td></td>
      <td>🟥 AMD GPU <br><code>cuda</code></td>
      <td>
        CDNA: gfx90a, gfx942, gfx950<br>
        RDNA: gfx1100, gfx1101, gfx1102, gfx1103, gfx1150, gfx1151, gfx1152, gfx1153, gfx1200, gfx1201
      </td>
      <td>✅</td>
      <td>✅</td>
      <td>✅</td>
    </tr>
    <tr>
      <td></td>
      <td>🟦 Intel GPU <br><code>xpu</code></td>
      <td>
        Data Center GPU Max Series<br>
        Arc A-Series (Alchemist)<br>
        Arc B-Series (Battlemage)
      </td>
      <td>✅</td>
      <td>✅</td>
      <td>〰️</td>
    </tr>
    <tr>
      <td></td>
      <td>🟪 Intel Gaudi <br><code>hpu</code></td>
      <td>Gaudi2, Gaudi3</td>
      <td>✅</td>
      <td>〰️</td>
      <td>❌</td>
    </tr>
    <tr>
      <td align="right">aarch64</td>
      <td>◻️ CPU</td>
      <td></td>
      <td>✅</td>
      <td>✅</td>
      <td>❌</td>
    </tr>
    <tr>
      <td></td>
      <td>🟩 NVIDIA GPU <br><code>cuda</code></td>
      <td>SM75+</td>
      <td>✅</td>
      <td>✅</td>
      <td>✅</td>
    </tr>
    <tr>
      <td colspan="6">🪟 <strong>Windows 11 / Windows Server 2022+</strong></td>
    </tr>
    <tr>
      <td align="right">x86-64</td>
      <td>◻️ CPU</td>
      <td>AVX2</td>
      <td>✅</td>
      <td>✅</td>
      <td>❌</td>
    </tr>
    <tr>
      <td></td>
      <td>🟩 NVIDIA GPU <br><code>cuda</code></td>
      <td>SM60+ minimum<br>SM75+ recommended</td>
      <td>✅</td>
      <td>✅</td>
      <td>✅</td>
    </tr>
    <tr>
      <td></td>
      <td>🟦 Intel GPU <br><code>xpu</code></td>
      <td>
        Arc A-Series (Alchemist) <br>
        Arc B-Series (Battlemage)
      </td>
      <td>✅</td>
      <td>✅</td>
      <td>〰️</td>
    </tr>
    <tr>
      <td colspan="6">🍎 <strong>macOS 14+</strong></td>
    </tr>
    <tr>
      <td align="right">arm64</td>
      <td>◻️ CPU</td>
      <td>Apple M1+</td>
      <td>✅</td>
      <td>✅</td>
      <td>❌</td>
    </tr>
    <tr>
      <td></td>
      <td>⬜ Metal <br><code>mps</code></td>
      <td>Apple M1+</td>
      <td>🐢</td>
      <td>🐢</td>
      <td>❌</td>
    </tr>
  </tbody>
</table>

## :book: Documentation
* [Official Documentation](https://huggingface.co/docs/bitsandbytes/main)
* 🤗 [Transformers](https://huggingface.co/docs/transformers/quantization/bitsandbytes)
* 🤗 [Diffusers](https://huggingface.co/docs/diffusers/quantization/bitsandbytes)
* 🤗 [PEFT](https://huggingface.co/docs/peft/developer_guides/quantization#quantize-a-model)

## :heart: Sponsors
The continued maintenance and development of `bitsandbytes` is made possible thanks to the generous support of our sponsors. Their contributions help ensure that we can keep improving the project and delivering valuable updates to the community.

<kbd><a href="https://hf.co" target="_blank"><img width="100" src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.svg" alt="Hugging Face"></a></kbd>
&nbsp;
<kbd><a href="https://intel.com" target="_blank"><img width="100" src="https://avatars.githubusercontent.com/u/17888862?s=100&v=4" alt="Intel"></a></kbd>

## License
`bitsandbytes` is MIT licensed.

## How to cite us
If you found this library useful, please consider citing our work:

### QLoRA

```bibtex
@article{dettmers2023qlora,
  title={Qlora: Efficient finetuning of quantized llms},
  author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:2305.14314},
  year={2023}
}
```

### LLM.int8()

```bibtex
@article{dettmers2022llmint8,
  title={LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale},
  author={Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:2208.07339},
  year={2022}
}
```

### 8-bit Optimizers

```bibtex
@article{dettmers2022optimizers,
  title={8-bit Optimizers via Block-wise Quantization},
  author={Dettmers, Tim and Lewis, Mike and Shleifer, Sam and Zettlemoyer, Luke},
  journal={9th International Conference on Learning Representations, ICLR},
  year={2022}
}
```


================================================
FILE: SECURITY.md
================================================
# Security Policy

## Supported Versions

We provide security updates for the latest stable minor release line.

| Version  | Supported |
| -------- | --------- |
| 0.49.x   | ✅        |
| < 0.49.x | ❌        |

> Note: Pre-releases, development builds, and commits on `main` are not considered supported release versions. If you believe you have found a vulnerability in unreleased code, please still report it following the process below.

## Reporting a Vulnerability

Please report security issues **privately** using the GitHub Security Advisory tool to create a new draft advisory:

- https://github.com/bitsandbytes-foundation/bitsandbytes/security/advisories/new

Do not open a public GitHub issue for security-sensitive reports.

### What to include

To help us triage and respond quickly, please include:

- A clear description of the issue and potential impact
- Affected version(s) and environment details (OS, GPU type, CUDA version, Python version, PyTorch version, etc.)
- Steps to reproduce (ideally a minimal proof of concept)
- Any relevant logs, crash traces, or screenshots
- Any known mitigations or workarounds

## Response process

We will review reports filed via GitHub Security Advisories and collaborate with the reporter in the advisory thread to:

- Confirm and reproduce the report
- Assess severity and affected versions
- Identify mitigations and/or prepare a fix
- Coordinate any follow-up needed prior to broader communication


================================================
FILE: _typos.toml
================================================
[files]
# Skip these files in typo checks
extend-exclude = [
    "agents/*.md",
    "csrc/xpu_ops.h",
    "csrc/xpu_ops.cpp",
    "csrc/xpu_kernels.h",
    "csrc/xpu_kernels.cpp"
]

[default]
extend-ignore-re = [
    "@Ther-nul",  # valid Github user
]
extend-ignore-identifiers-re = [
    ".*arange.*",
    ".*ARANGE.*",
]

[type.py.extend-words]
"BA" = "BA"  # used as a commented-out variable in tests

[type.cuda.extend-words]
"subtile" = "subtile"
"subtiles" = "subtiles"
"transation" = "transation"  # TODO: is this transition, transaction, translation..?


================================================
FILE: agents/api_surface.md
================================================
# bitsandbytes Public API Surface

This document catalogs every public symbol in the bitsandbytes library, organized by
subsystem. For each symbol it lists: the module path, what it is, its stability status,
and its signature or key attributes. A reviewer can use this to quickly check whether a
PR is adding, removing, or modifying public API correctly.

**Version at time of writing:** 0.49.2.dev0

---

## Table of Contents

1. [Top-Level Exports (`bitsandbytes`)](#1-top-level-exports)
2. [Neural Network Modules (`bitsandbytes.nn`)](#2-neural-network-modules)
3. [Optimizers (`bitsandbytes.optim`)](#3-optimizers)
4. [Functional API (`bitsandbytes.functional`)](#4-functional-api)
5. [Autograd Functions (`bitsandbytes.autograd._functions`)](#5-autograd-functions)
6. [Torch Custom Ops (`bitsandbytes._ops`)](#6-torch-custom-ops)
7. [Research / Experimental (`bitsandbytes.research`)](#7-research--experimental)
8. [Utilities (`bitsandbytes.utils`)](#8-utilities)
9. [Native Library Interface (`bitsandbytes.cextension`)](#9-native-library-interface)
10. [Backend System (`bitsandbytes.backends`)](#10-backend-system)
11. [Deprecated Symbols](#11-deprecated-symbols)
12. [Downstream Integration Points](#12-downstream-integration-points)
13. [Stability Tiers](#13-stability-tiers)

---

## 1. Top-Level Exports

These are available directly as `import bitsandbytes as bnb; bnb.<symbol>`.

### Re-exported from submodules

| Symbol | Origin | Type | Notes |
|--------|--------|------|-------|
| `bnb.MatmulLtState` | `autograd._functions` | dataclass | State container for 8-bit matmul |
| `bnb.matmul` | `autograd._functions` | function | 8-bit matrix multiplication |
| `bnb.matmul_4bit` | `autograd._functions` | function | 4-bit matrix multiplication |
| `bnb.modules` | `nn.modules` | module | nn module namespace |
| `bnb.adam` | `optim.adam` | module | Adam optimizer namespace |
| `bnb.research` | `research` | module | Research/experimental namespace |
| `bnb.utils` | `utils` | module | Utilities namespace |

### Module-level attributes

| Symbol | Type | Value/Description |
|--------|------|-------------------|
| `bnb.__version__` | `str` | `"0.49.2.dev0"` |
| `bnb.features` | `set` | `{"multi_backend"}` — Integration signal for transformers/diffusers |
| `bnb.supported_torch_devices` | `set` | `{"cpu", "cuda", "xpu", "hpu", "npu", "mps"}` |
| `bnb.__pdoc__` | `dict` | Controls pdoc visibility for internal classes |

### Backend auto-loading

On import, bitsandbytes conditionally imports backend modules based on device availability:

- `backends.cpu.ops` — Always loaded
- `backends.default.ops` — Always loaded
- `backends.cuda.ops` — Loaded if `torch.cuda.is_available()`
- `backends.xpu.ops` — Loaded if `torch.xpu.is_available()`
- `backends.hpu.ops` — Loaded if `habana_frameworks` is importable and `torch.hpu.is_available()`

Additionally, `_import_backends()` discovers external packages with `bitsandbytes.backends`
entry points (pip-installed backend plugins).
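
For reviewers unfamiliar with entry points, the following sketch shows how a group such as `bitsandbytes.backends` can be enumerated with the standard library. It illustrates the general mechanism only; it is not the implementation of `_import_backends()`, and the function name `discover_backend_entry_points` is hypothetical.

```python
from importlib.metadata import entry_points


def discover_backend_entry_points(group="bitsandbytes.backends"):
    """Return the sorted names of installed entry points in the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):  # selectable API, Python 3.10+
        selected = eps.select(group=group)
    else:  # older dict-style API
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# Prints an empty list when no backend plugin is installed.
print(discover_backend_entry_points())
```

Each entry point object also has a `load()` method that imports the target it refers to.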

---

## 2. Neural Network Modules

**Import path:** `from bitsandbytes.nn import <Class>`

All modules are in `bitsandbytes.nn.modules` and re-exported through `bitsandbytes.nn.__init__`.

### 2.1 Linear Layers

#### `Linear4bit` — 4-bit quantized linear layer (QLoRA)

```
bitsandbytes.nn.Linear4bit(
    input_features: int,
    output_features: int,
    bias: bool = True,
    compute_dtype: Optional[torch.dtype] = None,
    compress_statistics: bool = True,
    quant_type: str = "fp4",
    quant_storage: torch.dtype = torch.uint8,
    device = None,
)
```

**Parent:** `torch.nn.Linear`
**Stability:** Stable — Core API, used extensively by transformers and PEFT.
**Behavior:**
- Weights are stored as `Params4bit` (quantized on `.to(device)`)
- Forward: dequantizes, computes matmul via `bnb.matmul_4bit`
- `compute_dtype` controls the dtype used for the matmul computation
- `compress_statistics` enables double quantization of absmax values (saves memory)
- `quant_type` selects the 4-bit quantization scheme: `"fp4"` or `"nf4"`
- `quant_storage` controls the packed storage dtype (default: `torch.uint8`)
- State dict serialization includes packed `QuantState` for safetensors compatibility
- CPU inference path supports AVX512BF16 acceleration via packed weight format
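
As an illustration of what packed storage means for the default `torch.uint8`, the sketch below packs two 4-bit codes into each byte using plain Python. The nibble order (first code in the high bits) and the helper names `pack_4bit` and `unpack_4bit` are assumptions made for this example and may not match the library's layout.

```python
def pack_4bit(codes):
    """Pack 4-bit codes (0..15) into bytes, two codes per byte."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)  # pad to an even number of codes
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


def unpack_4bit(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:count]


codes = [1, 15, 7, 0, 9]
packed = pack_4bit(codes)
print(packed.hex())  # 1f7090
print(unpack_4bit(packed, len(codes)))  # [1, 15, 7, 0, 9]
```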

#### `LinearFP4` — Convenience wrapper for FP4

```
bitsandbytes.nn.LinearFP4(
    input_features, output_features, bias=True,
    compute_dtype=None, compress_statistics=True,
    quant_storage=torch.uint8, device=None,
)
```

**Parent:** `Linear4bit` with `quant_type="fp4"` hardcoded.
**Stability:** Stable.

#### `LinearNF4` — Convenience wrapper for NF4

```
bitsandbytes.nn.LinearNF4(
    input_features, output_features, bias=True,
    compute_dtype=None, compress_statistics=True,
    quant_storage=torch.uint8, device=None,
)
```

**Parent:** `Linear4bit` with `quant_type="nf4"` hardcoded.
**Stability:** Stable.

#### `Linear8bitLt` — 8-bit linear layer (LLM.int8())

```
bitsandbytes.nn.Linear8bitLt(
    input_features: int,
    output_features: int,
    bias: bool = True,
    has_fp16_weights: bool = True,
    threshold: float = 0.0,
    index = None,
    device = None,
)
```

**Parent:** `torch.nn.Linear`
**Stability:** Stable — Core API for LLM.int8().
**Behavior:**
- Weights stored as `Int8Params` (quantized on `.to(device)` if `has_fp16_weights=False`)
- `has_fp16_weights=True`: weights stay in fp16, quantized on-the-fly each forward pass
- `has_fp16_weights=False`: weights quantized once on `.to(device)`, stored as int8
- `threshold > 0.0`: enables mixed-precision decomposition (outlier columns in fp16, rest in int8)
- `threshold == 0.0`: all columns quantized to int8
- Forward: calls `bnb.matmul(x, self.weight, bias, state)`
- State dict includes SCB (column scaling factors) and weight_format metadata
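
The effect of `threshold` can be sketched without any dependencies. The code below is a simplified model, not the library's kernel: it treats an input feature column as an outlier if any magnitude in it exceeds the threshold, multiplies those columns in floating point, and sends the rest through an absmax int8 path. The function names are hypothetical.

```python
def quantize_rows(matrix):
    """Return int8 codes and one absmax scale per row."""
    scales = [max((abs(v) for v in row), default=0.0) or 1.0 for row in matrix]
    codes = [[round(v / s * 127) for v in row] for row, s in zip(matrix, scales)]
    return codes, scales


def mixed_precision_matmul(x, w, threshold):
    """Compute x @ w, keeping outlier feature columns in floating point."""
    n_features, n_out = len(w), len(w[0])
    outliers = [k for k in range(n_features)
                if threshold > 0.0 and any(abs(row[k]) > threshold for row in x)]
    regular = [k for k in range(n_features) if k not in outliers]

    # int8 path: one scale per row of x and one per output column of w.
    xq, sx = quantize_rows([[row[k] for k in regular] for row in x])
    wq, sw = quantize_rows([[w[k][j] for k in regular] for j in range(n_out)])

    result = []
    for i, row in enumerate(x):
        out_row = []
        for j in range(n_out):
            acc = sum(a * b for a, b in zip(xq[i], wq[j]))  # integer accumulate
            low = acc * sx[i] * sw[j] / (127 * 127)  # dequantize
            high = sum(row[k] * w[k][j] for k in outliers)  # floating point
            out_row.append(low + high)
        result.append(out_row)
    return result, outliers


x = [[0.1, -0.2, 8.0], [0.3, 0.1, -9.0]]
w = [[0.5, -0.5], [0.25, 0.75], [1.0, -1.0]]
y, outlier_columns = mixed_precision_matmul(x, w, threshold=6.0)
print(outlier_columns)  # [2]
print(y)
```

With `threshold=0.0` the same function quantizes every column, so the large values in the third column set the row scales and reduce the precision of the small ones.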

#### `OutlierAwareLinear` — Base class for outlier-aware quantization

```
bitsandbytes.nn.OutlierAwareLinear(
    input_features, output_features, bias=True, device=None,
)
```

**Parent:** `torch.nn.Linear`
**Stability:** Experimental / semi-public.
**Notes:** Requires `OutlierTracer.initialize(model)` before use. Abstract methods
`forward_with_outliers` and `quantize_weight` must be overridden.

#### `SwitchBackLinearBnb` — SwitchBack linear using bnb backend

```
bitsandbytes.nn.SwitchBackLinearBnb(
    input_features, output_features, bias=True,
    has_fp16_weights=True, memory_efficient_backward=False,
    threshold=0.0, index=None, device=None,
)
```

**Parent:** `torch.nn.Linear`
**Stability:** Experimental.
**Notes:** Uses `Int8Params` + `MatmulLtState`. Calls `bnb.matmul_mixed` for int8 matmul with mixed precision in forward.

### 2.2 Triton-Based Linear Layers

These layers require triton to be installed. Import them from `bitsandbytes.nn`.

#### `SwitchBackLinear` — Triton-based SwitchBack

```
bitsandbytes.nn.SwitchBackLinear(
    in_features: int, out_features: int, bias: bool = True,
    device=None, dtype=None,
    vector_wise_quantization: bool = False,
    mem_efficient: bool = False,
)
```

**Parent:** `torch.nn.Linear`
**Stability:** Experimental — requires triton.
**Notes:** Has a `prepare_for_eval()` method that pre-quantizes weights.

#### `SwitchBackLinearGlobal`

`functools.partial(SwitchBackLinear, vector_wise_quantization=False)`
**Stability:** Experimental.

#### `SwitchBackLinearVectorwise`

`functools.partial(SwitchBackLinear, vector_wise_quantization=True)`
**Stability:** Experimental.
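
Because these two names are `functools.partial` objects rather than subclasses, they behave slightly differently from ordinary classes. The sketch below uses a hypothetical stand-in class, `ToyLinear`, to show what pre-binding a keyword does; nothing in it comes from bitsandbytes.

```python
import functools


class ToyLinear:
    """Hypothetical stand-in for a layer class; not part of bitsandbytes."""

    def __init__(self, in_features, out_features, vector_wise_quantization=False):
        self.in_features = in_features
        self.out_features = out_features
        self.vector_wise_quantization = vector_wise_quantization


# Pre-bind one keyword argument; the result is called like the class itself.
ToyLinearVectorwise = functools.partial(ToyLinear, vector_wise_quantization=True)

layer = ToyLinearVectorwise(16, 32)
print(layer.vector_wise_quantization)         # True
print(isinstance(layer, ToyLinear))           # True
print(isinstance(ToyLinearVectorwise, type))  # False: a partial is not a class
```

One practical consequence is that a partial cannot be used as the second argument of `isinstance`; instances should be checked against the underlying class instead.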

#### `StandardLinear` — Standard linear with explicit autograd

```
bitsandbytes.nn.StandardLinear
```

**Parent:** `torch.nn.Linear`
**Stability:** Experimental — utility/baseline.

### 2.3 Embedding Layers

#### `StableEmbedding` — Embedding with 32-bit optimizer states

```
bitsandbytes.nn.StableEmbedding(
    num_embeddings: int, embedding_dim: int,
    padding_idx=None, max_norm=None, norm_type=2.0,
    scale_grad_by_freq=False, sparse=False,
    _weight=None, device=None, dtype=None,
)
```

**Parent:** `torch.nn.Embedding`
**Stability:** Stable.
**Notes:** Xavier uniform init + LayerNorm applied after embedding lookup. Automatically
registers 32-bit optimizer override via `GlobalOptimManager`.

#### `Embedding` — Embedding with 32-bit optimizer states

```
bitsandbytes.nn.Embedding(
    num_embeddings: int, embedding_dim: int,
    padding_idx=None, max_norm=None, norm_type=2.0,
    scale_grad_by_freq=False, sparse=False,
    _weight=None, device=None,
)
```

**Parent:** `torch.nn.Embedding`
**Stability:** Stable.
**Notes:** Like `StableEmbedding` but without LayerNorm. Uses Xavier uniform init and registers
a 32-bit optimizer override via `GlobalOptimManager`.

#### `Embedding8bit` — Int8 quantized embedding

```
bitsandbytes.nn.Embedding8bit(
    num_embeddings, embedding_dim, device=None, dtype=None,
)
```

**Parent:** `torch.nn.Embedding`
**Stability:** Stable.
**Notes:** Weight stored as `Int8Params`. Saving (`_save_to_state_dict`) is NOT implemented
(raises `NotImplementedError`).

#### `Embedding4bit` — 4-bit quantized embedding

```
bitsandbytes.nn.Embedding4bit(
    num_embeddings, embedding_dim, dtype=None,
    quant_type="fp4", quant_storage=torch.uint8, device=None,
)
```

**Parent:** `torch.nn.Embedding`
**Stability:** Stable.
**Notes:** Weight stored as `Params4bit`. Uses partial dequantization when
`embedding_dim % blocksize == 0`. Saving is NOT implemented.
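
A rough way to see why the divisibility condition matters: when every embedding row spans a whole number of quantization blocks, the blocks belonging to one row can be located and dequantized on their own. The sketch below assumes a row-major flattened weight with blocks taken consecutively; `blocks_for_row` is a hypothetical helper, not a library function.

```python
def blocks_for_row(row, embedding_dim, blocksize):
    """Return (first_block, last_block_exclusive) covering one embedding row,
    or None when rows are not aligned to block boundaries."""
    if embedding_dim % blocksize != 0:
        return None  # blocks straddle rows, so a per-row lookup is not possible
    blocks_per_row = embedding_dim // blocksize
    first = row * blocks_per_row
    return first, first + blocks_per_row


print(blocks_for_row(row=3, embedding_dim=256, blocksize=64))  # (12, 16)
print(blocks_for_row(row=3, embedding_dim=100, blocksize=64))  # None
```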

#### `EmbeddingFP4` — Convenience wrapper

```
bitsandbytes.nn.EmbeddingFP4(num_embeddings, embedding_dim, dtype=None, quant_storage=torch.uint8, device=None)
```

**Parent:** `Embedding4bit` with `quant_type="fp4"`.

#### `EmbeddingNF4` — Convenience wrapper

```
bitsandbytes.nn.EmbeddingNF4(num_embeddings, embedding_dim, dtype=None, quant_storage=torch.uint8, device=None)
```

**Parent:** `Embedding4bit` with `quant_type="nf4"`.

### 2.4 Parameter Types

#### `Params4bit` — 4-bit quantized parameter

```
bitsandbytes.nn.Params4bit(
    data: Optional[torch.Tensor] = None,
    requires_grad: bool = False,
    quant_state: Optional[QuantState] = None,
    blocksize: Optional[int] = None,        # default: 64 (128 on ROCm)
    compress_statistics: bool = True,
    quant_type: str = "fp4",
    quant_storage: torch.dtype = torch.uint8,
    module: Optional[Linear4bit] = None,
    bnb_quantized: bool = False,
)
```

**Parent:** `torch.nn.Parameter`
**Stability:** Stable — essential for 4-bit workflows.
**Key behaviors:**
- `.to(device)` triggers quantization on first move to non-meta device
- `_quantize(device)` calls `bnb.functional.quantize_4bit`
- Custom `__torch_function__` for `torch.chunk` and `torch.split` to preserve quant state
- `from_prequantized(data, quantized_stats, ...)` class method for loading pre-quantized weights
- Custom `__deepcopy__`, `__copy__`, `__getstate__`, `__setstate__` for serialization
- `.cpu()`, `.cuda()`, `.xpu()` handle CPU packing format conversion

#### `Int8Params` — 8-bit quantized parameter

```
bitsandbytes.nn.Int8Params(
    data: Optional[torch.Tensor] = None,
    requires_grad: bool = True,
    has_fp16_weights: bool = False,
    CB: Optional[torch.Tensor] = None,
    SCB: Optional[torch.Tensor] = None,
)
```

**Parent:** `torch.nn.Parameter`
**Stability:** Stable — essential for 8-bit workflows.
**Key behaviors:**
- `.to(device)` triggers quantization if moving from CPU to non-meta device and not already quantized
- `_quantize(device)` calls `bnb.functional.int8_vectorwise_quant`
- `.CB` stores the int8 quantized data
- `.SCB` stores the per-row scaling factors
- `has_fp16_weights=True` skips quantization entirely
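
The relationship between `.CB` and `.SCB` can be illustrated with plain Python lists. This is a minimal sketch of row-wise absmax quantization under the assumption that each row is scaled so its largest magnitude maps to 127; it does not call `int8_vectorwise_quant`, and `quantize_rows_int8` is a hypothetical name.

```python
def quantize_rows_int8(rows):
    """Quantize each row to the int8 range with its own absmax scale.

    Returns (CB, SCB): the integer codes and the per-row absmax values.
    """
    CB, SCB = [], []
    for row in rows:
        absmax = max(abs(x) for x in row)
        scale = 127.0 / absmax if absmax > 0 else 0.0
        CB.append([round(x * scale) for x in row])
        SCB.append(absmax)
    return CB, SCB


weights = [[0.3, -1.0, 0.25], [1.0, 0.0, -4.0]]
CB, SCB = quantize_rows_int8(weights)
print(CB)   # [[38, -127, 32], [32, 0, -127]]
print(SCB)  # [1.0, 4.0]
```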

---

## 3. Optimizers

**Import path:** `from bitsandbytes.optim import <Class>`

All optimizers follow the same pattern: a base class accepts `optim_bits` to select
32-bit or 8-bit state, and concrete classes fix the bit width. The base classes also accept
`is_paged=True`, which places optimizer state in CUDA managed memory so it can be offloaded
to CPU; dedicated `Paged*` classes exist only for some families (see 3.2).

### 3.1 Base Classes

#### `GlobalOptimManager` — Singleton for per-parameter optimizer config overrides

```
bitsandbytes.optim.GlobalOptimManager.get_instance()
```

**Methods:**
- `register_parameters(params)` — Register parameters for config lookup
- `override_config(parameters, key=None, value=None, key_value_dict=None)` — Override optimizer hyperparams per parameter
- `register_module_override(module, param_name, config)` — Register module-level overrides

**Stability:** Stable — used by StableEmbedding, Embedding to force 32-bit states.
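
The singleton-plus-override idea can be sketched without the library. Everything below is hypothetical: `ToyOverrideManager` and its `config_for` method are stand-ins that only mirror the general shape of a `get_instance()` singleton holding per-parameter overrides, and the real class has a different interface.

```python
class ToyOverrideManager:
    """Hypothetical stand-in for a singleton that stores per-parameter overrides."""

    _instance = None

    def __init__(self):
        self._overrides = {}  # id(parameter) -> dict of overridden hyperparameters

    @classmethod
    def get_instance(cls):
        # Create the shared instance lazily on first use.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def override_config(self, parameter, key, value):
        self._overrides.setdefault(id(parameter), {})[key] = value

    def config_for(self, parameter, defaults):
        # Start from the optimizer defaults, then apply any overrides.
        merged = dict(defaults)
        merged.update(self._overrides.get(id(parameter), {}))
        return merged


embedding_weight = object()  # placeholder for a parameter tensor
other_weight = object()
manager = ToyOverrideManager.get_instance()
manager.override_config(embedding_weight, "optim_bits", 32)

defaults = {"optim_bits": 8, "lr": 1e-3}
print(manager.config_for(embedding_weight, defaults)["optim_bits"])  # 32
print(manager.config_for(other_weight, defaults)["optim_bits"])      # 8
print(ToyOverrideManager.get_instance() is manager)                  # True
```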

#### `Optimizer8bit` — Base class for all bnb optimizers

```
bitsandbytes.optim.optimizer.Optimizer8bit(params, defaults, optim_bits=32, is_paged=False)
```

**Parent:** `torch.optim.Optimizer`
**Stability:** Semi-public — users don't instantiate directly.
**Key features:**
- Custom `state_dict()` / `load_state_dict()` for FSDP compatibility
  (wraps quant state tensors in nested dict to prevent FSDP gather failures)
- `non_castable_tensor_keys`: set of state keys that should not be dtype-cast during load
- `is_paged`: enables CUDA managed memory for optimizer states
- `fill_qmap()`: initializes dynamic quantization maps

#### `Optimizer2State` — Base for 2-state optimizers (Adam, AdamW, LAMB, AdEMAMix)

```
bitsandbytes.optim.optimizer.Optimizer2State(
    optimizer_name, params, lr=1e-3, betas=(0.9, 0.999),
    eps=1e-8, weight_decay=0.0, optim_bits=32, args=None,
    min_8bit_size=4096, max_unorm=0.0, skip_zeros=False,
    is_paged=False, alpha=0.0, t_alpha=None, t_beta3=None,
)
```

**Parent:** `Optimizer8bit`
**Stability:** Semi-public.

#### `Optimizer1State` — Base for 1-state optimizers (SGD, Adagrad, RMSprop, LARS, Lion)

```
bitsandbytes.optim.optimizer.Optimizer1State(
    optimizer_name, params, lr=1e-3, betas=(0.9, 0.0),
    eps=1e-8, weight_decay=0.0, optim_bits=32, args=None,
    min_8bit_size=4096, max_unorm=0.0, skip_zeros=False,
    is_paged=False,
)
```

**Parent:** `Optimizer8bit`
**Stability:** Semi-public.

### 3.2 Concrete Optimizer Classes

Each family follows the naming pattern `Name` (configurable bits), `Name8bit` (fixed 8-bit state), and
`Name32bit` (fixed 32-bit state). Families with paged support add `PagedName` (paged, configurable), `PagedName8bit`, and `PagedName32bit`.

#### Adam Family (2-state, `optimizer_name="adam"`)

| Class | Parent | `optim_bits` | `is_paged` |
|-------|--------|-------------|------------|
| `Adam` | `Optimizer2State` | configurable (default 32) | `False` |
| `Adam8bit` | `Optimizer2State` | 8 (hardcoded) | `False` |
| `Adam32bit` | `Optimizer2State` | 32 (hardcoded) | `False` |
| `PagedAdam` | `Optimizer2State` | configurable (default 32) | `True` |
| `PagedAdam8bit` | `Optimizer2State` | 8 (hardcoded) | `True` |
| `PagedAdam32bit` | `Optimizer2State` | 32 (hardcoded) | `True` |

**Stability:** Stable.

#### AdamW Family (2-state, `optimizer_name="adam"`, decoupled weight decay)

| Class | Parent | `optim_bits` | `is_paged` |
|-------|--------|-------------|------------|
| `AdamW` | `Optimizer2State` | configurable | `False` |
| `AdamW8bit` | `Optimizer2State` | 8 | `False` |
| `AdamW32bit` | `Optimizer2State` | 32 | `False` |
| `PagedAdamW` | `Optimizer2State` | configurable | `True` |
| `PagedAdamW8bit` | `Optimizer2State` | 8 | `True` |
| `PagedAdamW32bit` | `Optimizer2State` | 32 | `True` |

**Stability:** Stable.

#### AdEMAMix Family (2-state, `optimizer_name="ademamix"`)

| Class | Parent | `optim_bits` | `is_paged` |
|-------|--------|-------------|------------|
| `AdEMAMix` | `Optimizer2State` | configurable | `False` |
| `AdEMAMix8bit` | `AdEMAMix` | 8 | `False` |
| `AdEMAMix32bit` | `Optimizer2State` | 32 | `False` |
| `PagedAdEMAMix` | `AdEMAMix` | configurable | `True` |
| `PagedAdEMAMix8bit` | `AdEMAMix8bit` | 8 | `True` |
| `PagedAdEMAMix32bit` | `AdEMAMix32bit` | 32 | `True` |

**Stability:** Stable.
**Notes:** Takes additional `betas=(beta1, beta2, beta3)`, `alpha`, `t_alpha`, `t_beta3` params.

#### LAMB Family (2-state, `optimizer_name="lamb"`)

| Class | Parent | `optim_bits` | `is_paged` |
|-------|--------|-------------|------------|
| `LAMB` | `Optimizer2State` | configurable | `False` |
| `LAMB8bit` | `Optimizer2State` | 8 | `False` |
| `LAMB32bit` | `Optimizer2State` | 32 | `False` |

**Stability:** Stable.

#### SGD Family (1-state, `optimizer_name="momentum"`)

| Class | Parent | `optim_bits` | `is_paged` |
|-------|--------|-------------|------------|
| `SGD` | `Optimizer1State` | configurable | `False` |
| `SGD8bit` | `Optimizer1State` | 8 | `False` |
| `SGD32bit` | `Optimizer1State` | 32 | `False` |

**Stability:** Stable.

#### Adagrad Family (1-state, `optimizer_name="adagrad"`)

| Class | Parent | `optim_bits` | `is_paged` |
|-------|--------|-------------|------------|
| `Adagrad` | `Optimizer1State` | configurable | `False` |
| `Adagrad8bit` | `Optimizer1State` | 8 | `False` |
| `Adagrad32bit` | `Optimizer1State` | 32 | `False` |

**Stability:** Stable.

#### RMSprop Family (1-state, `optimizer_name="rmsprop"`)

| Class | Parent | `optim_bits` | `is_paged` |
|-------|--------|-------------|------------|
| `RMSprop` | `Optimizer1State` | configurable | `False` |
| `RMSprop8bit` | `Optimizer1State` | 8 | `False` |
| `RMSprop32bit` | `Optimizer1State` | 32 | `False` |

**Stability:** Stable.

#### LARS Family (1-state, `optimizer_name="lars"`)

| Class | Parent | `optim_bits` | `is_paged` |
|-------|--------|-------------|------------|
| `LARS` | `Optimizer1State` | configurable | `False` |
| `LARS8bit` | `Optimizer1State` | 8 | `False` |
| `LARS32bit` | `Optimizer1State` | 32 | `False` |
| `PytorchLARS` | `torch.optim.Optimizer` | N/A | N/A |

**Stability:** Stable.
**Notes:** `PytorchLARS` is a pure-PyTorch reference implementation (not quantized).

#### Lion Family (1-state, `optimizer_name="lion"`)

| Class | Parent | `optim_bits` | `is_paged` |
|-------|--------|-------------|------------|
| `Lion` | `Optimizer1State` | configurable | `False` |
| `Lion8bit` | `Optimizer1State` | 8 | `False` |
| `Lion32bit` | `Optimizer1State` | 32 | `False` |
| `PagedLion` | `Optimizer1State` | configurable | `True` |
| `PagedLion8bit` | `Optimizer1State` | 8 | `True` |
| `PagedLion32bit` | `Optimizer1State` | 32 | `True` |

**Stability:** Stable.

### 3.3 Common Optimizer Parameters

All bnb optimizers share these parameters beyond the standard PyTorch ones:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `optim_bits` | `int` | 32 | 32 for full precision state, 8 for quantized state |
| `min_8bit_size` | `int` | 4096 | Parameters smaller than this use 32-bit state even in 8-bit mode |
| `max_unorm` | `float` | 0.0 | Maximum update norm relative to weight norm. 0 = disabled |
| `skip_zeros` | `bool` | `False` | Skip zero gradients in sparse models |
| `is_paged` | `bool` | `False` | Use CUDA managed memory for state offloading |
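
The interaction between `optim_bits` and `min_8bit_size` can be written as a one-line rule. The helper below is a hypothetical sketch based only on the table's wording ("smaller than this"), so the strict `<` comparison is an assumption.

```python
def effective_state_bits(numel, optim_bits=32, min_8bit_size=4096):
    """Bit width used for one parameter's optimizer state (hypothetical helper)."""
    if optim_bits == 8 and numel < min_8bit_size:
        return 32  # small tensors keep full-precision state
    return optim_bits


print(effective_state_bits(1024, optim_bits=8))         # 32
print(effective_state_bits(4096 * 4096, optim_bits=8))  # 8
print(effective_state_bits(1024, optim_bits=32))        # 32
```

Biases and normalization weights typically fall below the default of 4096 elements, so in this picture they would keep 32-bit state even under an 8-bit optimizer.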

---

## 4. Functional API

**Import path:** `import bitsandbytes.functional as F` or `from bitsandbytes.functional import <symbol>`

### 4.1 4-Bit Quantization

#### `quantize_4bit`

```python
F.quantize_4bit(
    A: torch.Tensor,
    absmax: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize: Optional[int] = None,         # default: 64 (128 on ROCm)
    compress_statistics: bool = False,
    quant_type: str = "fp4",
    quant_storage: torch.dtype = torch.uint8,
) -> tuple[torch.Tensor, QuantState]
```

**Stability:** Stable.
**Supported dtypes:** float16, bfloat16, float32.
**Valid blocksizes:** 32, 64, 128, 256, 512, 1024, 2048, 4096.
**Quant types:** `"fp4"`, `"nf4"`.
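
To make the idea of codebook-based 4-bit quantization concrete, here is a pure-Python sketch for a single block. It is illustrative only: the uniform codebook is not the library's `fp4` or `nf4` table, the high-nibble-first packing order is an assumption, and both helper names are hypothetical.

```python
def quantize_block_4bit(block, codebook):
    """Quantize one block of floats to 4-bit codebook indices.

    Returns (packed_bytes, absmax). Assumes len(block) is even.
    """
    absmax = max(abs(x) for x in block) or 1.0
    indices = []
    for x in block:
        normalized = x / absmax  # now in [-1, 1]
        indices.append(min(range(16), key=lambda i: abs(codebook[i] - normalized)))
    # Two 4-bit indices per byte; high nibble first is an assumption here.
    packed = [(indices[i] << 4) | indices[i + 1] for i in range(0, len(indices), 2)]
    return packed, absmax


def dequantize_block_4bit(packed, absmax, codebook):
    """Unpack two indices per byte and rescale the code values by absmax."""
    values = []
    for byte in packed:
        values.append(codebook[byte >> 4] * absmax)
        values.append(codebook[byte & 0x0F] * absmax)
    return values


# Illustrative uniform codebook, NOT the library's fp4 or nf4 values.
codebook = [-1.0 + 2.0 * i / 15 for i in range(16)]
block = [0.1, 0.5, -2.0, 2.0]
packed, absmax = quantize_block_4bit(block, codebook)
restored = dequantize_block_4bit(packed, absmax, codebook)
print(packed, absmax)  # [137, 15] 2.0
print(restored)
```

Four floats become two bytes plus one scale, which is where the memory saving comes from; the per-block `absmax` values are what a `QuantState` would carry alongside the packed data.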

#### `dequantize_4bit`

```python
F.dequantize_4bit(
    A: torch.Tensor,
    quant_state: Optional[QuantState] = None,
    absmax: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize: Optional[int] = None,
    quant_type: str = "fp4",
) -> torch.Tensor
```

**Stability:** Stable.

#### `quantize_fp4` / `quantize_nf4`

Convenience wrappers that call `quantize_4bit` with `quant_type` fixed to `"fp4"` or `"nf4"` respectively.
**Stability:** Stable.

#### `dequantize_fp4` / `dequantize_nf4`

Convenience wrappers that call `dequantize_4bit` with `quant_type` fixed to `"fp4"` or `"nf4"` respectively.
**Stability:** Stable.

#### `get_4bit_type`

```python
F.get_4bit_type(typename: str, device=None, blocksize=64) -> torch.Tensor
```

Returns a 16-element codebook tensor for the given type name.
**Valid typenames:** `"nf4"`, `"fp4"`, `"int4"`, `"af4"` (af4 only supports blocksize 64).
**Stability:** Stable.

### 4.2 Blockwise (8-bit) Quantization

#### `quantize_blockwise`

```python
F.quantize_blockwise(
    A: torch.Tensor,
    code: Optional[torch.Tensor] = None,
    absmax: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize: int = 4096,
    nested: bool = False,
) -> tuple[torch.Tensor, QuantState]
```

**Stability:** Stable.
**Supported dtypes:** float16, bfloat16, float32.
**Valid blocksizes:** 64, 128, 256, 512, 1024, 2048, 4096.
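
Blockwise quantization keeps one scale per block, so the number of `absmax` entries follows from the tensor size and `blocksize`. The sketch below shows that bookkeeping on a flat list; it uses a block size of 4 purely for readability (not one of the valid sizes listed above), and `blockwise_absmax` is a hypothetical helper.

```python
import math


def blockwise_absmax(values, blocksize):
    """Split a flat list into consecutive blocks and return one absmax per block."""
    n_blocks = math.ceil(len(values) / blocksize)
    return [
        max(abs(x) for x in values[b * blocksize:(b + 1) * blocksize])
        for b in range(n_blocks)
    ]


values = [0.5, -3.0, 1.0, 0.25, -0.75, 2.0, 0.1]  # 7 elements, last block is partial
print(blockwise_absmax(values, blocksize=4))       # [3.0, 2.0]
```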

#### `dequantize_blockwise`

```python
F.dequantize_blockwise(
    A: torch.Tensor,
    quant_state: Optional[QuantState] = None,
    absmax: Optional[torch.Tensor] = None,
    code: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize: int = 4096,
    nested: bool = False,
) -> torch.Tensor
```

**Stability:** Stable.

### 4.3 Int8 Operations

#### `int8_vectorwise_quant`

```python
F.int8_vectorwise_quant(
    A: torch.Tensor,
    threshold: float = 0.0,
) -> tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
```

Returns `(quantized_int8, row_stats, outlier_cols_or_None)`.
**Stability:** Stable.
**Notes:** When `threshold > 0.0`, returns outlier column indices. This is the core of LLM.int8() decomposition.

#### `int8_vectorwise_dequant`

```python
F.int8_vectorwise_dequant(
    A: torch.Tensor,         # int8
    stats: torch.Tensor,     # float32 row stats
) -> torch.Tensor            # float32
```

**Stability:** Stable.

#### `int8_double_quant`

```python
F.int8_double_quant(
    A: torch.Tensor,
    threshold: float = 0.0,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
```

Returns `(out_row, out_col, row_stats, col_stats, outlier_cols)`.
Performs both row-wise and column-wise int8 quantization simultaneously.
**Stability:** Stable.
**Notes:** Used in the backward pass of `MatMul8bitLt` when weight gradients are needed.

#### `int8_linear_matmul`

```python
F.int8_linear_matmul(
    A: torch.Tensor,
    B: torch.Tensor,
    out: Optional[torch.Tensor] = None,
    dtype: torch.dtype = torch.int32,
) -> torch.Tensor
```

Int8 matrix multiplication: `A @ B.T` where both A and B are int8.
Returns int32 result.
**Stability:** Stable.

#### `int8_mm_dequant`

```python
F.int8_mm_dequant(
    A: torch.Tensor,              # int32 matmul result
    row_stats: torch.Tensor,      # float32
    col_stats: torch.Tensor,      # float32
    dtype: torch.dtype = torch.float16,
    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor
```

Dequantizes the int32 result of int8 matmul using row and column statistics.
**Stability:** Stable.
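
The three int8 steps above (quantize, integer matmul, dequantize) fit together as follows. This is a pure-Python sketch under the assumption that both operands are quantized row-wise to the range [-127, 127], so the combined scale is `row_stat * col_stat / (127 * 127)`; the helper names are hypothetical and no library call is made.

```python
def quantize_rows(rows):
    """Row-wise absmax quantization to the int8 range; returns (codes, absmax)."""
    codes, stats = [], []
    for row in rows:
        absmax = max(abs(x) for x in row) or 1.0
        codes.append([round(x * 127.0 / absmax) for x in row])
        stats.append(absmax)
    return codes, stats


def int_matmul_bt(A, B):
    """Integer A @ B.T; Python ints stand in for the int32 accumulator."""
    return [[sum(a * b for a, b in zip(row_a, row_b)) for row_b in B] for row_a in A]


def mm_dequant(C, row_stats, col_stats):
    """Undo both scales: each factor of 127 comes from one quantized operand."""
    return [
        [c * row_stats[i] * col_stats[j] / (127.0 * 127.0) for j, c in enumerate(row)]
        for i, row in enumerate(C)
    ]


A = [[1.0, -2.0], [0.5, 0.25]]  # activations, one row per token
W = [[0.1, 0.2], [-0.3, 0.4]]   # weights, one row per output feature
CA, SCA = quantize_rows(A)
CW, SCW = quantize_rows(W)
approx = mm_dequant(int_matmul_bt(CA, CW), SCA, SCW)
exact = [[sum(a * w for a, w in zip(ra, rw)) for rw in W] for ra in A]
print(approx)
print(exact)
```

The weight's per-row statistics play the role of `col_stats` here because row `j` of `W` produces column `j` of the output.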

### 4.4 QuantState

```python
class QuantState:    # exposed as F.QuantState
    valid_quant_types = ("fp4", "nf4")

    def __init__(self, absmax, shape=None, code=None, blocksize=None,
                 quant_type=None, dtype=None, offset=None, state2=None): ...

    @classmethod
    def from_dict(cls, qs_dict: dict, device: torch.device) -> QuantState: ...

    def as_dict(self, packed=False) -> dict: ...

    def to(self, device): ...

    def __eq__(self, other) -> bool: ...

    def __getitem__(self, idx): ...    # backward compatibility with list-based state
```

**Stability:** Stable — essential for serialization of quantized weights.
**Key attributes:**
- `absmax` — Per-block scaling factors
- `shape` — Original tensor shape
- `code` — Quantization codebook (16 values for 4-bit)
- `blocksize` — Block size used for quantization
- `quant_type` — `"fp4"` or `"nf4"`
- `dtype` — Original tensor dtype
- `offset` — Mean of absmax (used in double quantization / `compress_statistics`)
- `state2` — Nested QuantState for doubly-quantized absmax
- `nested` — `True` if `state2` is not None
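
The roles of `offset` and `state2` are easier to follow with a toy version of double quantization: the per-block scales are centred by their mean and then quantized themselves. The sketch below substitutes a plain linear int8 code for the second stage, which is an assumption made for brevity; both function names are hypothetical.

```python
def compress_absmax(absmax):
    """Quantize per-block scales to 8 bits after removing their mean.

    Returns (offset, codes, scale): offset is the mean, codes are ints in
    [-127, 127], and scale is the absmax of the centred values.
    """
    offset = sum(absmax) / len(absmax)
    centred = [a - offset for a in absmax]
    scale = max(abs(c) for c in centred) or 1.0
    codes = [round(c / scale * 127) for c in centred]
    return offset, codes, scale


def decompress_absmax(offset, codes, scale):
    """Invert compress_absmax up to rounding error."""
    return [code / 127 * scale + offset for code in codes]


absmax = [0.8, 1.0, 1.2, 3.0]
offset, codes, scale = compress_absmax(absmax)
restored = decompress_absmax(offset, codes, scale)
print(offset, codes)
print(restored)
```

In this picture `offset` corresponds to the stored mean and `(codes, scale)` to the information a nested `state2` would hold.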

### 4.5 Quantization Map Constructors

#### `create_dynamic_map`

```python
F.create_dynamic_map(signed=True, max_exponent_bits=7, total_bits=8) -> torch.Tensor
```

Creates a 256-element dynamic quantization codebook. This is the default
codebook used by blockwise quantization.
**Stability:** Stable.

#### `create_normal_map`

```python
F.create_normal_map(offset=0.9677083, use_extra_value=True) -> torch.Tensor
```

Creates the NF4 quantization codebook (16 values + padding to 256).
**Stability:** Stable.
**Notes:** Requires scipy for the `norm.ppf` call. The hardcoded NF4 values in
`get_4bit_type("nf4")` avoid this dependency at runtime.

#### `create_fp8_map`

```python
F.create_fp8_map(signed=True, exponent_bits=5, precision_bits=2, total_bits=8) -> torch.Tensor
```

Creates a floating-point quantization codebook. Despite the name, it works for
any `total_bits` (including FP4 with `total_bits=4`).
**Stability:** Stable.

#### `create_linear_map`

```python
F.create_linear_map(signed=True, total_bits=8, add_zero=True) -> torch.Tensor
```

Creates a uniform linear quantization codebook.
**Stability:** Stable.
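
As a rough picture of a uniform codebook, the sketch below builds evenly spaced values in plain Python. It is not `create_linear_map` itself: the choice of an odd number of signed values (so that zero is representable) is an assumption, and any padding the library applies to reach 256 entries is omitted.

```python
def linear_codebook(total_bits=8, signed=True):
    """Evenly spaced code values on [-1, 1] (signed) or [0, 1] (unsigned)."""
    if signed:
        n = 2 ** total_bits - 1  # odd count, so 0.0 is one of the values
        half = n // 2
        return [k / half for k in range(-half, half + 1)]
    n = 2 ** total_bits
    return [k / (n - 1) for k in range(n)]


codes = linear_codebook(total_bits=4)
print(len(codes), codes[0], codes[7], codes[-1])  # 15 -1.0 0.0 1.0
```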

### 4.6 4-Bit GEMV

#### `gemv_4bit`

```python
F.gemv_4bit(
    A: torch.Tensor,
    B: torch.Tensor,
    out: Optional[torch.Tensor] = None,
    transposed_A: bool = False,
    transposed_B: bool = False,
    state: Optional[QuantState] = None,    # required
) -> torch.Tensor
```

Efficient matrix-vector product with 4-bit quantized weight matrix.
Used for single-batch inference in `matmul_4bit`.
**Stability:** Stable.
**Supported dtypes for A:** float16, bfloat16, float32.

### 4.7 Optimizer Update Functions

#### `optimizer_update_32bit`

```python
F.optimizer_update_32bit(
    optimizer_name: str, g: Tensor, p: Tensor, state1: Tensor,
    beta1: float, eps: float, step: int, lr: float,
    state2: Optional[Tensor] = None, beta2: float = 0.0,
    beta3: float = 0.0, alpha: float = 0.0,
    weight_decay: float = 0.0, gnorm_scale: float = 1.0,
    unorm_vec: Optional[Tensor] = None, max_unorm: float = 0.0,
    skip_zeros: bool = False,
) -> None
```

In-place optimizer step with 32-bit state.
**Stability:** Stable.
**Valid optimizer names:** `"adam"`, `"momentum"`, `"rmsprop"`, `"lion"`, `"adagrad"`, `"ademamix"`, `"lamb"`, `"lars"`.
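
For orientation, this is what an in-place step looks like for the `"adam"` case, written over plain Python lists. It is a textbook Adam update with bias correction and decoupled weight decay, offered as a sketch; the library's kernels may differ in details such as gradient scaling, `max_unorm` clipping, and how weight decay is applied.

```python
import math


def adam_step(p, g, state1, state2, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """Update p, state1 (first moment) and state2 (second moment) in place."""
    bias1 = 1.0 - beta1 ** step
    bias2 = 1.0 - beta2 ** step
    for i in range(len(p)):
        state1[i] = beta1 * state1[i] + (1.0 - beta1) * g[i]
        state2[i] = beta2 * state2[i] + (1.0 - beta2) * g[i] * g[i]
        update = (state1[i] / bias1) / (math.sqrt(state2[i] / bias2) + eps)
        # Decoupled weight decay is an assumption of this sketch.
        p[i] = p[i] * (1.0 - lr * weight_decay) - lr * update


params = [1.0, -1.0]
grads = [0.5, -0.25]
m = [0.0, 0.0]
v = [0.0, 0.0]
adam_step(params, grads, m, v, step=1, lr=0.1)
print(params)  # close to [0.9, -0.9]
```

Like the functions documented here, the sketch returns nothing and mutates the parameter and both state buffers.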

#### `optimizer_update_8bit_blockwise`

```python
F.optimizer_update_8bit_blockwise(
    optimizer_name: str, g: Tensor, p: Tensor,
    state1: Tensor, state2: Optional[Tensor],
    beta1: float, beta2: float, beta3: float, alpha: float,
    eps: float, step: int, lr: float,
    qmap1: Tensor, qmap2: Optional[Tensor],
    absmax1: Tensor, absmax2: Optional[Tensor],
    weight_decay: float = 0.0, gnorm_scale: float = 1.0,
    skip_zeros: bool = False,
) -> None
```

In-place optimizer step with 8-bit blockwise-quantized state.
**Stability:** Stable.

### 4.8 Integer GEMM

#### `igemm`

```python
F.igemm(
    A: Tensor, B: Tensor, out: Optional[Tensor] = None,
    transposed_A: bool = False, transposed_B: bool = False,
) -> torch.Tensor
```

Int8 matrix multiplication via cuBLAS igemm.
**Stability:** Stable (internal, used by the library).

#### `batched_igemm`

```python
F.batched_igemm(
    A: Tensor, B: Tensor, out: Optional[Tensor] = None,
    transposed_A: bool = False, transposed_B: bool = False,
) -> torch.Tensor
```

Batched int8 matrix multiplication.
**Stability:** Stable (internal).

### 4.9 Paged Memory

#### `get_paged`

```python
F.get_paged(*shape, dtype=torch.float32, device=FIRST_CUDA_DEVICE) -> torch.Tensor
```

Allocates a CUDA managed-memory tensor.
**Stability:** Stable (internal, used by paged optimizers).

#### `prefetch_tensor`

```python
F.prefetch_tensor(A: torch.Tensor, to_cpu: bool = False) -> None
```

Prefetch a paged tensor to GPU or CPU.
**Stability:** Stable (internal).

### 4.10 CPU-Specific Functions

#### `_convert_weight_packed_for_cpu`

```python
F._convert_weight_packed_for_cpu(
    qweight: torch.Tensor, quant_state: QuantState, block_n: int = 32,
) -> tuple[torch.Tensor, QuantState]
```

Converts 4-bit quantized weights to a packed format optimized for CPU AVX512BF16 inference.
**Stability:** Internal (prefixed with `_`).

#### `_convert_weight_packed_for_cpu_inverse`

```python
F._convert_weight_packed_for_cpu_inverse(
    qweight: torch.Tensor, quant_state: QuantState,
) -> tuple[torch.Tensor, QuantState]
```

Reverses the CPU packing format.
**Stability:** Internal (prefixed with `_`).

#### `has_avx512bf16`

```python
F.has_avx512bf16() -> bool
```

Detects AVX512BF16 CPU support.
**Stability:** Internal but may be useful externally.

### 4.11 Utility Functions

#### `is_on_gpu`

```python
F.is_on_gpu(tensors: Iterable[Optional[torch.Tensor]]) -> bool
```

Verifies that all tensors are on the same GPU; raises `RuntimeError` otherwise.
**Stability:** Stable (internal validation).

#### `get_ptr`

```python
F.get_ptr(A: Optional[Tensor]) -> Optional[ct.c_void_p]
```

Gets the data pointer of a tensor for ctypes calls.
**Stability:** Internal.

### 4.12 Singleton Managers

#### `GlobalPageManager`

```python
F.GlobalPageManager.get_instance() -> GlobalPageManager
```

Manages paged tensors for prefetching.
**Stability:** Internal.

#### `CUBLAS_Context`

```python
F.CUBLAS_Context.get_instance() -> CUBLAS_Context
```

Manages cuBLAS context handles per device.
**Stability:** Internal.

---

## 5. Autograd Functions

**Import path:** `from bitsandbytes.autograd._functions import <symbol>`

Top-level re-exports: `bnb.matmul`, `bnb.matmul_4bit`, `bnb.MatmulLtState`.

### `MatmulLtState` — State container for 8-bit matmul

```python
@dataclass
class MatmulLtState:
    CB: Optional[torch.Tensor] = None
    SCB: Optional[torch.Tensor] = None
    threshold: float = 0.0
    has_fp16_weights: bool = True
    is_training: bool = True
    use_pool: bool = False
    ...
```

**Stability:** Stable.
**Key fields:**
- `CB` / `SCB` — Quantized weight and scale columns
- `threshold` — Outlier threshold for mixed-precision decomposition
- `has_fp16_weights` — Whether weights are stored in fp16 or int8
- `is_training` — Switches between training and inference code paths

### `matmul` — 8-bit matrix multiplication

```python
bnb.matmul(
    A: torch.Tensor,
    B: torch.Tensor,
    out: Optional[torch.Tensor] = None,
    state: Optional[MatmulLtState] = None,
    threshold: float = 0.0,
    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor
```

**Stability:** Stable.
**Dispatches to:**
- `MatMul8bitFp` on CPU/XPU during training (faster path, no quantized grad computation)
- `MatMul8bitLt` elsewhere (full quantized matmul with backward support)

### `matmul_4bit` — 4-bit matrix multiplication

```python
bnb.matmul_4bit(
    A: torch.Tensor,
    B: torch.Tensor,
    quant_state: F.QuantState,
    out: Optional[torch.Tensor] = None,
    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor
```

**Stability:** Stable.
**Dispatches to:**
- `F.gemv_4bit` for single-batch inference (fast path, no autograd)
- `MatMul4Bit.apply` for batched/training (autograd-enabled, dequant + torch.matmul)
- CPU path supports packed weight format for AVX512BF16
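
The dispatch rule can be sketched as a small decision function. The condition used below (the input holds a single vector and does not require gradients) is an assumption about what "single-batch inference" means here, and `choose_4bit_path` is a hypothetical helper that only returns the name of the path.

```python
def choose_4bit_path(input_shape, requires_grad):
    """Hypothetical helper naming the path a 4-bit matmul would take."""
    numel = 1
    for dim in input_shape:
        numel *= dim
    single_vector = numel == input_shape[-1]  # every leading dim equals 1
    if single_vector and not requires_grad:
        return "gemv_4bit"
    return "MatMul4Bit"


print(choose_4bit_path((1, 1, 4096), requires_grad=False))    # gemv_4bit
print(choose_4bit_path((8, 128, 4096), requires_grad=False))  # MatMul4Bit
print(choose_4bit_path((1, 1, 4096), requires_grad=True))     # MatMul4Bit
```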

### Internal autograd classes

| Class | Description | Stability |
|-------|-------------|-----------|
| `MatMul8bitLt` | Full 8-bit matmul with backward for weight and input grad | Internal |
| `MatMul8bitFp` | Dequant + matmul path for CPU/XPU training | Internal |
| `MatMul4Bit` | Dequant + matmul with backward for 4-bit weights | Internal |
| `GlobalOutlierPooler` | Pools outlier dimensions across layers | Internal |

---

## 6. Torch Custom Ops

**Module:** `bitsandbytes._ops`

These are defined via `torch.library.define` and provide the contract between
the functional API and backend implementations. Each op has a `register_fake`
implementation for `torch.compile` / FX tracing.

### Op Schema Table

| Op Name | Signature | Description |
|---------|-----------|-------------|
| `bitsandbytes::int8_mixed_scaled_mm` | `(A, CA, CB, SCA, SCB, outlier_cols?, bias?) -> (Tensor, Tensor?)` | Int8 matmul with mixed-precision outlier handling |
| `bitsandbytes::int8_scaled_mm` | `(A, B, row_stats, col_stats, bias?, dtype?) -> Tensor` | Int8 matmul + dequant + bias |
| `bitsandbytes::int8_linear_matmul` | `(A, B) -> Tensor` | Raw int8 matmul (A, B are int8, result is int32) |
| `bitsandbytes::int8_linear_matmul.out` | `(A, B, out!) -> ()` | Variant that writes the result into a preallocated `out` tensor |
| `bitsandbytes::int8_vectorwise_quant` | `(A, threshold=0.0) -> (Tensor, Tensor, Tensor?)` | Row-wise int8 quantization with optional outlier extraction |
| `bitsandbytes::int8_vectorwise_dequant` | `(A, stats) -> Tensor` | Row-wise int8 dequantization |
| `bitsandbytes::int8_mm_dequant` | `(A, row_stats, col_stats, dtype?, bias?) -> Tensor` | Dequantize int32 matmul result |
| `bitsandbytes::int8_double_quant` | `(A, threshold=0.0) -> (Tensor, Tensor, Tensor, Tensor, Tensor?)` | Simultaneous row and column quantization |
| `bitsandbytes::quantize_4bit` | `(A, blocksize, quant_type, quant_storage) -> (Tensor, Tensor)` | 4-bit blockwise quantization |
| `bitsandbytes::dequantize_4bit` | `(A, absmax, blocksize, quant_type, shape, dtype) -> Tensor` | 4-bit blockwise dequantization |
| `bitsandbytes::dequantize_4bit.out` | `(A, absmax, blocksize, quant_type, shape, dtype, out!) -> ()` | Variant that writes the result into a preallocated `out` tensor |
| `bitsandbytes::quantize_blockwise` | `(A, code, blocksize) -> (Tensor, Tensor)` | 8-bit blockwise quantization |
| `bitsandbytes::dequantize_blockwise` | `(A, absmax, code, blocksize, dtype) -> Tensor` | 8-bit blockwise dequantization |
| `bitsandbytes::dequantize_blockwise.out` | `(A, absmax, code, blocksize, dtype, out!) -> ()` | Variant that writes the result into a preallocated `out` tensor |
| `bitsandbytes::gemv_4bit` | `(A, B, shapeB, absmax, code, blocksize) -> Tensor` | 4-bit GEMV (matrix-vector product) |
| `bitsandbytes::gemv_4bit.out` | `(A, B, shapeB, absmax, code, blocksize, out!) -> ()` | Variant that writes the result into a preallocated `out` tensor |
| `bitsandbytes::optimizer_update_32bit` | `(name, g!, p!, state1!, state2!?, ...) -> ()` | 32-bit optimizer step |
| `bitsandbytes::optimizer_update_8bit_blockwise` | `(name, g!, p!, state1!, state2!?, ...) -> ()` | 8-bit blockwise optimizer step |

**Stability:** Semi-public. The op schemas are the most important stability contract in
the codebase — changing a schema breaks all backend implementations.

### Default Implementations

`int8_vectorwise_dequant` has a default PyTorch-native implementation registered in `_ops.py`
itself (simple `A * stats * (1/127)`). All other ops must be implemented by backends.
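
The formula `A * stats * (1/127)` is simple enough to check by hand. The sketch below applies it to nested Python lists with one statistic per row; `int8_rowwise_dequant` is a hypothetical name, and no tensors or library calls are involved.

```python
def int8_rowwise_dequant(codes, stats):
    """Map int8 codes back to floats: value = code * row_absmax * (1 / 127)."""
    return [[c * s * (1.0 / 127.0) for c in row] for row, s in zip(codes, stats)]


codes = [[127, -64, 0], [-127, 32, 64]]
stats = [2.0, 0.5]
out = int8_rowwise_dequant(codes, stats)
print(out[0][0], out[1][0])  # approximately 2.0 and -0.5
```

A code of 127 or -127 maps back to the row's absmax, which is why the statistic alone is enough to undo the scaling.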

---

## 7. Research / Experimental

**Import path:** `from bitsandbytes.research import <symbol>`

### Research Functions

```python
from bitsandbytes.research import matmul_fp8_global, matmul_fp8_mixed, switchback_bnb
```

#### `matmul_fp8_global`

```python
bitsandbytes.research.matmul_fp8_global(
    A, B, fw_code, bw_code, bsz, bsz2,
) -> torch.Tensor
```

FP8 matmul with global quantization.
**Stability:** Experimental.

#### `matmul_fp8_mixed`

```python
bitsandbytes.research.matmul_fp8_mixed(
    A, B, fw_code, bw_code, bsz, bsz2,
) -> torch.Tensor
```

FP8 matmul with mixed (row-wise) quantization.
**Stability:** Experimental.

#### `switchback_bnb`

```python
bitsandbytes.research.switchback_bnb(
    A, B, out=None, bias=None, state=MatmulLtState,
) -> torch.Tensor
```

SwitchBack-style matmul using bnb backend.
**Stability:** Experimental.

### Research NN Modules

```python
from bitsandbytes.research.nn import LinearFP8Mixed, LinearFP8Global
```

#### `LinearFP8Mixed` / `LinearFP8Global`

```python
bitsandbytes.research.nn.LinearFP8Mixed(input_features, output_features, bias=True)
bitsandbytes.research.nn.LinearFP8Global(input_features, output_features, bias=True)
```

**Parent:** `torch.nn.Linear`
**Stability:** Experimental.
**Notes:** Automatically select block sizes based on feature dimensions. Use FP8
quantization maps created via `create_fp8_map`.

---

## 8. Utilities

**Import path:** `from bitsandbytes.utils import <symbol>`

| Symbol | Type | Description | Stability |
|--------|------|-------------|-----------|
| `replace_linear` | function | Recursively replace `nn.Linear` modules in a model | Stable |
| `OutlierTracer` | class (singleton) | Traces outlier dimensions across linear layers | Experimental |
| `find_outlier_dims` | function | Find outlier dimensions via z-score or top-k | Experimental |
| `outlier_hook` | function | Forward pre-hook for `OutlierTracer` | Internal |
| `pack_dict_to_tensor` | function | Pack a dict into a uint8 tensor (for safetensors) | Stable (internal) |
| `unpack_tensor_to_dict` | function | Unpack uint8 tensor back to dict | Stable (internal) |
| `execute_and_return` | function | Run a shell command and return stdout/stderr | Internal |
| `sync_gpu` | function | Synchronize CUDA/XPU device | Internal |
| `LINEAR_8BIT_WEIGHTS_FORMAT_MAPPING` | dict | Maps format names to int codes | Stable (internal) |
| `INVERSE_LINEAR_8BIT_WEIGHTS_FORMAT_MAPPING` | dict | Reverse mapping | Stable (internal) |

### `replace_linear`

```python
bitsandbytes.utils.replace_linear(
    model: torch.nn.Module,
    linear_replacement: type,
    skip_modules: tuple = ("lm_head",),
    copy_weights: bool = False,
    post_processing_function: Optional[str] = None,
) -> torch.nn.Module
```

**Stability:** Stable — commonly used by integrations.

---

## 9. Native Library Interface

**Module:** `bitsandbytes.cextension`

### Classes

| Class | Description |
|-------|-------------|
| `BNBNativeLibrary` | Base wrapper for the ctypes-loaded native library |
| `CudaBNBNativeLibrary` | CUDA-specific subclass (sets up context/managed ptr) |
| `ErrorHandlerMockBNBNativeLibrary` | Fallback mock that defers error messages to call time |

### Module-level symbols

| Symbol | Type | Description |
|--------|------|-------------|
| `lib` | `BNBNativeLibrary` | The loaded native library instance |
| `BNB_BACKEND` | `str` | `"CUDA"`, `"ROCm"`, `"XPU"`, or `"CPU"` |
| `HIP_ENVIRONMENT` | `bool` | `True` if running on ROCm |
| `ROCM_GPU_ARCH` | `str` or `None` | e.g., `"gfx90a"` |
| `ROCM_WARP_SIZE_64` | `bool` | `True` if ROCm warp size is 64 |

**Stability:** Internal — but `lib` is used extensively by `functional.py` for ctypes calls.

---

## 10. Backend System

**Module:** `bitsandbytes.backends`

Backends provide device-specific implementations of the ops defined in `_ops.py`.
Each backend registers kernels via `@register_kernel("bitsandbytes::<op_name>", "<device>")`.

### Backend → Op Coverage Matrix

| Op | `default` | `cuda` | `cpu` | `xpu` | `hpu` | `triton` |
|----|-----------|--------|-------|-------|-------|----------|
| `int8_linear_matmul` | Yes | Yes | Yes | Yes | — | — |
| `int8_linear_matmul.out` | Yes | Yes | — | — | — | — |
| `int8_vectorwise_quant` | Yes | Yes | — | — | — | — |
| `int8_vectorwise_dequant` | (in _ops.py) | — | — | — | — | — |
| `int8_mm_dequant` | Yes | Yes | — | — | — | — |
| `int8_mixed_scaled_mm` | Yes | — | — | — | — | — |
| `int8_scaled_mm` | Yes | — | — | — | — | — |
| `int8_double_quant` | — | Yes | — | — | — | — |
| `quantize_blockwise` | Yes | Yes | Yes | Yes | — | Yes |
| `dequantize_blockwise` | Yes | Yes | Yes | Yes | — | Yes |
| `dequantize_blockwise.out` | — | Yes | — | Yes | — | — |
| `quantize_4bit` | Yes | Yes | — | Yes | — | Yes |
| `dequantize_4bit` | Yes | Yes | Yes | Yes | Yes | Yes |
| `dequantize_4bit.out` | — | Yes | — | Yes | — | Yes |
| `gemv_4bit` | Yes | Yes | Yes | Yes | — | Yes |
| `gemv_4bit.out` | — | Yes | — | Yes | — | — |
| `optimizer_update_32bit` | Yes | Yes | — | Yes | — | Yes |
| `optimizer_update_8bit_blockwise` | — | Yes | — | Yes | — | Yes |

**Notes:**
- `default` backend is pure PyTorch (no native code), registered for any device
- `cuda` backend uses ctypes calls to the native CUDA/HIP library
- `cpu` backend uses ctypes calls to the CPU native library (limited coverage)
- `xpu` backend uses triton kernels when available, ctypes fallback otherwise
- `hpu` backend only covers `dequantize_4bit` (Intel Gaudi)
- `triton` backend is not registered directly; XPU imports its implementations

### External Backend Entry Points

Third-party packages can register backends via the `bitsandbytes.backends` entry point
group in their `pyproject.toml`. This is how the MPS (Apple Silicon) backend is expected
to be distributed.
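
A minimal sketch of how entry-point discovery for this group could look, using only the standard library. The function name `load_backend_plugins` is hypothetical, and the sketch assumes that importing an entry point's target is enough for a plugin to register its kernels. It is not the loader bitsandbytes itself uses.

```python
from importlib.metadata import entry_points


def load_backend_plugins(group: str = "bitsandbytes.backends") -> list:
    """Import every plugin advertised under `group` and return the loaded objects."""
    eps = entry_points()
    # Python 3.10+ exposes .select(); older versions return a dict keyed by group.
    selected = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    loaded = []
    for ep in selected:
        # Loading imports the target module, which is where registration would happen.
        loaded.append(ep.load())
    return loaded
```

A third-party package would advertise itself under the same group name in the entry-points table of its `pyproject.toml`. With no such package installed, the function returns an empty list.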

---

## 11. Deprecated Symbols

These symbols are marked with `@deprecated` and emit `FutureWarning`. They will be
removed in a future release.

| Symbol | Module | Replacement |
|--------|--------|-------------|
| `quantize` | `functional` | `quantize_blockwise` |
| `dequantize` | `functional` | `dequantize_blockwise` |
| `quantize_no_absmax` | `functional` | `quantize_blockwise` |
| `dequantize_no_absmax` | `functional` | `dequantize_blockwise` |
| `optimizer_update_8bit` | `functional` | `optimizer_update_8bit_blockwise` |
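
The deprecation pattern described above can be sketched with the standard `warnings` module. The decorator name `deprecated_in_favor_of` and the placeholder function body are assumptions for illustration; the library's own `@deprecated` decorator may differ in signature and message.

```python
import functools
import warnings


def deprecated_in_favor_of(replacement: str):
    """Hypothetical decorator: emit a FutureWarning, then call the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated; use {replacement} instead.",
                FutureWarning,
                stacklevel=2,  # point the warning at the caller, not at this wrapper
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated_in_favor_of("quantize_blockwise")
def quantize(values):
    # Placeholder body standing in for the real implementation.
    return list(values)


# Record warnings instead of printing them, as a test for deprecated APIs might.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = quantize([1, 2, 3])
```

The deprecated function still returns its normal result. Only the warning signals that callers should migrate.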

---

## 12. Downstream Integration Points

These are the specific API surfaces that downstream libraries (transformers, PEFT,
accelerate, etc.) depend on. Changes here have the highest breakage risk.

### Used by HuggingFace `transformers`

- `bnb.nn.Linear4bit` — Instantiated by `BitsAndBytesConfig(load_in_4bit=True)`
- `bnb.nn.Linear8bitLt` — Instantiated by `BitsAndBytesConfig(load_in_8bit=True)`
- `bnb.nn.Params4bit` — Used for weight loading and quantization
- `bnb.nn.Int8Params` — Used for weight loading and quantization
- `bnb.nn.Params4bit.from_prequantized()` — Loading pre-quantized weights
- `bnb.functional.QuantState` — Serialization/deserialization of quant states
- `bnb.functional.QuantState.from_dict()` / `.as_dict()` — State dict handling
- `bnb.features` — Feature detection (`"multi_backend"` in `bnb.features`)
- `bnb.supported_torch_devices` — Device support detection
- `bnb.__version__` — Version checks
- `bnb.utils.replace_linear` — Model conversion

### Used by PEFT / LoRA

- `bnb.nn.Linear4bit` — Base layer for QLoRA adapters
- `bnb.nn.Params4bit` — Parameter type checks
- `bnb.nn.Linear8bitLt` — Base layer for 8-bit LoRA

### Used by `accelerate`

- `bnb.optim.*` — Paged optimizers for DeepSpeed/FSDP
- `Optimizer8bit.state_dict()` / `load_state_dict()` — FSDP compatibility

### Integration Contract Summary

A PR that changes any of these symbols MUST consider downstream impact:

1. **`Linear4bit` constructor signature** — changing defaults breaks `BitsAndBytesConfig`
2. **`Params4bit.__new__` signature** — changing parameter order breaks weight loading
3. **`QuantState` serialization format** — changes break loading saved models
4. **Op schemas in `_ops.py`** — changes break ALL backend implementations
5. **`features` / `supported_torch_devices`** — changes break feature detection in transformers
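
To make item 5 concrete, here is a sketch of the kind of defensive feature check a downstream library might perform. The function `supports_multi_backend` is hypothetical, and the `SimpleNamespace` objects are stand-ins for the imported `bitsandbytes` module; a real caller would pass the module itself.

```python
from types import SimpleNamespace


def supports_multi_backend(bnb_module) -> bool:
    """Return True if the module advertises the "multi_backend" feature flag."""
    # Assumption: older releases may not define `features`, so default to an empty set.
    return "multi_backend" in getattr(bnb_module, "features", set())


# Stand-ins for two library versions.
new_style = SimpleNamespace(features={"multi_backend"}, supported_torch_devices={"cuda", "cpu"})
old_style = SimpleNamespace()

new_ok = supports_multi_backend(new_style)
old_ok = supports_multi_backend(old_style)
```

Because callers probe by membership, renaming or removing a flag silently flips such checks to `False` instead of raising, which is why these symbols count as part of the contract.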

---

## 13. Stability Tiers

### Tier 1: Stable Public API (breaking changes require deprecation cycle)

- `bnb.nn.Linear4bit`, `LinearFP4`, `LinearNF4`
- `bnb.nn.Linear8bitLt`
- `bnb.nn.Params4bit`, `Int8Params`
- `bnb.nn.Embedding`, `StableEmbedding`, `Embedding4bit`, `Embedding8bit`, `EmbeddingFP4`, `EmbeddingNF4`
- `bnb.functional.quantize_4bit`, `dequantize_4bit`
- `bnb.functional.quantize_blockwise`, `dequantize_blockwise`
- `bnb.functional.QuantState` (including serialization format)
- `bnb.functional.int8_vectorwise_quant`, `int8_double_quant`, `int8_mm_dequant`
- `bnb.matmul`, `bnb.matmul_4bit`, `bnb.MatmulLtState`
- All optimizer classes in `bnb.optim.*`
- `bnb.optim.GlobalOptimManager`
- `bnb.utils.replace_linear`
- `bnb.features`, `bnb.supported_torch_devices`, `bnb.__version__`

### Tier 2: Semi-Public (may change between minor versions)

- Op schemas in `_ops.py` (stable within a minor version, but may evolve)
- `bnb.functional.create_*_map` functions
- `bnb.functional.get_4bit_type`
- `bnb.functional.gemv_4bit`
- `bnb.functional.int8_linear_matmul`
- `bnb.functional.igemm`, `batched_igemm`
- Backend registration system (`register_kernel` pattern)
- `Optimizer8bit`, `Optimizer1State`, `Optimizer2State` base classes

### Tier 3: Experimental (may change or be removed at any time)

- Everything in `bitsandbytes.research.*`
- `bnb.nn.SwitchBackLinear*` (triton-based)
- `bnb.nn.SwitchBackLinearBnb`
- `bnb.nn.OutlierAwareLinear`
- `bnb.nn.StandardLinear`
- `bnb.utils.OutlierTracer`, `find_outlier_dims`

### Tier 4: Internal (not part of public API, may change freely)

- `bitsandbytes.cextension.*` (native library loading)
- `bitsandbytes.functional.get_ptr`, `is_on_gpu`, `_get_tensor_stream`
- `bitsandbytes.functional.GlobalPageManager`, `CUBLAS_Context`
- `bitsandbytes.functional._convert_weight_packed_for_cpu*`
- `bitsandbytes.functional.check_matmul`, `elementwise_func`, `fill`, `_mul`
- `bitsandbytes.utils.pack_dict_to_tensor`, `unpack_tensor_to_dict`
- `bitsandbytes.utils.execute_and_return`, `sync_gpu`
- `bitsandbytes.optim.optimizer.MockArgs`
- All backend implementation files (`backends/*/ops.py`)
- All CUDA/C++ code (`csrc/*`)


================================================
FILE: agents/architecture_guide.md
================================================
# bitsandbytes Architecture Guide

This document provides a comprehensive architecture reference for agents reviewing pull requests
or writing code for the bitsandbytes library. It describes every layer of the codebase, how data
flows through the system, how backends are dispatched, and how the build system produces native
libraries. Read this before reviewing any PR — it replaces the need to read the whole codebase.

---

## Table of Contents

1. [Project Overview](#1-project-overview)
2. [Directory Layout](#2-directory-layout)
3. [Layer Architecture](#3-layer-architecture)
4. [The Op Registry (`_ops.py`)](#4-the-op-registry-_opspy)
5. [Backend Dispatch System](#5-backend-dispatch-system)
6. [Native Library Loading (`cextension.py`)](#6-native-library-loading-cextensionpy)
7. [The Functional Layer (`functional.py`)](#7-the-functional-layer-functionalpy)
8. [Quantization Data Types and QuantState](#8-quantization-data-types-and-quantstate)
9. [Autograd Functions (`autograd/_functions.py`)](#9-autograd-functions-autograd_functionspy)
10. [Neural Network Modules (`nn/modules.py`)](#10-neural-network-modules-nnmodulespy)
11. [Optimizer System (`optim/`)](#11-optimizer-system-optim)
12. [CUDA/C++ Native Code (`csrc/`)](#12-cudac-native-code-csrc)
13. [Build System (`CMakeLists.txt`)](#13-build-system-cmakeliststxt)
14. [Data Flow: End-to-End Traces](#14-data-flow-end-to-end-traces)
15. [Key Design Patterns](#15-key-design-patterns)
16. [Cross-Cutting Concerns](#16-cross-cutting-concerns)
17. [Test Structure](#17-test-structure)

---

## 1. Project Overview

bitsandbytes is a library for quantized operations on neural network models. It provides:

- **8-bit matrix multiplication** (LLM.int8() algorithm) for inference and training
- **4-bit quantization** (QLoRA / NF4 / FP4) for memory-efficient inference and fine-tuning
- **8-bit optimizers** (Adam, AdamW, SGD, Lion, AdEMAMix, etc.) that compress optimizer state
- **Quantized `nn.Module` replacements** (`Linear8bitLt`, `Linear4bit`, `Embedding4bit`, etc.)

The library supports multiple backends: CUDA (primary), ROCm/HIP, CPU, XPU (Intel), MPS (Apple
Silicon), HPU (Gaudi), and Triton. CUDA is by far the most complete and optimized backend.

---

## 2. Directory Layout

```
bitsandbytes/
├── __init__.py              # Top-level exports, re-exports from functional, autograd, nn
├── _ops.py                  # torch.library.define() op schemas + register_fake + register_kernel helper
├── functional.py            # Stateless Python API: quantize, dequantize, matmul, optimizer updates
├── cextension.py            # Native library loader (ctypes), detects CUDA/ROCm/CPU
├── cuda_specs.py            # CUDA version detection utilities
├── consts.py                # Constants (PACKAGE_DIR, DYNAMIC_LIBRARY_SUFFIX)
├── utils.py                 # OutlierTracer, weight format mappings, sync_gpu
│
├── autograd/
│   ├── __init__.py
│   └── _functions.py        # MatMul8bitLt, MatMul8bitFp, MatMul4Bit autograd functions
│
├── nn/
│   ├── __init__.py           # Re-exports all nn modules
│   ├── modules.py            # Linear8bitLt, Linear4bit, Int8Params, Params4bit, Embeddings
│   └── triton_based_modules.py  # SwitchBackLinear (triton-based)
│
├── optim/
│   ├── __init__.py           # Re-exports all optimizer classes
│   ├── optimizer.py          # Base classes: Optimizer8bit, Optimizer1State, Optimizer2State, GlobalOptimManager
│   ├── adam.py               # Adam, Adam8bit, Adam32bit, PagedAdam, PagedAdam8bit, PagedAdam32bit
│   ├── adamw.py              # Same pattern for AdamW
│   ├── ademamix.py           # AdEMAMix variants
│   ├── lion.py               # Lion variants
│   ├── sgd.py                # SGD variants
│   ├── rmsprop.py            # RMSprop variants
│   ├── adagrad.py            # Adagrad variants
│   ├── lamb.py               # LAMB variants
│   └── lars.py               # LARS variants + PytorchLARS
│
├── backends/
│   ├── __init__.py           # Empty (backends auto-register via imports)
│   ├── utils.py              # Shared: NF4/FP4 lookup tables (CODE dict), triton_available flag, Gaudi version
│   ├── default/
│   │   └── ops.py            # Pure PyTorch fallback implementations (all ops)
│   ├── cuda/
│   │   └── ops.py            # CUDA implementations via ctypes calls to lib.*
│   ├── cpu/
│   │   └── ops.py            # CPU-optimized implementations (AVX512, torch._int_mm)
│   ├── triton/
│   │   ├── ops.py            # Triton kernel registrations
│   │   ├── kernels_4bit.py   # Triton 4-bit dequant kernels
│   │   ├── kernels_8bit_quant.py  # Triton 8-bit quant kernels
│   │   └── kernels_optim.py  # Triton optimizer kernels
│   ├── xpu/                  # Intel XPU backend
│   └── hpu/                  # Habana Gaudi backend
│
csrc/
├── pythonInterface.cpp       # C++ wrapper: unmangled functions callable via ctypes
├── ops.cu                    # CUDA op dispatch: launches kernels with grid/block configs
├── kernels.cu                # CUDA kernel implementations (__global__ functions)
├── ops.cuh                   # CUDA op declarations + error checking macros + context classes
├── kernels.cuh               # CUDA kernel declarations
├── common.cuh                # Compute capability macros (BNB_CC_VOLTA, etc.)
├── common.h                  # Shared C header
├── cpu_ops.cpp               # CPU-native C++ kernels (blockwise quant, etc.)
├── cpu_ops.h                 # CPU op declarations
├── ops.hip / kernels.hip     # ROCm/HIP equivalents
├── ops_hip.cuh / kernels_hip.cuh / common_hip.cuh
├── mps_ops.mm                # Apple MPS Objective-C++ ops
├── mps_kernels.metal         # Apple Metal shader kernels
├── xpu_ops.cpp / xpu_kernels.cpp  # Intel XPU ops
└── xpu_ops.h / xpu_kernels.h

CMakeLists.txt                # Build system: compiles csrc/ into libbitsandbytes_*.so
pyproject.toml                # Package metadata, build config

tests/
├── conftest.py               # Shared fixtures (device parametrize, etc.)
├── helpers.py                # Test utility functions
├── test_functional.py        # Tests for functional.py ops
├── test_ops.py               # Tests for torch.ops.bitsandbytes.* dispatch
├── test_linear4bit.py        # Tests for Linear4bit / Params4bit
├── test_linear8bitlt.py      # Tests for Linear8bitLt / Int8Params
├── test_modules.py           # Tests for nn modules
├── test_autograd.py          # Tests for autograd correctness
├── test_optim.py             # Tests for all optimizers
├── test_triton.py            # Tests for triton kernels
├── test_deprecated.py        # Tests that deprecated APIs warn/error properly
├── test_parametrize.py       # Tests for weight parametrization
├── test_generation.py        # Integration: text generation with quantized models
└── test_cuda_setup_evaluator.py  # Tests for CUDA detection/setup
```

---

## 3. Layer Architecture

The codebase is organized into **five distinct layers**, numbered from lowest (Layer 1) to highest (Layer 5). The diagram lists them top-down, starting with Layer 5:

```
┌──────────────────────────────────────────────────────────────────────┐
│  Layer 5: nn.Modules (Linear4bit, Linear8bitLt, Embedding4bit)     │
│  → User-facing PyTorch modules that wrap everything below           │
├──────────────────────────────────────────────────────────────────────┤
│  Layer 4: Autograd Functions (MatMul4Bit, MatMul8bitLt)            │
│  → Custom backward passes for quantized matmul                     │
├──────────────────────────────────────────────────────────────────────┤
│  Layer 3: Functional API (functional.py)                           │
│  → Stateless Python functions: quantize_4bit, dequantize_4bit,     │
│    optimizer_update_32bit, etc. Calls torch.ops.bitsandbytes.*     │
├──────────────────────────────────────────────────────────────────────┤
│  Layer 2: Op Registry (_ops.py) + Backend Dispatch                 │
│  → torch.library.define() schemas, register_fake(),                │
│    register_kernel() per device (cuda, cpu, default, triton, etc.) │
├──────────────────────────────────────────────────────────────────────┤
│  Layer 1: Native Kernels (csrc/)                                   │
│  → CUDA kernels, ctypes interface, cuBLAS calls                    │
│  → Loaded via cextension.py → ct.cdll.LoadLibrary()               │
└──────────────────────────────────────────────────────────────────────┘
```

**Important**: Not all paths go through all layers. For example:
- Optimizers: `optim/*.py` → `functional.py` → `torch.ops.bitsandbytes.*` → backend kernel
- Direct quantization: User calls `bnb.functional.quantize_4bit()` → same path but no nn.Module

---

## 4. The Op Registry (`_ops.py`)

This is the central contract layer. Every operation in bitsandbytes is defined here as a
`torch.library` op, which enables:
- **torch.compile** compatibility (via `register_fake` providing shape/dtype metadata)
- **Multi-backend dispatch** (each backend registers its kernel for the same op name)
- **Consistent API** across CUDA, CPU, Triton, etc.

### How it works

```python
# _ops.py defines ops and their schemas:
torch.library.define("bitsandbytes::quantize_4bit", "(Tensor A, int blocksize, str quant_type, ScalarType quant_storage) -> (Tensor, Tensor)")

# register_fake provides shape inference for torch.compile:
@torch.library.register_fake("bitsandbytes::quantize_4bit")
def _(A, blocksize, quant_type, quant_storage):
    # Returns tensors with correct shapes but no real data
    ...

# Each backend registers its implementation:
# In backends/cuda/ops.py:
@register_kernel("bitsandbytes::quantize_4bit", "cuda")
def _(A, blocksize, quant_type, quant_storage):
    # Actual CUDA implementation via ctypes
    ...

# In backends/default/ops.py:
@register_kernel("bitsandbytes::quantize_4bit", "default")
def _(A, blocksize, quant_type, quant_storage):
    # Pure PyTorch fallback
    ...
```

### `register_kernel` helper

The `register_kernel` function in `_ops.py` is a wrapper around
`torch.library.register_kernel`. It treats the `"default"` dispatch key as a special case:
the implementation is registered through `torch.library.impl` with `"default"` and serves as
the fallback when no device-specific kernel is registered for the tensor's device type.

### Current op catalog

All ops are defined with the namespace `bitsandbytes::`:

**Quantization ops:**
- `quantize_blockwise` — 8-bit blockwise quantization (codebook-based)
- `dequantize_blockwise` / `dequantize_blockwise.out` — inverse
- `quantize_4bit` — 4-bit quantization (NF4 or FP4)
- `dequantize_4bit` / `dequantize_4bit.out` — inverse

**Int8 matmul ops:**
- `int8_linear_matmul` / `int8_linear_matmul.out` — int8 x int8 → int32 via cuBLASLt
- `int8_mm_dequant` — dequantize int32 matmul result to fp16/bf16
- `int8_scaled_mm` — fused int8 matmul + dequant (composes the above two)
- `int8_vectorwise_quant` — row-wise int8 quantization with optional outlier detection
- `int8_vectorwise_dequant` — inverse
- `int8_double_quant` — both row-wise and column-wise quantization (for LLM.int8())
- `int8_mixed_scaled_mm` — int8 matmul with outlier decomposition (mixed-precision)

**4-bit inference ops:**
- `gemv_4bit` / `gemv_4bit.out` — fused 4-bit dequant + matmul (single-batch inference)

**Optimizer ops:**
- `optimizer_update_32bit` — 32-bit optimizer step (Adam, Lion, SGD, etc.)
- `optimizer_update_8bit_blockwise` — 8-bit blockwise optimizer step

---

## 5. Backend Dispatch System

### How backends are loaded

When Python imports `bitsandbytes`, the following happens:

1. `__init__.py` imports `functional.py`
2. `functional.py` imports from `_ops.py` (registers op schemas and fake kernels)
3. `functional.py` imports the backends module
4. Each backend module (`backends/cuda/ops.py`, etc.) calls `@register_kernel(op_name, device)`
   at module level, registering implementations for their device type

The import chain in `functional.py`:
```python
import bitsandbytes.backends.default.ops      # Always loaded — pure PyTorch fallback
import bitsandbytes.backends.cuda.ops         # Loaded only if CUDA available
import bitsandbytes.backends.cpu.ops          # Always loaded (some ops conditional)
import bitsandbytes.backends.triton.ops       # Loaded only if triton installed
# etc.
```

### Dispatch precedence

When you call `torch.ops.bitsandbytes.quantize_4bit(tensor_on_cuda, ...)`:

1. PyTorch dispatches to the kernel registered for the tensor's device type
2. If `"cuda"` kernel exists → use it
3. If not → fall back to `"default"` kernel (pure PyTorch implementation)

This means:
- CUDA tensors use CUDA kernels (fast, ctypes → native CUDA)
- CPU tensors use CPU kernels if registered, otherwise default (pure PyTorch)
- Any new device automatically gets the `default` fallback
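
The precedence rule above can be modeled with a small dictionary-based registry. This is a toy model of the lookup order only: `toy_register_kernel`, `dispatch`, and the `toy::double` op are invented for the example, and real dispatch is performed by PyTorch's dispatcher, not by code like this.

```python
from typing import Callable, Dict, Tuple

_KERNELS: Dict[Tuple[str, str], Callable] = {}


def toy_register_kernel(op_name: str, device: str):
    """Toy stand-in for the real helper: store the function under (op, device)."""
    def decorator(func):
        _KERNELS[(op_name, device)] = func
        return func
    return decorator


def dispatch(op_name: str, device: str, *args):
    """Prefer the device-specific kernel, otherwise fall back to "default"."""
    kernel = _KERNELS.get((op_name, device)) or _KERNELS.get((op_name, "default"))
    if kernel is None:
        raise NotImplementedError(f"no kernel for {op_name} on {device}")
    return kernel(*args)


@toy_register_kernel("toy::double", "default")
def _(x):
    return ("default", 2 * x)


@toy_register_kernel("toy::double", "cuda")
def _(x):
    return ("cuda", 2 * x)


on_cuda = dispatch("toy::double", "cuda", 21)
on_new_device = dispatch("toy::double", "some_new_device", 21)
```

Here `on_cuda` is served by the device-specific kernel, while `on_new_device` lands on the fallback because nothing was registered for that device.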

### Backend capabilities matrix

| Op Category | CUDA | CPU | Default | Triton | XPU | HPU | MPS |
|---|---|---|---|---|---|---|---|
| 8-bit quantize/dequant | ctypes | C++/partial | PyTorch | Triton kernels | SYCL | partial | partial |
| 4-bit quantize/dequant | ctypes | partial | PyTorch | Triton kernels | SYCL | partial | — |
| int8 matmul (cuBLASLt) | ctypes | torch._int_mm | PyTorch fp32 fallback | — | — | — | — |
| gemv_4bit (fused) | ctypes | — | PyTorch | — | — | — | — |
| Optimizer 32-bit | ctypes | — | torch.compile | Triton | — | — | — |
| Optimizer 8-bit blockwise | ctypes | — | — | Triton | — | — | — |

---

## 6. Native Library Loading (`cextension.py`)

This module handles discovering and loading the compiled C/CUDA shared library via ctypes.

### Loading process

1. `get_cuda_specs()` detects the CUDA version from PyTorch
2. `get_cuda_bnb_library_path()` constructs the expected library filename:
   - CUDA: `libbitsandbytes_cuda{VERSION}.so` (e.g., `libbitsandbytes_cuda124.so`)
   - ROCm: `libbitsandbytes_rocm{VERSION}.so`
   - CPU-only: `libbitsandbytes_cpu.so`
   - XPU: `libbitsandbytes_xpu.so`
   - MPS: `libbitsandbytes_mps.dylib`
3. `ct.cdll.LoadLibrary(path)` loads the shared library
4. The loaded library is wrapped in either:
   - `CudaBNBNativeLibrary` — if `get_context` symbol exists (CUDA/ROCm build)
   - `BNBNativeLibrary` — for CPU-only builds
   - `ErrorHandlerMockBNBNativeLibrary` — if loading fails (defers errors to call time)

### The `lib` global

```python
# cextension.py — at module level:
lib = get_native_library()  # This is the global used everywhere
```

All CUDA backend ops access native code through this `lib` object:
```python
# In backends/cuda/ops.py:
from ...cextension import lib

lib.cquantize_blockwise_fp16(code_ptr, A_ptr, absmax_ptr, out_ptr, blocksize, n)
```

### `BNBNativeLibrary.__getattr__`

The library wrapper uses `__getattr__` with caching. If a function is not found in the loaded
library, it returns a stub that raises `RuntimeError` when called (rather than at attribute
access time). This allows CPU-only installations to import successfully and only error when
GPU-specific functions are actually invoked.
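
The deferred-error behavior can be sketched in plain Python. The class `LazyNativeLibrary` and the symbol names are hypothetical, and a dict of callables stands in for the ctypes handle. The point is only the pattern: attribute access always succeeds, results are cached, and a missing symbol fails at call time.

```python
class LazyNativeLibrary:
    """Toy wrapper: missing symbols become stubs that raise only when called."""

    def __init__(self, symbols: dict):
        self._symbols = symbols  # stands in for the ctypes library handle

    def __getattr__(self, name):
        # Only invoked when normal lookup fails, i.e. on first access of `name`.
        fn = self._symbols.get(name)
        if fn is None:
            def fn(*args, **kwargs):
                raise RuntimeError(f"native function {name!r} is not available in this build")
        setattr(self, name, fn)  # cache so later lookups skip __getattr__
        return fn


lib = LazyNativeLibrary({"cadd": lambda a, b: a + b})
available = lib.cadd(2, 3)
missing = lib.cquantize_gpu_only  # attribute access succeeds; calling it would raise
```

Importing a module that merely references `lib.cquantize_gpu_only` therefore works on a CPU-only build. The `RuntimeError` appears only if that function is actually called.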

### Environment variables

- `BNB_CUDA_VERSION` — Override the auto-detected CUDA version for library selection
    - `BNB_ROCM_VERSION` is the ROCm equivalent
- Standard CUDA env vars (`CUDA_HOME`, `LD_LIBRARY_PATH`) affect library discovery
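
A sketch that combines the filename scheme from the loading process with the version overrides listed above. `native_library_filename` is a hypothetical helper, not the library's `get_cuda_bnb_library_path()`. It assumes the override holds the compact version string (for example `124`) and ignores platform-specific suffixes other than the ones shown in this guide.

```python
import os
from typing import Optional


def native_library_filename(backend: str, detected_version: Optional[str] = None) -> str:
    """Build the expected shared-library filename for a backend (hypothetical helper)."""
    if backend == "cuda":
        # An explicit override wins over the auto-detected version.
        version = os.environ.get("BNB_CUDA_VERSION", detected_version)
        return f"libbitsandbytes_cuda{version}.so"
    if backend == "rocm":
        version = os.environ.get("BNB_ROCM_VERSION", detected_version)
        return f"libbitsandbytes_rocm{version}.so"
    if backend == "mps":
        return "libbitsandbytes_mps.dylib"
    if backend in ("cpu", "xpu"):
        return f"libbitsandbytes_{backend}.so"
    raise ValueError(f"unknown backend: {backend}")


cpu_name = native_library_filename("cpu")
```

Reading the override inside the function, not at import time, keeps the sketch easy to test by toggling the environment variable.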

---

## 7. The Functional Layer (`functional.py`)

This is the stateless Python API layer. It contains:

### Quantization codebook infrastructure

```python
# Pre-computed quantization maps:
create_dynamic_map(signed=True, total_bits=8)  # Creates 256-entry dynamic quantization codebook
create_normal_map(offset=0.9677083, symmetric=False)  # NF4 codebook from normal distribution
create_fp4_map()  # FP4 codebook

# These are stored as:
# - torch.Tensor of shape (256,) for 8-bit
# - torch.Tensor of shape (16,) for 4-bit
```

### QuantState class

```python
@dataclass
class QuantState:
    absmax: torch.Tensor          # Per-block absolute maximum values
    shape: torch.Size             # Original tensor shape before quantization
    dtype: torch.dtype            # Original tensor dtype
    blocksize: int                # Block size used for quantization (default 64)
    quant_type: str               # "nf4" or "fp4"
    code: torch.Tensor            # 16-element quantization codebook
    nested: bool = False          # Whether double quantization is used
    # If nested=True, the absmax values are themselves quantized.
    # Both fields default to None so that they can follow the defaulted field above:
    state2: Optional["QuantState"] = None    # Nested quantization state for absmax
    offset: Optional[torch.Tensor] = None    # Offset for nested quantization
```

The `QuantState` is the metadata container that travels with every quantized tensor. It stores
everything needed to dequantize: the scaling factors (absmax), the codebook, the original shape,
and optionally a nested quantization state for the absmax values themselves ("double quantization").

### Key functions

**4-bit quantization (the QLoRA path):**
```python
def quantize_4bit(A, blocksize=64, compress_statistics=True, quant_type="fp4", quant_storage=torch.uint8):
    """Quantizes tensor A to 4-bit. Returns (packed_4bit_tensor, QuantState)."""
    # 1. Calls torch.ops.bitsandbytes.quantize_4bit → dispatched to backend
    # 2. If compress_statistics=True, also quantizes the absmax values (double quant)
    # 3. Returns QuantState with all metadata

def dequantize_4bit(A, quant_state, absmax=None, out=None, blocksize=64, quant_type="fp4"):
    """Dequantizes 4-bit tensor back to float. Uses QuantState for metadata."""
    # 1. If double quantization, first dequantize the absmax
    # 2. Calls torch.ops.bitsandbytes.dequantize_4bit → dispatched to backend
```

**8-bit quantization:**
```python
def int8_vectorwise_quant(A, threshold=0.0):
    """Row-wise int8 quantization. Returns (quantized, row_stats, outlier_cols)."""
    # If threshold > 0: identifies outlier columns (for LLM.int8())
    # Calls torch.ops.bitsandbytes.int8_vectorwise_quant

def int8_double_quant(A, threshold=0.0):
    """Both row-wise and column-wise int8 quantization."""
    # Used by the backward pass of LLM.int8()
    # Returns (quant_row, quant_col, row_stats, col_stats, outlier_cols)
```

**Blockwise 8-bit quantization (for optimizers):**
```python
def quantize_blockwise(A, code=None, absmax=None, out=None, blocksize=4096):
    """Blockwise quantization using a 256-entry codebook."""
    # Used for optimizer state compression
    # Default blocksize=4096 for optimizers (larger blocks = less memory overhead)

def dequantize_blockwise(A, quant_state=None, absmax=None, code=None, out=None, blocksize=4096, ...):
    """Inverse of quantize_blockwise."""
```
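
To show what "blockwise" means in practice, the following pure-Python sketch quantizes a list of floats block by block: each block is scaled by its own absolute maximum and each scaled value is replaced by the index of the nearest codebook entry. It assumes a uniform 256-entry codebook for simplicity, whereas the library's dynamic map is non-uniform, and the `*_sketch` function names are invented for this example.

```python
def quantize_blockwise_sketch(values, code, blocksize):
    """Return (indices, absmax): one codebook index per value, one scale per block."""
    indices, absmax = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        scale = max(abs(v) for v in block) or 1.0  # all-zero block: avoid dividing by zero
        absmax.append(scale)
        for v in block:
            normalized = v / scale  # now in [-1, 1]
            # Nearest codebook entry by absolute distance.
            indices.append(min(range(len(code)), key=lambda i: abs(code[i] - normalized)))
    return indices, absmax


def dequantize_blockwise_sketch(indices, absmax, code, blocksize):
    """Inverse: look up each code value and rescale by its block's absmax."""
    return [code[idx] * absmax[pos // blocksize] for pos, idx in enumerate(indices)]


# Assumption: a uniform codebook on [-1, 1] instead of the library's dynamic map.
code = [-1.0 + 2.0 * i / 255 for i in range(256)]
data = [0.5, -2.0, 1.0, 0.25, 10.0, -7.5, 3.0, 0.0]
indices, absmax = quantize_blockwise_sketch(data, code, blocksize=4)
restored = dequantize_blockwise_sketch(indices, absmax, code, blocksize=4)
```

Because each block carries its own scale, the large values in the second block do not reduce the precision available to the small values in the first block.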

**Optimizers:**
```python
def optimizer_update_32bit(optimizer_name, grad, param, state1, beta1, eps, step, lr, state2=None, ...):
    """Dispatches 32-bit optimizer update to the appropriate backend kernel."""
    # Calls torch.ops.bitsandbytes.optimizer_update_32bit

def optimizer_update_8bit_blockwise(optimizer_name, grad, param, state1, state2, ...):
    """Dispatches 8-bit blockwise optimizer update."""
    # Calls torch.ops.bitsandbytes.optimizer_update_8bit_blockwise
```

**Inference (4-bit GEMV):**
```python
def gemv_4bit(A, B, out=None, transposed_A=False, transposed_B=False, state=None):
    """Fused 4-bit dequantize + matrix-vector multiply."""
    # Used when: single batch (A.numel() == A.shape[-1]) and inference mode
    # Much faster than separate dequant+matmul for single-token generation
    # Calls torch.ops.bitsandbytes.gemv_4bit
```

### CUBLAS_Context and utility classes

```python
class CUBLAS_Context:
    """Singleton managing cuBLAS handles per CUDA device."""
    # Used by int8 matmul to get cuBLASLt handle
    # get_instance().get_context(device) → cublasLtHandle_t

class GlobalPageManager:
    """Manages CUDA unified memory for paged optimizers."""
    # Paged optimizers use cudaMallocManaged for state tensors
    # Allows automatic CPU↔GPU migration
```

### Helper functions

```python
def get_ptr(tensor):
    """Gets raw pointer for ctypes calls. Returns None for None tensors."""

def _cuda_device_of(tensor):
    """Context manager that sets the correct CUDA device for the tensor."""

def _get_tensor_stream(tensor):
    """Gets the current CUDA stream for a tensor's device."""
```

---

## 8. Quantization Data Types and QuantState

### NF4 (Normal Float 4-bit)

NF4 is a 4-bit data type whose 16 quantization bins each have equal probability under a
standard normal distribution N(0,1). This makes it optimal for normally distributed values,
which neural network weights approximately are.

The 16 NF4 values (normalized to [-1, 1]):
```
-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
 0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230, 1.0
```

Note the asymmetry: there are 7 negative values and 8 positive values, with 0.0 as one of
the representable values.
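
A short sketch of how these 16 values are used: a block is divided by its absolute maximum so that it falls in [-1, 1], and each element is mapped to the index of the nearest NF4 value. The function names are hypothetical, and the nearest-value search is a straightforward reference approach, not the optimized kernel logic.

```python
NF4_VALUES = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def nf4_quantize_block(block):
    """Scale a block by its absmax, then pick the nearest NF4 index for each value."""
    absmax = max(abs(v) for v in block) or 1.0  # all-zero block: avoid dividing by zero
    indices = [
        min(range(16), key=lambda i: abs(NF4_VALUES[i] - v / absmax))
        for v in block
    ]
    return indices, absmax


def nf4_dequantize_block(indices, absmax):
    """Look up each NF4 value and undo the absmax scaling."""
    return [NF4_VALUES[i] * absmax for i in indices]


block = [0.0, 0.3, -0.3, 0.6, -0.6]
indices, absmax = nf4_quantize_block(block)
restored = nf4_dequantize_block(indices, absmax)
```

Zero and the block extremes are reproduced exactly, while `0.3` and `-0.3` come back with different magnitudes because the positive and negative halves of the codebook are not mirror images.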

### FP4 (Floating Point 4-bit)

FP4 uses a 1-bit sign + 3-bit magnitude with a custom encoding:
```
Sign bit + 3-bit value:
0b000 = 0.0
0b001 = 0.005208 (subnormal)
0b010 = 0.6667
0b011 = 1.0
0b100 = 0.3333
0b101 = 0.5
0b110 = 0.1667
0b111 = 0.25
```
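
The table above translates directly into a lookup-based decoder. The sketch assumes that the sign occupies the most significant of the four bits (bit 3) and that the remaining three bits index the magnitude table. That bit layout is an assumption made for illustration, and `fp4_decode` is a hypothetical name.

```python
FP4_MAGNITUDES = {
    0b000: 0.0,
    0b001: 0.005208,
    0b010: 0.6667,
    0b011: 1.0,
    0b100: 0.3333,
    0b101: 0.5,
    0b110: 0.1667,
    0b111: 0.25,
}


def fp4_decode(code: int) -> float:
    """Decode a 4-bit code, assuming bit 3 is the sign and bits 0-2 the magnitude."""
    if not 0 <= code <= 0b1111:
        raise ValueError("FP4 codes are 4 bits wide")
    magnitude = FP4_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


decoded = [fp4_decode(c) for c in range(16)]
```

Note that the magnitudes are not monotonic in the 3-bit code, so comparisons must be done on decoded values, not on raw codes.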

### 4-bit packing

Two 4-bit values are packed per byte:
```
packed_byte = (high_nibble << 4) | low_nibble
```

The packed tensor has shape `((n + 1) // 2, 1)` with `quant_storage` dtype (default `uint8`).
When `quant_storage` is not `uint8`, the packed bytes are viewed as the storage dtype.
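
The packing rule can be exercised in pure Python with `bytes` standing in for a `uint8` tensor. The sketch assumes that the first element of each pair goes into the high nibble and that odd-length inputs are padded with a zero nibble. Both choices are assumptions, and `pack_nibbles` / `unpack_nibbles` are hypothetical names.

```python
def pack_nibbles(values):
    """Pack 4-bit values two per byte; the first of each pair goes in the high nibble."""
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("each value must fit in 4 bits")
    padded = list(values) + [0] * (len(values) % 2)  # pad odd lengths with a zero nibble
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed, n):
    """Recover the first n 4-bit values from the packed bytes."""
    out = []
    for byte in packed:
        out.extend(((byte >> 4) & 0x0F, byte & 0x0F))
    return out[:n]


values = [7, 12, 2, 15, 0]
packed = pack_nibbles(values)
unpacked = unpack_nibbles(packed, len(values))
```

Five input values occupy `(5 + 1) // 2 = 3` bytes, matching the shape formula above. The original element count must be kept separately to drop the padding on unpack.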

### QuantState serialization

QuantState can serialize/deserialize for checkpointing via `as_dict(packed=True)` and
`from_dict()`. When saved to a state dict (e.g., in `Linear4bit._save_to_state_dict`), the
quant state components are stored alongside the weight with keys like:
```
weight.quant_state.bitsandbytes__nf4
weight.absmax
weight.quant_map
weight.nested_absmax
weight.nested_quant_map
weight.quant_state.nested_blocksize
weight.quant_state.nested_dtype
weight.quant_state.nested_offset
```

### Double quantization (compress_statistics)

When `compress_statistics=True` (default for 4-bit), the `absmax` values themselves are quantized
using 8-bit blockwise quantization. This reduces the memory overhead of storing scaling factors.
The nested quant state is stored inside `QuantState.state2`.
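
The saving from double quantization can be estimated with simple arithmetic. The helper below is illustrative only: it assumes fp32 scaling factors, a 4-bit blocksize of 64, and a nested blocksize of 256 for the second level, and it ignores the single `offset` value. These numbers are assumptions for the worked example, not values read from the code.

```python
def absmax_bits_per_param(blocksize, absmax_bits=32, nested_blocksize=None, nested_bits=32):
    """Bits of scaling-factor overhead per quantized parameter."""
    if nested_blocksize is None:
        # One absmax of `absmax_bits` bits is shared by `blocksize` parameters.
        return absmax_bits / blocksize
    # Nested: each absmax costs `absmax_bits`, plus a second-level scale of
    # `nested_bits` shared by `nested_blocksize` first-level blocks.
    return absmax_bits / blocksize + nested_bits / (blocksize * nested_blocksize)


plain = absmax_bits_per_param(64)  # fp32 absmax per 64 weights
double = absmax_bits_per_param(64, absmax_bits=8, nested_blocksize=256)
saved = plain - double
```

Under these assumptions the overhead drops from 0.5 bits to roughly 0.127 bits per parameter, on top of the 4 bits used for the weight itself.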

---

## 9. Autograd Functions (`autograd/_functions.py`)

### MatMul8bitLt (LLM.int8())

The core 8-bit matmul with custom forward and backward.

**Forward path:**
1. Quantize activations A to int8 (row-wise) via `int8_vectorwise_quant` or `int8_double_quant`
2. Quantize weights B to int8 (row-wise) if not already cached
3. If `threshold > 0`: identify outlier columns, use mixed-precision decomposition
   - Non-outlier part: int8 matmul via `int8_scaled_mm`
   - Outlier part: fp16 matmul on outlier columns only, added back to result
4. If `threshold == 0`: pure int8 matmul via `int8_scaled_mm`
5. Save quantized states for backward
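
The forward steps above can be sketched in pure Python, with nested lists for matrices and Python floats in place of fp16. This is a simplified model under stated assumptions: outlier columns are detected on the activations only, row-wise absmax scaling maps onto the range [-127, 127], and the function names are invented. It is not the `MatMul8bitLt` implementation.

```python
def quantize_rows(matrix):
    """Row-wise absmax int8 quantization: returns (integer rows, per-row scale)."""
    q, scales = [], []
    for row in matrix:
        scale = (max(abs(v) for v in row) or 1.0) / 127.0
        scales.append(scale)
        q.append([round(v / scale) for v in row])
    return q, scales


def int8_matmul_with_outliers(A, W, threshold):
    """Compute A @ W^T, routing outlier columns of A through a float path."""
    k = len(A[0])
    outliers = []
    if threshold > 0:
        outliers = [c for c in range(k) if any(abs(row[c]) > threshold for row in A)]
    # Zero the outlier columns before quantizing so they do not inflate the scales.
    A_main = [[0.0 if c in outliers else v for c, v in enumerate(row)] for row in A]
    qA, sA = quantize_rows(A_main)
    qW, sW = quantize_rows(W)
    out = []
    for i in range(len(A)):
        out_row = []
        for j in range(len(W)):
            acc = sum(qA[i][c] * qW[j][c] for c in range(k))   # integer accumulate
            value = acc * sA[i] * sW[j]                        # dequantize
            value += sum(A[i][c] * W[j][c] for c in outliers)  # float outlier part
            out_row.append(value)
        out.append(out_row)
    return out, outliers


A = [[0.5, -1.0, 20.0, 0.25], [1.5, 0.75, -0.5, -1.0]]
W = [[0.1, 0.2, 0.3, 0.4], [-0.4, 0.3, -0.2, 0.1]]
result, outlier_cols = int8_matmul_with_outliers(A, W, threshold=6.0)
exact = [[sum(a * w for a, w in zip(ra, rw)) for rw in W] for ra in A]
```

Without the decomposition, the value `20.0` would set the scale for its whole row and crush the small activations toward zero. Handling that column in floating point keeps `result` close to `exact`.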

**Backward path:**
- `grad_B`: Uses int8 matmul of grad_output^T × A^T (both quantized) + outlier correction
- `grad_A`: Dequantizes weights and does fp16 matmul: grad_output × W_dequant

**Key state object — `MatmulLtState`:**
```python
@dataclass
class MatmulLtState:
    CB: Optional[torch.Tensor] = None      # Quantized weight (int8)
    SCB: Optional[torch.Tensor] = None     # Weight row statistics (float32)
    threshold: float = 0.0                  # Outlier threshold for mixed-precision
    has_fp16_weights: bool = True           # Whether to keep fp16 weights
    is_training: bool = True
    # ... more fields for backward state
```

### MatMul8bitFp

A simpler 8-bit matmul for CPU/XPU that avoids the expensive int8 backward path:
- Forward: Dequantize weights to float, then `torch.nn.functional.linear`
- Backward: Standard fp16/fp32 matmul (no int8 in backward)
- ~3x faster on CPU/XPU because int8 quant/dequant kernels are slow on those platforms

### MatMul4Bit (QLoRA)

The 4-bit matmul autograd function.

**Forward path:**
1. Dequantize 4-bit weights B using `dequantize_4bit(B, quant_state)`
2. Cast to activation dtype
3. Standard `torch.nn.functional.linear(A, B_dequant, bias)`

**Backward path:**
- `grad_A`: Dequantize weights again, matmul with grad_output
- `grad_B`: **Not supported** (4-bit weights are frozen; this is by design for QLoRA)

### Dispatch logic

The top-level `matmul()` and `matmul_4bit()` functions choose which autograd class to use:

```python
def matmul(A, B, ...):
    if training and device in ("cpu", "xpu"):
        return MatMul8bitFp.apply(...)  # Faster on CPU/XPU
    return MatMul8bitLt.apply(...)      # Full LLM.int8()

def matmul_4bit(A, B, quant_state, ...):
    if A.numel() == A.shape[-1] and not requires_grad:
        return gemv_4bit(...)  # Fast path: fused kernel for single-token inference
    return MatMul4Bit.apply(...)  # General path: dequant + matmul
```
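
The fast-path condition `A.numel() == A.shape[-1]` is easy to misread, so here is a small stand-alone check that operates on a shape tuple. `takes_gemv_fast_path` is a hypothetical helper that mirrors only the condition shown above.

```python
import math


def takes_gemv_fast_path(shape, requires_grad):
    """Mirror the check above: exactly one row of activations and no gradient needed."""
    # numel == last dim means every leading dimension has size 1.
    return math.prod(shape) == shape[-1] and not requires_grad


single_token = takes_gemv_fast_path((1, 1, 4096), requires_grad=False)
prefill = takes_gemv_fast_path((1, 8, 4096), requires_grad=False)
training = takes_gemv_fast_path((1, 1, 4096), requires_grad=True)
```

A batch of one token qualifies, while a multi-token prompt or any input that needs gradients falls through to the general dequantize-then-matmul path.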

### GlobalOutlierPooler

A singleton that tracks outlier dimensions across layers:
```python
class GlobalOutlierPooler:
    """Pools outlier dimensions across layers for small models."""
    # Important for small models where outlier features are less systematic
    # Used when MatmulLtState.use_pool = True
```

---

## 10. Neural Network Modules (`nn/modules.py`)

### Linear4bit

The QLoRA module. This is the most widely used component via HuggingFace transformers integration.

```python
class Linear4bit(nn.Linear):
    def __init__(self, input_features, output_features, bias=True,
                 compute_dtype=None, compress_statistics=True,
                 quant_type="fp4", quant_storage=torch.uint8, device=None):
        # Weight is wrapped in Params4bit (quantizes on .to(device))
        self.weight = Params4bit(self.weight.data, ...)
```

**Quantization trigger:** Weights are quantized lazily — when you call `.to("cuda")` or `.cuda()`,
`Params4bit.to()` detects the device move and calls `_quantize()`.
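
The lazy trigger can be illustrated with a toy class. Everything here is hypothetical (`LazyQuantParam`, string device names, and the simple absmax rounding to the range -7..7); it only shows the pattern of quantizing once, on the first move to a real device.

```python
class LazyQuantParam:
    """Toy parameter that quantizes itself on the first move to a real device."""

    def __init__(self, values):
        self.data = list(values)
        self.device = "cpu"
        self.quantized = False
        self.absmax = None

    def _quantize(self):
        # Scale by the largest magnitude, then round to a small integer range.
        self.absmax = max(abs(v) for v in self.data) or 1.0
        self.data = [round(v / self.absmax * 7) for v in self.data]
        self.quantized = True

    def to(self, device):
        # Moving to "meta" only changes bookkeeping; any other device quantizes once.
        if not self.quantized and device != "meta":
            self._quantize()
        self.device = device
        return self


p = LazyQuantParam([7.0, -3.0, 0.0])
p.to("meta")            # still float data
was_quantized_on_meta = p.quantized
p.to("cuda")            # first real device move triggers quantization
```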

**Forward pass:**
1. Fix quant state if lost (FSDP compatibility)
2. Auto-detect compute dtype from input if not set
3. Cast input to compute_dtype
4. Call `bnb.matmul_4bit(x, weight.t(), quant_state=...)`

**CPU inference path:** When `has_avx512bf16` and not training, weights are converted to a special
packed format optimized for CPU AVX512 inference.

### Params4bit

Custom `torch.nn.Parameter` subclass that carries quantization metadata:

```python
class Params4bit(torch.nn.Parameter):
    blocksize: int
    compress_statistics: bool
    quant_type: str          # "nf4" or "fp4"
    quant_state: QuantState
    quant_storage: torch.dtype
    bnb_quantized: bool
    module: Optional[Linear4bit]  # Back-reference to parent module
```

Key behaviors:
- `to(device)`: If not yet quantized and moving to a non-meta device → quantize
- `__torch_function__`: Handles `torch.chunk` and `torch.split` to preserve quant metadata
- `from_prequantized()`: Class method for loading already-quantized weights
- Supports `__getstate__`/`__setstate__` for pickling and `__deepcopy__`/`__copy__`

### Linear8bitLt

The LLM.int8() module.

```python
class Linear8bitLt(nn.Linear):
    def __init__(self, input_features, output_features, bias=True,
                 has_fp16_weights=True, threshold=0.0, ...):
        self.state = bnb.MatmulLtState()
        self.weight = Int8Params(self.weight.data, has_fp16_weights=...)
```

**`has_fp16_weights` modes:**
- `True` (default): Keeps fp16 weights, quantizes on every forward pass (training mode)
- `False`: Quantizes weights once on `.to(device)`, stores int8 permanently (inference mode)

**`threshold` parameter:**
- `0.0`: No outlier decomposition, pure int8 matmul
- `> 0.0` (e.g., 6.0): Mixed-precision decomposition — columns with activations exceeding
  threshold are computed in fp16
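
The decomposition can be sketched as splitting the inner dimension into outlier and regular feature indices and summing two partial products. This is an illustration under assumptions: both parts are computed in float here (in the library the regular part goes through int8), the comparison `>` follows the wording above, and the helper names are hypothetical.

```python
def split_outlier_columns(A, threshold):
    """Return (outlier, regular) feature indices of activation matrix A."""
    n_cols = len(A[0])
    if threshold <= 0.0:
        return [], list(range(n_cols))  # decomposition disabled
    outlier = [c for c in range(n_cols) if any(abs(row[c]) > threshold for row in A)]
    regular = [c for c in range(n_cols) if c not in outlier]
    return outlier, regular


def matmul_cols(A, W, cols):
    """Compute A[:, cols] @ W[cols, :] on nested lists."""
    n_out = len(W[0])
    return [[sum(row[k] * W[k][j] for k in cols) for j in range(n_out)] for row in A]


A = [[0.5, 8.0, -1.0], [0.25, -0.5, 2.0]]   # tokens x features
W = [[1, 0], [0, 1], [1, 1]]                # features x outputs

outlier, regular = split_outlier_columns(A, threshold=6.0)
low = matmul_cols(A, W, regular)    # would run as int8 in the real path
high = matmul_cols(A, W, outlier)   # would run as fp16 in the real path
combined = [[l + h for l, h in zip(lr, hr)] for lr, hr in zip(low, high)]
```

The two partial products add up to the full matmul, so the split changes precision handling, not the result being approximated.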

**State dict handling:**
- Saves `weight` (int8 data) + `SCB` (row statistics) + `weight_format` (always "row")
- Custom `_load_from_state_dict` to handle SCB restoration
- `_register_load_state_dict_pre_hook(maybe_rearrange_weight)` for format migration

### Int8Params

```python
class Int8Params(torch.nn.Parameter):
    CB: Optional[torch.Tensor]   # Quantized weight (same as .data when quantized)
    SCB: Optional[torch.Tensor]  # Row-wise scale factors
    has_fp16_weights: bool
```

**Quantization trigger:** Like `Params4bit`, `Int8Params` quantizes in `to(device)` when the weights move from CPU to GPU; per the modes above, this applies when `has_fp16_weights=False`.

### Embedding variants

- `StableEmbedding` — Adds LayerNorm + forces 32-bit optimizer states
- `Embedding` — Standard with 32-bit optimizer override
- `Embedding8bit` — Int8 quantized embeddings (dequant on lookup)
- `Embedding4bit` — 4-bit quantized with partial dequantization optimization
- `EmbeddingFP4`, `EmbeddingNF4` — Convenience subclasses

### Convenience aliases

```python
# Thin subclasses of Linear4bit that pin quant_type:
class LinearFP4(Linear4bit): ...  # quant_type="fp4"
class LinearNF4(Linear4bit): ...  # quant_type="nf4"
```

---

## 11. Optimizer System (`optim/`)

### Class hierarchy

```
torch.optim.Optimizer
└── Optimizer8bit
    ├── Optimizer1State    # SGD, Adagrad, RMSprop, Lion, LARS (1 state)
    │   ├── SGD / SGD8bit / SGD32bit
    │   ├── Adagrad / Adagrad8bit / Adagrad32bit
    │   ├── RMSprop / RMSprop8bit / RMSprop32bit
    │   ├── Lion / Lion8bit / Lion32bit / PagedLion / PagedLion8bit / PagedLion32bit
    │   └── LARS / LARS8bit / LARS32bit / PytorchLARS
    └── Optimizer2State    # Adam, AdamW, LAMB, AdEMAMix (2 states)
        ├── Adam / Adam8bit / Adam32bit / PagedAdam / PagedAdam8bit / PagedAdam32bit
        ├── AdamW / AdamW8bit / AdamW32bit / PagedAdamW / PagedAdamW8bit / PagedAdamW32bit
        ├── LAMB / LAMB8bit / LAMB32bit
        └── AdEMAMix / AdEMAMix8bit / AdEMAMix32bit / PagedAdEMAMix*
```

### How optimizer dispatch works

Each concrete optimizer class (e.g., `Adam8bit`) is a thin wrapper that calls `super().__init__`
with the optimizer name string and the bit width:

```python
class Adam8bit(Optimizer2State):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), ...):
        super().__init__("adam", params, lr, betas, ..., optim_bits=8, ...)
```

The base class `Optimizer2State.update_step()` then dispatches based on state dtype:

```python
def update_step(self, group, p, gindex, pindex):
    if state["state1"].dtype == torch.float:
        F.optimizer_update_32bit(self.optimizer_name, grad, p, state1, ...)
    elif state["state1"].dtype == torch.uint8:
        F.optimizer_update_8bit_blockwise(self.optimizer_name, grad, p, state1, ...)
```

### Optimizer state initialization

In `init_state()`:
- If parameter numel < `min_8bit_size` (default 4096): always use 32-bit state (too small for
  quantization to help)
- 32-bit state: `state1 = zeros_like(p, dtype=float32)`
- 8-bit state: `state1 = zeros_like(p, dtype=uint8)` + quantization maps + absmax buffers
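
The size rule above amounts to a one-line decision. A minimal sketch (the function name `choose_state_dtype` and the string labels are illustrative, not library API):

```python
def choose_state_dtype(numel, optim_bits, min_8bit_size=4096):
    """Pick the optimizer state dtype for a parameter with `numel` elements."""
    if optim_bits == 8 and numel >= min_8bit_size:
        return "uint8"     # quantized state plus qmap and absmax buffers
    return "float32"       # small tensors and 32-bit optimizers


bias_state = choose_state_dtype(numel=1024, optim_bits=8)
weight_state = choose_state_dtype(numel=4096 * 4096, optim_bits=8)
```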

### 8-bit optimizer state compression

For 8-bit optimizers, the optimizer states (momentum, variance) are stored as uint8 and
dynamically quantized/dequantized each step:

1. Each state tensor is divided into blocks of 256 elements
2. Per-block `absmax` values are maintained (float32)
3. A quantization map (`qmap`) maps 256 uint8 values to float32 values
4. The kernel reads uint8 state → dequantizes → applies update → re-quantizes → writes back
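
The four steps can be sketched in pure Python. This is a simplified model under stated assumptions: the `QMAP` below is a hypothetical linear map over [-1, 1] (the real quantization maps are non-linear), the update is a plain exponential moving average, and all function names are illustrative.

```python
import bisect

BLOCKSIZE = 256
# Hypothetical linear quantization map: 256 sorted float values in [-1, 1].
QMAP = [-1.0 + 2.0 * i / 255 for i in range(256)]


def nearest_code(x):
    """Index of the QMAP entry closest to x (x is expected in [-1, 1])."""
    i = bisect.bisect_left(QMAP, x)
    if i == 0:
        return 0
    if i == len(QMAP):
        return len(QMAP) - 1
    return i if QMAP[i] - x < x - QMAP[i - 1] else i - 1


def quantize_blockwise(values, blocksize=BLOCKSIZE):
    """Return (uint8 codes, per-block absmax) for a flat list of floats."""
    codes, absmax = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        absmax.append(scale)
        codes.extend(nearest_code(v / scale) for v in block)
    return codes, absmax


def dequantize_blockwise(codes, absmax, blocksize=BLOCKSIZE):
    """Invert quantize_blockwise up to quantization error."""
    return [QMAP[c] * absmax[i // blocksize] for i, c in enumerate(codes)]


def update_step(codes, absmax, grads, beta=0.9):
    """Dequantize the state, apply an EMA update, and re-quantize it."""
    state = dequantize_blockwise(codes, absmax)
    state = [beta * s + (1 - beta) * g for s, g in zip(state, grads)]
    return quantize_blockwise(state)


values = [i / 100 - 3 for i in range(600)]   # 600 elements span 3 blocks
codes, absmax = quantize_blockwise(values)
restored = dequantize_blockwise(codes, absmax)
```

With a linear map the round-trip error per element is bounded by `absmax / 255` for its block, which is why per-block scaling matters: one large value only degrades precision inside its own block.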

### Paged optimizers

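Paged optimizers allocate state tensors with CUDA unified memory (`cudaMallocManaged`) when the
parameter has at least 100K (1e5) elements; smaller states use ordinary device memory. Unified
memory allows automatic CPU↔GPU page migration, so state pages that are not currently needed can
live in host memory, which reduces GPU memory pressure: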

```python
def get_state_buffer(self, p, dtype):
    if not self.is_paged or p.numel() < 1e5:
        return torch.zeros_like(p, dtype=dtype, device=p.device)
    else:
        buff = F.get_paged(*p.shape, dtype=dtype, device=p.device)  # cudaMallocManaged
        ...
```

### GlobalOptimManager

Singleton that allows per-parameter optimizer config overrides:

```python
mng = bnb.optim.GlobalOptimManager.get_instance()
mng.register_parameters(model.parameters())
mng.override_config(model.fc1.weight, 'optim_bits', 32)  # Force 32-bit for this param
```

Used by `StableEmbedding` and `Embedding` to force 32-bit optimizer states for embedding layers.

### FSDP compatibility

`Optimizer8bit` overrides `state_dict()` and `load_state_dict()` to wrap quantization-specific
tensors (state1, state2, absmax, qmap, etc.) in a nested dict. This prevents FSDP's
`full_optim_state_dict` from trying to gather these tensors across ranks (they have different
shapes than the parameter tensors, which would cause gather failures).
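
The wrapping idea can be sketched as a pair of pure functions over one parameter's state dict. The key set `QUANT_KEYS` and the nested key name `NESTED_KEY` are hypothetical placeholders, not the names the library uses.

```python
# Hypothetical names for illustration only.
QUANT_KEYS = {"state1", "state2", "absmax1", "absmax2", "qmap1", "qmap2"}
NESTED_KEY = "quant_state"


def wrap(param_state):
    """Move quantization-specific entries into a nested dict before saving."""
    plain = {k: v for k, v in param_state.items() if k not in QUANT_KEYS}
    nested = {k: v for k, v in param_state.items() if k in QUANT_KEYS}
    if nested:
        plain[NESTED_KEY] = nested
    return plain


def unwrap(saved_state):
    """Flatten the nested dict back into the layout the optimizer expects."""
    restored = {k: v for k, v in saved_state.items() if k != NESTED_KEY}
    restored.update(saved_state.get(NESTED_KEY, {}))
    return restored


state = {"step": 10, "state1": [1, 2, 3], "absmax1": [0.5], "qmap1": [0.0, 1.0]}
saved = wrap(state)
```

A tool that only inspects top-level entries then sees `step` and a single opaque dict, rather than tensors whose shapes do not match the parameter.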

---

## 12. CUDA/C++ Native Code (`csrc/`)

### File organization

| File | Purpose |
|---|---|
| `kernels.cu` | `__global__` CUDA kernel functions (kQuantizeBlockwise, kOptimizer*, etc.) |
| `ops.cu` | Host-side dispatch functions that launch kernels with grid/block configs |
| `pythonInterface.cpp` | C-linkage wrappers for ctypes: unmangled function names, macro-expanded per dtype |
| `ops.cuh` | Declarations for ops.cu functions + cuBLAS/cuSPARSE context classes |
| `kernels.cuh` | Declarations for kernel functions |
| `common.cuh` | Compute capability macros and constants |
| `cpu_ops.cpp` / `cpu_ops.h` | CPU-native implementations (blockwise quant, etc.) |

### The call chain: Python → C

```
Python: lib.cquantize_blockwise_fp16(code_ptr, A_ptr, absmax_ptr, out_ptr, blocksize, n)
   ↓
pythonInterface.cpp: void cquantize_blockwise_fp16(...)
   calls → quantizeBlockwise<half, 0, 0>(code, A, absmax, out, NULL, 0, blocksize, n)
   ↓
ops.cu: template<T, STOCHASTIC, DATA_TYPE> void quantizeBlockwise(...)
   launches → kQuantizeBlockwise<half, 4096, 4, 0, 0><<<num_blocks, 1024>>>(...)
   ↓
kernels.cu: __global__ void kQuantizeBlockwise<T, BLOCK_SIZE, NUM_PER_TH, STOCHASTIC, DATA_TYPE>(...)
   actual CUDA computation
```

### Naming convention in pythonInterface.cpp

Functions are generated via macros to cover all dtype combinations:

```cpp
#define MAKE_FUNC_BLOCKWISE(fname, optim_name, gtype, gbits) \
    void c##fname##_blockwise_##gbits(...)                    \
    { fname##Blockwise<gtype, optim_name>(...); }

// Expands to:
// void cquantize_blockwise_fp16(...)
// void cquantize_blockwise_bf16(...)
// void cquantize_blockwise_fp32(...)
```

Similarly for optimizers:
```cpp
MAKE_FUNC32(cadam, ADAM, float, fp32)
MAKE_FUNC32(cadam, ADAM, half, fp16)
MAKE_FUNC32(cadam, ADAM, __nv_bfloat16, bf16)
// → cadam32bit_grad_fp32, cadam32bit_grad_fp16, cadam32bit_grad_bf16
```

4-bit functions use a separate naming pattern:
```cpp
// void cquantize_blockwise_fp16_nf4(...)  ← 4-bit NF4 with fp16 input
// void cquantize_blockwise_bf16_fp4(...)  ← 4-bit FP4 with bf16 input
```
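
On the Python side, the naming patterns above make it possible to build the symbol name from its parts. A minimal sketch, with hypothetical helper names, that only reproduces the string patterns listed in this section:

```python
def blockwise_symbol(op, dtype, quant_type=None):
    """Build names like cquantize_blockwise_fp16 or cquantize_blockwise_fp16_nf4."""
    name = f"c{op}_blockwise_{dtype}"
    if quant_type is not None:
        name += f"_{quant_type}"   # 4-bit variants append the quant type
    return name


def optimizer32_symbol(optimizer, dtype):
    """Build names like cadam32bit_grad_bf16."""
    return f"c{optimizer}32bit_grad_{dtype}"


eight_bit = blockwise_symbol("quantize", "fp16")
four_bit = blockwise_symbol("quantize", "bf16", quant_type="fp4")
adam = optimizer32_symbol("adam", "bf16")
```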

### Optimizer kernel organization

The CUDA optimizer kernels handle all optimizer types via a single templated kernel, switched on
the `OPTIMIZER` template parameter:

```cpp
enum Optimizer_t {
    ADAM = 0,
    MOMENTUM = 1,
    RMSPROP = 2,
    LARS = 3,
    ADAGRAD = 4,
    LION = 5,
    ADEMAMIX = 6
};

template <typename T, int OPTIMIZER>
__global__ void kOptimizer32bit2State(...) {
    switch (OPTIMIZER) {
        case ADAM: ...
        case ADEMAMIX: ...
    }
}

template <typename T, int OPTIMIZER>
__global__ void kOptimizer32bit1State(...) {
    switch (OPTIMIZER) {
        case MOMENTUM: ...
        case LION: ...
        case RMSPROP: ...
        case ADAGRAD: ...
    }
}
```

### Compute capability handling

From `common.cuh`:
```cpp
#define BNB_CC_VOLTA 700
#define BNB_CC_VOLTA_XAVIER 720
#define BNB_CC_TURING 750
#define BNB_CC_AMPERE 800
#define BNB_CC_ADA 890
#define BNB_CC_HOPPER 900
#define BNB_CC_BLACKWELL 1000

#define BNB_FP16_MMA_AVAILABLE (__CUDA_ARCH__ >= BNB_CC_VOLTA)      // sm_70+
#define BNB_INT8_MMA_AVAILABLE (__CUDA_ARCH__ >= BNB_CC_VOLTA_XAVIER) // sm_72+
#define BNB_BF16_AVAILABLE (__CUDA_ARCH__ >= BNB_CC_AMPERE)         // sm_80+
#define BNB_FP8_AVAILABLE (__CUDA_ARCH__ >= BNB_CC_ADA)             // sm_89+
```
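
The same gates can be mirrored in Python for reasoning about a given GPU. This sketch relies on `__CUDA_ARCH__` being `major * 100 + minor * 10`; the value 720 for the Xavier threshold is inferred from the `sm_72+` comment above, and the function names are hypothetical.

```python
# Thresholds copied from the macros above; 720 is inferred from the "sm_72+" comment.
BNB_CC_VOLTA = 700
BNB_CC_VOLTA_XAVIER = 720
BNB_CC_AMPERE = 800
BNB_CC_ADA = 890


def cuda_arch(major, minor):
    """Value of __CUDA_ARCH__ for compute capability major.minor."""
    return major * 100 + minor * 10


def features(major, minor):
    """Which feature gates are open for a device of the given capability."""
    arch = cuda_arch(major, minor)
    return {
        "fp16_mma": arch >= BNB_CC_VOLTA,
        "int8_mma": arch >= BNB_CC_VOLTA_XAVIER,
        "bf16": arch >= BNB_CC_AMPERE,
        "fp8": arch >= BNB_CC_ADA,
    }


turing = features(7, 5)
ada = features(8, 9)
```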

Thread/block limits per architecture:
```cpp
// Turing (sm_75): 1024 max threads per SM
// Ampere (sm_80): 2048 max threads per SM
// Ampere GA10x and Ada (sm_86-89): 1536 max threads per SM
// Others: 2048 max threads per SM
```

### int8 matmul via cuBLASLt

The `igemmlt` function in `ops.cu` calls cuBLASLt for int8 × int8 → int32 matmul:

```cpp
template <int DTYPE_OUT, int SCALE_ROWS>
int igemmlt(cublasLtHandle_t ltHandle, int m, int n, int k,
            const int8_t *A, const int8_t *B, void *C,
            float *row_scale, int lda, int ldb, int ldc, cudaStream_t stream);
```

This is the performance-critical path for LLM.int8(). When inner dimensions are not divisible
by 4, the CUDA backend falls back to fp32 matmul (cuBLASLt requirement).
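
A pure-Python model of the two points above: integer accumulation of int8 products, and the divisibility check that selects the fallback. The function names and returned labels are illustrative, and the `A @ B.T` layout is an assumption of this sketch.

```python
def choose_int8_path(k):
    """Select the matmul path from the inner dimension k."""
    return "cublaslt_int8" if k % 4 == 0 else "fp32_fallback"


def int8_linear_matmul(A, B):
    """Compute A @ B.T with integer accumulation; A is m x k, B is n x k."""
    for row in A + B:
        if not all(-128 <= v <= 127 for v in row):
            raise ValueError("inputs must fit in int8")
    # Python ints do not overflow; on the GPU the accumulator is int32.
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]


acc = int8_linear_matmul([[127, -128, 1, 0]], [[1, 1, 1, 1], [2, 0, 0, 0]])
path_ok = choose_int8_path(4096)
path_fallback = choose_int8_path(4098)
```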

### Quantization kernel design

The blockwise quantization kernels process data in blocks (typically 64-4096 elements):

1. Each CUDA block handles one quantization block
2. Shared memory is used for block-level reduction (finding absmax)
3. Each thread processes `NUM_PER_TH` elements (typically 2-8)
4. CUB block-level primitives are used for reductions (`BlockReduce`)

For 4-bit: two values are packed per byte. A specialized kernel `kQuantizeBlockwise32` handles
the smallest blocksize (32) by processing 2 quantization blocks per warp.
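
Packing two 4-bit codes per byte can be sketched as follows. The nibble order (first value in the high nibble) and the zero padding for odd lengths are assumptions of this sketch, and the function names are hypothetical.

```python
def pack_4bit(codes):
    """Pack 4-bit codes (0..15) two per byte, first code in the high nibble."""
    codes = list(codes)
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must be in 0..15")
    if len(codes) % 2:
        codes.append(0)  # pad odd-length input with a zero code
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


def unpack_4bit(packed, n):
    """Recover the first n 4-bit codes from packed bytes."""
    out = []
    for b in packed:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out[:n]


packed = pack_4bit([1, 2, 15, 0, 7])
restored = unpack_4bit(packed, 5)
```

Storage is halved relative to one code per byte, at the cost of a shift and mask on every read.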

### ROCm/HIP support

ROCm uses separate source files (`ops.hip`, `kernels.hip`, etc.) that mirror the CUDA versions
with HIP API translations. Key difference: ROCm uses warp size 64 on some architectures
(vs CUDA's 32), tracked by `ROCM_WARP_SIZE_64`. This affects allowed blocksizes:
- CUDA: blocksizes 32, 64, 128, 256, 512, 1024, 2048, 4096
- ROCm (warp 64): blocksizes 64, 128, 256, 512, 1024, 2048, 4096 (no 32)

---

## 13. Build System (`CMakeLists.txt`)

### Build configurations

The `COMPUTE_BACKEND` CMake variable selects the target:

| Backend | Library name | Languages | Dependencies |
|---|---|---|---|
| `cpu` | `libbitsandbytes_cpu.so` | C++17 | OpenMP (optional) |
| `cuda` | `libbitsandbytes_cuda{VER}.so` | C++17 + CUDA | cudart, cublas, cublasLt |
| `hip` | `libbitsandbytes_rocm{VER}.so` | C++17 + HIP | hipblas, hiprand |
| `mps` | `libbitsandbytes_mps.dylib` | C++17 + ObjC++ | Metal framework |
| `xpu` | `libbitsandbytes_xpu.so` | C++20 + SYCL | Intel oneAPI |

### CUDA architecture targeting

By default, the build targets all architectures supported by the detected CUDA toolkit:

```cmake
# CUDA 12.8+: sm_50 through sm_121
# CUDA 13.0+: sm_75 through sm_121 (drops pre-Turing)
```

Users can override with `-DCOMPUTE_CAPABILITY="89;90;100"`.

The build generates native cubin for all selected architectures, plus PTX for the highest
(enabling forward compatibility with future GPUs).
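
To make the cubin-plus-PTX rule concrete, the sketch below turns a `COMPUTE_CAPABILITY` string into the equivalent `nvcc -gencode` flags. It illustrates the rule only; it is not how `CMakeLists.txt` is written, and the function name is hypothetical.

```python
def gencode_flags(compute_capability):
    """Translate a string like "89;90;100" into nvcc -gencode flags."""
    caps = sorted(compute_capability.split(";"), key=int)
    # Native machine code (cubin) for every selected architecture.
    flags = [f"-gencode arch=compute_{cc},code=sm_{cc}" for cc in caps]
    # PTX for the highest one, so newer GPUs can JIT-compile it.
    flags.append(f"-gencode arch=compute_{caps[-1]},code=compute_{caps[-1]}")
    return flags


flags = gencode_flags("89;90;100")
```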

### CPU-specific flags

For x86_64:
```cmake
-mavx512f -mavx512dq -mavx512bw -mavx512vl    # AVX-512 if supported
-mavx512bf16                                     # BF16 instructions if supported
-mprefer-vector-width=256 -mfma -mavx2          # Always
```

### Supported CUDA v
SYMBOL INDEX (830 symbols across 67 files)

FILE: .github/scripts/auditwheel_show.py
  function main (line 5) | def main():

FILE: .github/scripts/set_platform_tag.py
  function get_platform_tag (line 6) | def get_platform_tag(architecture):
  function main (line 21) | def main():

FILE: agents/fetch_issues.py
  function gh_graphql (line 69) | def gh_graphql(query: str, variables: dict) -> dict:
  function transform_reactions (line 84) | def transform_reactions(reaction_groups: list) -> dict:
  function transform_timeline_event (line 94) | def transform_timeline_event(event: dict) -> dict | None:
  function transform_issue (line 122) | def transform_issue(raw: dict) -> dict:
  function fetch_all_issues (line 161) | def fetch_all_issues(owner: str, repo: str, states: list[str] | None = N...
  function main (line 223) | def main():

FILE: agents/query_issues.py
  function load_data (line 162) | def load_data(path: str) -> dict:
  function all_issues (line 167) | def all_issues(data: dict) -> list[dict]:
  function format_compact (line 171) | def format_compact(issue: dict) -> str:
  function format_list_line (line 182) | def format_list_line(issue: dict) -> str:
  function format_detail (line 201) | def format_detail(issue: dict, brief: bool = False) -> str:
  function tokenize (line 274) | def tokenize(text: str) -> set[str]:
  function extract_signatures (line 285) | def extract_signatures(text: str) -> set[str]:
  function find_related (line 314) | def find_related(target: dict, issues: list[dict], state_filter: str | N...
  function format_related_result (line 345) | def format_related_result(score, issue, sig_ol, tok_ol, verbose=False):
  function cmd_list (line 367) | def cmd_list(args, data):
  function cmd_search (line 403) | def cmd_search(args, data):
  function cmd_related (line 436) | def cmd_related(args, data):
  function cmd_batch_related (line 458) | def cmd_batch_related(args, data):
  function cmd_show (line 486) | def cmd_show(args, data):
  function cmd_top (line 502) | def cmd_top(args, data):
  function cmd_stats (line 515) | def cmd_stats(args, data):
  function main (line 537) | def main():

FILE: benchmarking/inference_benchmark.py
  function parse_args (line 83) | def parse_args():
  function run_benchmark (line 120) | def run_benchmark(args, config, batch_size):

FILE: benchmarking/int8/training_benchmark.py
  function test_bench_8bit_training (line 28) | def test_bench_8bit_training(batch, seq, model, hidden):

FILE: benchmarking/matmul_benchmark.py
  function test_bench_matmul (line 30) | def test_bench_matmul(batch, seq, model, hidden):

FILE: benchmarking/optimizer_benchmark.py
  function test_stream_optimizer_bench (line 23) | def test_stream_optimizer_bench(dim1, gtype, optim_name, mode):

FILE: benchmarking/xpu/inference_benchmark.py
  function get_inputs (line 34) | def get_inputs(tokenizer):
  function get_streamer (line 45) | def get_streamer(tokenizer):
  class Streamer (line 51) | class Streamer:
    method __init__ (line 52) | def __init__(self, tokenizer, print_median=False):
    method put (line 57) | def put(self, t):
    method print_report (line 68) | def print_report(self):
    method end (line 78) | def end(self, *args):
  function parse_arguments (line 82) | def parse_arguments():

FILE: bitsandbytes/__init__.py
  function _import_backends (line 52) | def _import_backends():

FILE: bitsandbytes/_ops.py
  function _ (line 26) | def _(
  function _ (line 53) | def _(
  function _ (line 72) | def _(A: torch.Tensor, B: torch.Tensor):
  function _ (line 88) | def _(A: torch.Tensor, B: torch.Tensor, out: torch.Tensor):
  function _ (line 105) | def _(A: torch.Tensor, threshold=0.0):
  function _ (line 121) | def _(A: torch.Tensor, stats: torch.Tensor) -> torch.Tensor:
  function _ (line 128) | def _(A: torch.Tensor, stats: torch.Tensor):
  function _ (line 140) | def _(
  function _ (line 158) | def _(
  function _ (line 178) | def _(
  function _ (line 197) | def _(
  function _ (line 219) | def _(
  function _ (line 238) | def _(A: torch.Tensor, absmax: torch.Tensor, code: torch.Tensor, blocksi...
  function _ (line 251) | def _(
  function _ (line 265) | def _(A: torch.Tensor, code: torch.Tensor, blocksize: int) -> tuple[torc...
  function _ (line 281) | def _(
  function _ (line 305) | def _(
  function _ (line 339) | def _(
  function _ (line 382) | def _(

FILE: bitsandbytes/autograd/_functions.py
  class GlobalOutlierPooler (line 25) | class GlobalOutlierPooler:
    method __init__ (line 28) | def __init__(self):
    method initialize (line 31) | def initialize(self):
    method get_instance (line 36) | def get_instance(cls):
    method add_outliers (line 42) | def add_outliers(self, outlier_idx, feature_dim):
    method get_current_outlier_idx (line 50) | def get_current_outlier_idx(self):
  class MatmulLtState (line 58) | class MatmulLtState:
    method __getattr__ (line 82) | def __getattr__(self, name):
    method reset_grads (line 92) | def reset_grads(self):
  class MatMul8bitLt (line 101) | class MatMul8bitLt(torch.autograd.Function):
    method forward (line 103) | def forward(
    method backward (line 202) | def backward(ctx: torch.autograd.function.FunctionCtx, grad_output: to...
  class MatMul8bitFp (line 245) | class MatMul8bitFp(torch.autograd.Function):
    method forward (line 252) | def forward(ctx, A, B, out=None, bias=None, state=MatmulLtState):
    method backward (line 274) | def backward(ctx, grad_output):
  class MatMul4Bit (line 300) | class MatMul4Bit(torch.autograd.Function):
    method forward (line 304) | def forward(ctx, A, B, out=None, bias=None, quant_state: Optional[F.Qu...
    method backward (line 337) | def backward(ctx, grad_output):
  function matmul (line 359) | def matmul(
  function matmul_4bit (line 377) | def matmul_4bit(

FILE: bitsandbytes/backends/cpu/ops.py
  function _ (line 25) | def _(A: torch.Tensor, B: torch.Tensor):
  function _ (line 35) | def _(A: torch.Tensor, code: torch.Tensor, blocksize: int) -> tuple[torc...
  function _ (line 77) | def _(
  function _ (line 124) | def _(
  function _ (line 243) | def _(

FILE: bitsandbytes/backends/cuda/ops.py
  function _ (line 15) | def _(A: torch.Tensor, B: torch.Tensor):
  function _ (line 21) | def _(A: torch.Tensor, B: torch.Tensor, out: torch.Tensor):
  function _int8_linear_matmul_impl (line 25) | def _int8_linear_matmul_impl(A: torch.Tensor, B: torch.Tensor, out: torc...
  function _ (line 89) | def _(
  function _ (line 128) | def _(A: torch.Tensor, threshold=0.0):
  function _ (line 170) | def _(
  function _get_col_absmax (line 189) | def _get_col_absmax(
  function _ (line 211) | def _(A: torch.Tensor, code: torch.Tensor, blocksize: int) -> tuple[torc...
  function _ (line 247) | def _(A: torch.Tensor, absmax: torch.Tensor, code: torch.Tensor, blocksi...
  function _ (line 254) | def _(
  function _dequantize_blockwise_impl (line 267) | def _dequantize_blockwise_impl(
  function _ (line 299) | def _(
  function _ (line 346) | def _(
  function _ (line 360) | def _(
  function _dequantize_4bit_impl (line 374) | def _dequantize_4bit_impl(
  function _ (line 420) | def _(
  function _ (line 430) | def _(
  function _gemv_4bit_impl (line 447) | def _gemv_4bit_impl(
  function _optimizer_update_32bit_impl (line 609) | def _optimizer_update_32bit_impl(
  function _optimizer_update_8bit_blockwise_impl (line 668) | def _optimizer_update_8bit_blockwise_impl(

FILE: bitsandbytes/backends/default/ops.py
  function _try_torch_compile (line 12) | def _try_torch_compile(func=None, **compile_kwargs):
  function _ (line 39) | def _(
  function _ (line 62) | def _(
  function _ (line 101) | def _(
  function _ (line 120) | def _(A: torch.Tensor, B: torch.Tensor):
  function _ (line 125) | def _(A: torch.Tensor, B: torch.Tensor, out: torch.Tensor):
  function _int8_linear_matmul_impl (line 130) | def _int8_linear_matmul_impl(A: torch.Tensor, B: torch.Tensor, out: Opti...
  function _ (line 139) | def _(A: torch.Tensor, threshold=0.0):
  function _ (line 177) | def _(A: torch.Tensor, code: torch.Tensor, blocksize: int) -> tuple[torc...
  function _ (line 203) | def _(A: torch.Tensor, absmax: torch.Tensor, code: torch.Tensor, blocksi...
  function _ (line 220) | def _(
  function _dequantize_4bit_impl (line 265) | def _dequantize_4bit_impl(
  function _ (line 312) | def _(
  function _ (line 331) | def _(
  function _optimizer_precondition_32bit (line 369) | def _optimizer_precondition_32bit(
  function _optimizer_update_32bit (line 430) | def _optimizer_update_32bit(
  function _ (line 543) | def _(

FILE: bitsandbytes/backends/hpu/ops.py
  function _reverse_4bit_compress_format (line 12) | def _reverse_4bit_compress_format(weight: torch.Tensor):
  function _ (line 20) | def _(

FILE: bitsandbytes/backends/mps/ops.py
  function _get_kernel (line 21) | def _get_kernel():
  function _ (line 36) | def _(
  function _dequantize_4bit_impl (line 56) | def _dequantize_4bit_impl(
  function _ (line 74) | def _(
  function _ (line 88) | def _(
  function _gemv_4bit_impl (line 104) | def _gemv_4bit_impl(
  function _ (line 123) | def _(
  function _ (line 135) | def _(

FILE: bitsandbytes/backends/triton/kernels_4bit.py
  function quantize_fp4_blockwise_kernel (line 20) | def quantize_fp4_blockwise_kernel(
  function quantize_nf4_blockwise_kernel (line 87) | def quantize_nf4_blockwise_kernel(
  function quantize_4bit_blockwise_triton (line 157) | def quantize_4bit_blockwise_triton(A, blocksize, quant_type, blocks, abs...
  function dequant_4bit_body_util (line 183) | def dequant_4bit_body_util(a, offsets, quant_ptr, absmax_ptr, n_elems, Q...
  function dequantize_fp4_tree (line 205) | def dequantize_fp4_tree(val, absmax):
  function dequant_fp4_body_util (line 229) | def dequant_fp4_body_util(a, offsets, absmax_ptr, n_elems, QUANT_BLOCK: ...
  function dequantize_nf4_tree (line 245) | def dequantize_nf4_tree(val):
  function dequant_nf4_body_util (line 285) | def dequant_nf4_body_util(a, offsets, absmax_ptr, n_elems, QUANT_BLOCK: ...
  function dequant_4bit_kernel (line 334) | def dequant_4bit_kernel(
  function dequant_fp4_kernel (line 378) | def dequant_fp4_kernel(
  function dequant_nf4_kernel (line 420) | def dequant_nf4_kernel(
  function dequantize_4bit_impl (line 450) | def dequantize_4bit_impl(
  function dequantize_4bit_impl_passing_code (line 475) | def dequantize_4bit_impl_passing_code(
  function quantize_4bit_blockwise_kernel (line 519) | def quantize_4bit_blockwise_kernel(

FILE: bitsandbytes/backends/triton/kernels_8bit_quant.py
  function dequant_8bit_kernel (line 28) | def dequant_8bit_kernel(
  function dequant_8bit_blockwise (line 45) | def dequant_8bit_blockwise(
  function quantize_8bit_blockwise_kernel (line 84) | def quantize_8bit_blockwise_kernel(
  function quantize_blockwise_triton (line 107) | def quantize_blockwise_triton(A, code, blocksize, absmax=None, out=None):
  function quantize_8bit_blockwise_kernel_util (line 137) | def quantize_8bit_blockwise_kernel_util(
  function dequant_8bit_blockwise_kernel_util (line 180) | def dequant_8bit_blockwise_kernel_util(

FILE: bitsandbytes/backends/triton/kernels_optim.py
  function _optimizer_precondition_2state_32bit (line 36) | def _optimizer_precondition_2state_32bit(
  function _optimizer_precondition_1state_32bit (line 91) | def _optimizer_precondition_1state_32bit(
  function _optimizer_update_2state_32bit_triton_kernel (line 149) | def _optimizer_update_2state_32bit_triton_kernel(
  function _optimizer_update_1state_32bit_triton_kernel (line 234) | def _optimizer_update_1state_32bit_triton_kernel(
  function optimizer_update_32bit_impl (line 339) | def optimizer_update_32bit_impl(
  function _dequantize_blockwise_pytorch (line 488) | def _dequantize_blockwise_pytorch(
  function _quantize_blockwise_pytorch (line 523) | def _quantize_blockwise_pytorch(
  function optimizer_update_8bit_blockwise_pytorch (line 562) | def optimizer_update_8bit_blockwise_pytorch(
  function optimizer_update_8bit_blockwise_triton_quant (line 709) | def optimizer_update_8bit_blockwise_triton_quant(
  function _optimizer_update_1state_8bit_blockwise_triton_kernel (line 856) | def _optimizer_update_1state_8bit_blockwise_triton_kernel(
  function _optimizer_update_2state_8bit_blockwise_triton_kernel (line 939) | def _optimizer_update_2state_8bit_blockwise_triton_kernel(
  function optimizer_update_8bit_blockwise_impl (line 1076) | def optimizer_update_8bit_blockwise_impl(

FILE: bitsandbytes/backends/triton/ops.py
  function quantize_blockwise (line 17) | def quantize_blockwise(A: torch.Tensor, code: torch.Tensor, blocksize: i...
  function dequantize_blockwise (line 25) | def dequantize_blockwise(
  function dequantize_blockwise_inplace (line 42) | def dequantize_blockwise_inplace(
  function quantize_4bit (line 67) | def quantize_4bit(
  function dequantize_4bit (line 104) | def dequantize_4bit(
  function dequantize_4bit_inplace (line 133) | def dequantize_4bit_inplace(
  function gemv_4bit (line 148) | def gemv_4bit(
  function optimizer_update_8bit_blockwise (line 185) | def optimizer_update_8bit_blockwise(
  function optimizer_update_32bit (line 264) | def optimizer_update_32bit(

FILE: bitsandbytes/backends/utils.py
  function get_gaudi_sw_version (line 66) | def get_gaudi_sw_version():

FILE: bitsandbytes/backends/xpu/ops.py
  function _ (line 20) | def _(A: torch.Tensor, B: torch.Tensor):
  function _dequantize_4bit_impl (line 27) | def _dequantize_4bit_impl(
  function _dequantize_blockwise_impl (line 61) | def _dequantize_blockwise_impl(
  function _gemv_4bit_impl (line 81) | def _gemv_4bit_impl(
  function _ (line 165) | def _(
  function _ (line 178) | def _(
  function _ (line 186) | def _(
  function _ (line 199) | def _(
  function _ (line 213) | def _(

FILE: bitsandbytes/cextension.py
  function get_cuda_bnb_library_path (line 22) | def get_cuda_bnb_library_path(cuda_specs: CUDASpecs) -> Path:
  class BNBNativeLibrary (line 60) | class BNBNativeLibrary:
    method __init__ (line 64) | def __init__(self, lib: ct.CDLL):
    method __getattr__ (line 68) | def __getattr__(self, name):
    method __getitem__ (line 82) | def __getitem__(self, item):
  class CudaBNBNativeLibrary (line 86) | class CudaBNBNativeLibrary(BNBNativeLibrary):
    method __init__ (line 89) | def __init__(self, lib: ct.CDLL):
  class XpuBNBNativeLibrary (line 95) | class XpuBNBNativeLibrary(BNBNativeLibrary):
    method __init__ (line 98) | def __init__(self, lib: ct.CDLL):
  function get_available_cuda_binary_versions (line 104) | def get_available_cuda_binary_versions() -> list[str]:
  function parse_cuda_version (line 119) | def parse_cuda_version(version_str: str) -> str:
  class ErrorHandlerMockBNBNativeLibrary (line 126) | class ErrorHandlerMockBNBNativeLibrary(BNBNativeLibrary):
    method __init__ (line 147) | def __init__(self, error_msg: str):
    method _format_lib_error_message (line 175) | def _format_lib_error_message(
    method _format_dependency_error (line 258) | def _format_dependency_error(self) -> str:
    method __getattr__ (line 286) | def __getattr__(self, name):
    method __getitem__ (line 294) | def __getitem__(self, name):
  function get_native_library (line 298) | def get_native_library() -> BNBNativeLibrary:

FILE: bitsandbytes/cuda_specs.py
  class CUDASpecs (line 13) | class CUDASpecs:
    method has_imma (line 19) | def has_imma(self) -> bool:
  function get_compute_capabilities (line 23) | def get_compute_capabilities() -> list[tuple[int, int]]:
  function get_cuda_version_tuple (line 28) | def get_cuda_version_tuple() -> Optional[tuple[int, int]]:
  function get_cuda_version_string (line 46) | def get_cuda_version_string() -> Optional[str]:
  function get_cuda_specs (line 55) | def get_cuda_specs() -> Optional[CUDASpecs]:
  function get_rocm_gpu_arch (line 82) | def get_rocm_gpu_arch() -> str:
  function get_rocm_warpsize (line 114) | def get_rocm_warpsize() -> int:

FILE: bitsandbytes/diagnostics/cuda.py
  function find_cuda_libraries_in_path_list (line 47) | def find_cuda_libraries_in_path_list(paths_list_candidate: str) -> Itera...
  function is_relevant_candidate_env_var (line 69) | def is_relevant_candidate_env_var(env_var: str, value: str) -> bool:
  function get_potentially_lib_path_containing_env_vars (line 82) | def get_potentially_lib_path_containing_env_vars() -> dict[str, str]:
  function find_cudart_libraries (line 86) | def find_cudart_libraries() -> Iterator[Path]:
  function _print_cuda_diagnostics (line 110) | def _print_cuda_diagnostics(cuda_specs: CUDASpecs) -> None:
  function _print_hip_diagnostics (line 135) | def _print_hip_diagnostics(cuda_specs: CUDASpecs) -> None:
  function print_diagnostics (line 164) | def print_diagnostics(cuda_specs: CUDASpecs) -> None:
  function _print_cuda_runtime_diagnostics (line 171) | def _print_cuda_runtime_diagnostics() -> None:
  function _print_hip_runtime_diagnostics (line 198) | def _print_hip_runtime_diagnostics() -> None:
  function print_runtime_diagnostics (line 225) | def print_runtime_diagnostics() -> None:

FILE: bitsandbytes/diagnostics/main.py
  function sanity_check (line 30) | def sanity_check():
  function get_package_version (line 45) | def get_package_version(name: str) -> str:
  function show_environment (line 53) | def show_environment():
  function main (line 73) | def main():

FILE: bitsandbytes/diagnostics/utils.py
  function print_header (line 6) | def print_header(txt: str, width: int = HEADER_WIDTH, filler: str = "=")...
  function print_dedented (line 11) | def print_dedented(text):

FILE: bitsandbytes/functional.py
  class GlobalPageManager (line 24) | class GlobalPageManager:
    method __init__ (line 27) | def __init__(self):
    method initialize (line 30) | def initialize(self):
    method get_instance (line 34) | def get_instance(cls):
    method prefetch_all (line 40) | def prefetch_all(self, to_cpu=False):
  class CUBLAS_Context (line 48) | class CUBLAS_Context:
    method __init__ (line 51) | def __init__(self):
    method initialize (line 54) | def initialize(self):
    method get_instance (line 58) | def get_instance(cls):
    method get_context (line 64) | def get_context(self, device):
  function _cuda_device_of (line 81) | def _cuda_device_of(a: torch.Tensor):
  function _cuda_device_of (line 86) | def _cuda_device_of(a: torch.Tensor):
  function get_paged (line 90) | def get_paged(*shape, dtype=torch.float32, device=FIRST_CUDA_DEVICE):
  function prefetch_tensor (line 101) | def prefetch_tensor(A: torch.Tensor, to_cpu=False):
  function elementwise_func (line 111) | def elementwise_func(func_name, A, B, value, prefetch=True):
  function fill (line 141) | def fill(A, value, device=None, prefetch=True):
  function _mul (line 145) | def _mul(A, B, device=None):
  function create_linear_map (line 149) | def create_linear_map(signed=True, total_bits=8, add_zero=True):
  function create_normal_map (line 168) | def create_normal_map(offset=0.9677083, use_extra_value=True):
  function create_fp8_map (line 226) | def create_fp8_map(signed=True, exponent_bits=5, precision_bits=2, total...
  function create_dynamic_map (line 295) | def create_dynamic_map(signed=True, max_exponent_bits=7, total_bits=8):
  function is_on_gpu (line 350) | def is_on_gpu(tensors: Iterable[Optional[torch.Tensor]]):
  function _get_tensor_stream (line 386) | def _get_tensor_stream(tensor: Tensor) -> ct.c_void_p:
  function get_ptr (line 398) | def get_ptr(A: Optional[Tensor]) -> Optional[ct.c_void_p]:
  class QuantState (line 413) | class QuantState:
    method __init__ (line 433) | def __init__(
    method __getattr__ (line 454) | def __getattr__(self, name):
    method __getitem__ (line 466) | def __getitem__(self, idx):
    method from_dict (line 487) | def from_dict(cls, qs_dict: dict[str, Any], device: torch.device) -> "...
    method as_dict (line 538) | def as_dict(self, packed: bool = False) -> dict[str, Any]:
    method to (line 573) | def to(self, device):
    method __eq__ (line 582) | def __eq__(self, other):
  function quantize_blockwise (line 606) | def quantize_blockwise(
  function dequantize_blockwise (line 677) | def dequantize_blockwise(
  function get_4bit_type (line 754) | def get_4bit_type(typename, device=None, blocksize=64):
  function quantize_fp4 (line 844) | def quantize_fp4(
  function quantize_nf4 (line 855) | def quantize_nf4(
  function quantize_4bit (line 866) | def quantize_4bit(
  function dequantize_fp4 (line 947) | def dequantize_fp4(
  function dequantize_nf4 (line 957) | def dequantize_nf4(
  function dequantize_4bit (line 967) | def dequantize_4bit(
  function optimizer_update_32bit (line 1044) | def optimizer_update_32bit(
  function optimizer_update_8bit_blockwise (line 1133) | def optimizer_update_8bit_blockwise(
  function check_matmul (line 1179) | def check_matmul(A, B, out, transposed_A, transposed_B, expected_type=to...
  function gemv_4bit (line 1263) | def gemv_4bit(
  function igemm (line 1300) | def igemm(
  function batched_igemm (line 1401) | def batched_igemm(
  function int8_linear_matmul (line 1497) | def int8_linear_matmul(A: torch.Tensor, B: torch.Tensor, out: Optional[t...
  function int8_mm_dequant (line 1523) | def int8_mm_dequant(
  function int8_double_quant (line 1551) | def int8_double_quant(
  function int8_vectorwise_dequant (line 1602) | def int8_vectorwise_dequant(A: torch.Tensor, stats: torch.Tensor):
  function int8_vectorwise_quant (line 1616) | def int8_vectorwise_quant(A: torch.Tensor, threshold=0.0):
  function _convert_weight_packed_for_cpu (line 1637) | def _convert_weight_packed_for_cpu(qweight: torch.Tensor, quant_state: Q...
  function _convert_weight_packed_for_cpu_inverse (line 1691) | def _convert_weight_packed_for_cpu_inverse(
  function has_avx512bf16 (line 1759) | def has_avx512bf16():

FILE: bitsandbytes/nn/modules.py
  class StableEmbedding (line 28) | class StableEmbedding(torch.nn.Embedding):
    method __init__ (line 54) | def __init__(
    method reset_parameters (line 101) | def reset_parameters(self) -> None:
    method _fill_padding_idx_with_zero (line 112) | def _fill_padding_idx_with_zero(self) -> None:
    method forward (line 117) | def forward(self, input: Tensor) -> Tensor:
  class Embedding (line 134) | class Embedding(torch.nn.Embedding):
    method __init__ (line 139) | def __init__(
    method reset_parameters (line 183) | def reset_parameters(self) -> None:
    method _fill_padding_idx_with_zero (line 194) | def _fill_padding_idx_with_zero(self) -> None:
    method forward (line 199) | def forward(self, input: Tensor) -> Tensor:
  class Params4bit (line 213) | class Params4bit(torch.nn.Parameter):
    method __new__ (line 214) | def __new__(
    method __getstate__ (line 243) | def __getstate__(self):
    method __setstate__ (line 249) | def __setstate__(self, state):
    method __getattr__ (line 284) | def __getattr__(self, name):
    method __deepcopy__ (line 297) | def __deepcopy__(self, memo):
    method __copy__ (line 305) | def __copy__(self):
    method from_prequantized (line 312) | def from_prequantized(
    method _quantize (line 337) | def _quantize(self, device):
    method cpu (line 353) | def cpu(self):
    method cuda (line 356) | def cuda(self, device: Optional[int | device | str] = None, non_blocki...
    method xpu (line 361) | def xpu(self, device: Optional[int | device | str] = None, non_blockin...
    method to (line 367) | def to(
    method to (line 375) | def to(self: T, dtype: dtype | str, non_blocking: bool = ...) -> T: ...
    method to (line 378) | def to(self: T, tensor: Tensor, non_blocking: bool = ...) -> T: ...
    method to (line 380) | def to(self, *args, **kwargs):
    method __torch_function__ (line 403) | def __torch_function__(cls, func, types, args=(), kwargs=None):
  function fix_4bit_weight_quant_state_from_module (line 443) | def fix_4bit_weight_quant_state_from_module(module: Union["Embedding4bit...
  class Linear4bit (line 460) | class Linear4bit(nn.Linear):
    method __init__ (line 493) | def __init__(
    method set_compute_type (line 531) | def set_compute_type(self, x):
    method _save_to_state_dict (line 549) | def _save_to_state_dict(self, destination, prefix, keep_vars):
    method forward (line 565) | def forward(self, x: torch.Tensor):
  class LinearFP4 (line 596) | class LinearFP4(Linear4bit):
    method __init__ (line 601) | def __init__(
  class LinearNF4 (line 632) | class LinearNF4(Linear4bit):
    method __init__ (line 644) | def __init__(
  class Int8Params (line 675) | class Int8Params(torch.nn.Parameter):
    method __new__ (line 676) | def __new__(
    method _quantize (line 692) | def _quantize(self, device):
    method cpu (line 705) | def cpu(self):
    method cuda (line 708) | def cuda(self, device: Optional[int | device | str] = None, non_blocki...
    method xpu (line 711) | def xpu(self, device: Optional[int | device | str] = None, non_blockin...
    method __deepcopy__ (line 714) | def __deepcopy__(self, memo):
    method to (line 727) | def to(
    method to (line 735) | def to(self: T, dtype: dtype | str, non_blocking: bool = ...) -> T: ...
    method to (line 738) | def to(self: T, tensor: Tensor, non_blocking: bool = ...) -> T: ...
    method to (line 740) | def to(self, *args, **kwargs):
  function maybe_rearrange_weight (line 767) | def maybe_rearrange_weight(state_dict, prefix, local_metadata, strict, m...
  class Embedding8bit (line 788) | class Embedding8bit(nn.Embedding):
    method __init__ (line 808) | def __init__(self, num_embeddings, embedding_dim, device=None, dtype=N...
    method _save_to_state_dict (line 814) | def _save_to_state_dict(self, destination, prefix, keep_vars):
    method forward (line 817) | def forward(self, input: Tensor) -> Tensor:
  class Embedding4bit (line 835) | class Embedding4bit(nn.Embedding):
    method __init__ (line 856) | def __init__(
    method _forward_with_partial_dequantize (line 885) | def _forward_with_partial_dequantize(self, input: Tensor):
    method _save_to_state_dict (line 918) | def _save_to_state_dict(self, destination, prefix, keep_vars):
    method forward (line 921) | def forward(self, input: Tensor) -> Tensor:
  class EmbeddingFP4 (line 935) | class EmbeddingFP4(Embedding4bit):
    method __init__ (line 936) | def __init__(
  class EmbeddingNF4 (line 954) | class EmbeddingNF4(Embedding4bit):
    method __init__ (line 955) | def __init__(
  class Linear8bitLt (line 973) | class Linear8bitLt(nn.Linear):
    method __init__ (line 1005) | def __init__(
    method _save_to_state_dict (line 1050) | def _save_to_state_dict(self, destination, prefix, keep_vars):
    method _load_from_state_dict (line 1074) | def _load_from_state_dict(
    method init_8bit_state (line 1113) | def init_8bit_state(self):
    method to (line 1119) | def to(self, *args, **kwargs):
    method forward (line 1134) | def forward(self, x: torch.Tensor):
  class OutlierAwareLinear (line 1151) | class OutlierAwareLinear(nn.Linear):
    method __init__ (line 1152) | def __init__(self, input_features, output_features, bias=True, device=...
    method forward_with_outliers (line 1157) | def forward_with_outliers(self, x, outlier_idx):
    method quantize_weight (line 1160) | def quantize_weight(self, w, outlier_idx):
    method forward (line 1163) | def forward(self, x):

FILE: bitsandbytes/nn/parametrize.py
  class Bnb4bitParametrization (line 11) | class Bnb4bitParametrization(nn.Module):
    method __init__ (line 24) | def __init__(self, quant_state: F.QuantState):
    method forward (line 29) | def forward(self, quantized_param: torch.Tensor) -> torch.Tensor:
  function replace_parameter_4bit_prequantized (line 42) | def replace_parameter_4bit_prequantized(
  function replace_parameter_4bit (line 62) | def replace_parameter_4bit(
  function _disable_parametrization_cache (line 129) | def _disable_parametrization_cache(module: nn.Module, inputs: tuple[Any,...
  function _enable_parametrization_cache (line 135) | def _enable_parametrization_cache(module: nn.Module, inputs: tuple[Any, ...
  function _register_parametrization_hooks (line 139) | def _register_parametrization_hooks(module: nn.Module, param_name: str):
  function _parametrized_state_dict_post_hook (line 156) | def _parametrized_state_dict_post_hook(

FILE: bitsandbytes/optim/adagrad.py
  class Adagrad (line 8) | class Adagrad(Optimizer1State):
    method __init__ (line 9) | def __init__(
  class Adagrad8bit (line 67) | class Adagrad8bit(Optimizer1State):
    method __init__ (line 68) | def __init__(
  class Adagrad32bit (line 126) | class Adagrad32bit(Optimizer1State):
    method __init__ (line 127) | def __init__(

FILE: bitsandbytes/optim/adam.py
  class Adam (line 9) | class Adam(Optimizer2State):
    method __init__ (line 10) | def __init__(
  class Adam8bit (line 62) | class Adam8bit(Optimizer2State):
    method __init__ (line 63) | def __init__(
  class Adam32bit (line 126) | class Adam32bit(Optimizer2State):
    method __init__ (line 127) | def __init__(
  class PagedAdam (line 179) | class PagedAdam(Optimizer2State):
    method __init__ (line 180) | def __init__(
  class PagedAdam8bit (line 232) | class PagedAdam8bit(Optimizer2State):
    method __init__ (line 233) | def __init__(
  class PagedAdam32bit (line 296) | class PagedAdam32bit(Optimizer2State):
    method __init__ (line 297) | def __init__(

FILE: bitsandbytes/optim/adamw.py
  class AdamW (line 9) | class AdamW(Optimizer2State):
    method __init__ (line 10) | def __init__(
  class AdamW8bit (line 62) | class AdamW8bit(Optimizer2State):
    method __init__ (line 63) | def __init__(
  class AdamW32bit (line 126) | class AdamW32bit(Optimizer2State):
    method __init__ (line 127) | def __init__(
  class PagedAdamW (line 179) | class PagedAdamW(Optimizer2State):
    method __init__ (line 180) | def __init__(
  class PagedAdamW8bit (line 229) | class PagedAdamW8bit(Optimizer2State):
    method __init__ (line 230) | def __init__(
  class PagedAdamW32bit (line 290) | class PagedAdamW32bit(Optimizer2State):
    method __init__ (line 291) | def __init__(

FILE: bitsandbytes/optim/ademamix.py
  class _ReferenceAdEMAMix (line 11) | class _ReferenceAdEMAMix(torch.optim.Optimizer):
    method __init__ (line 16) | def __init__(
    method step (line 34) | def step(self, closure=None):
  class AdEMAMix (line 107) | class AdEMAMix(Optimizer2State):
    method __init__ (line 108) | def __init__(
    method init_state (line 139) | def init_state(self, group, p, gindex, pindex):
    method update_step (line 176) | def update_step(self, group, p, gindex, pindex):
    method _get_state_double_buffer (line 260) | def _get_state_double_buffer(self, p, dtype=torch.float32):
  class AdEMAMix8bit (line 270) | class AdEMAMix8bit(AdEMAMix):
    method __init__ (line 271) | def __init__(
  class PagedAdEMAMix8bit (line 299) | class PagedAdEMAMix8bit(AdEMAMix8bit):
    method __init__ (line 300) | def __init__(
  class PagedAdEMAMix (line 326) | class PagedAdEMAMix(AdEMAMix):
    method __init__ (line 327) | def __init__(
  class AdEMAMix32bit (line 355) | class AdEMAMix32bit(Optimizer2State):
    method __init__ (line 356) | def __init__(
  class PagedAdEMAMix32bit (line 386) | class PagedAdEMAMix32bit(AdEMAMix32bit):
    method __init__ (line 387) | def __init__(

FILE: bitsandbytes/optim/lamb.py
  class LAMB (line 8) | class LAMB(Optimizer2State):
    method __init__ (line 9) | def __init__(
  class LAMB8bit (line 67) | class LAMB8bit(Optimizer2State):
    method __init__ (line 68) | def __init__(
  class LAMB32bit (line 123) | class LAMB32bit(Optimizer2State):
    method __init__ (line 124) | def __init__(

FILE: bitsandbytes/optim/lars.py
  class LARS (line 11) | class LARS(Optimizer1State):
    method __init__ (line 12) | def __init__(
  class LARS8bit (line 66) | class LARS8bit(Optimizer1State):
    method __init__ (line 67) | def __init__(
  class LARS32bit (line 118) | class LARS32bit(Optimizer1State):
    method __init__ (line 119) | def __init__(
  class PytorchLARS (line 170) | class PytorchLARS(Optimizer):
    method __init__ (line 171) | def __init__(
    method __setstate__ (line 200) | def __setstate__(self, state):
    method step (line 206) | def step(self, closure=None):

FILE: bitsandbytes/optim/lion.py
  class Lion (line 8) | class Lion(Optimizer1State):
    method __init__ (line 9) | def __init__(
  class Lion8bit (line 55) | class Lion8bit(Optimizer1State):
    method __init__ (line 56) | def __init__(
  class Lion32bit (line 99) | class Lion32bit(Optimizer1State):
    method __init__ (line 100) | def __init__(
  class PagedLion (line 143) | class PagedLion(Optimizer1State):
    method __init__ (line 144) | def __init__(
  class PagedLion8bit (line 187) | class PagedLion8bit(Optimizer1State):
    method __init__ (line 188) | def __init__(
  class PagedLion32bit (line 228) | class PagedLion32bit(Optimizer1State):
    method __init__ (line 229) | def __init__(

FILE: bitsandbytes/optim/optimizer.py
  class MockArgs (line 16) | class MockArgs:
    method __init__ (line 17) | def __init__(self, initial_data):
  class GlobalOptimManager (line 22) | class GlobalOptimManager:
    method __init__ (line 29) | def __init__(self):
    method initialize (line 32) | def initialize(self):
    method get_instance (line 40) | def get_instance(cls):
    method register_parameters (line 46) | def register_parameters(self, params):
    method override_config (line 56) | def override_config(self, parameters, key=None, value=None, key_value_...
    method register_module_override (line 109) | def register_module_override(self, module, param_name, config):
  class Optimizer8bit (line 113) | class Optimizer8bit(torch.optim.Optimizer):
    method __init__ (line 116) | def __init__(self, params, defaults, optim_bits=32, is_paged=False):
    method fill_qmap (line 153) | def fill_qmap(self):
    method state_dict (line 157) | def state_dict(self):
    method __setstate__ (line 185) | def __setstate__(self, state):
    method load_state_dict (line 188) | def load_state_dict(self, state_dict, move_to_device=True):
    method to_gpu (line 269) | def to_gpu(self):
    method check_overrides (line 280) | def check_overrides(self):
    method step (line 300) | def step(self, closure=None):
    method get_config (line 337) | def get_config(self, gindex, pindex, group):
    method init_state (line 362) | def init_state(self, group, p, gindex, pindex):
    method update_step (line 365) | def update_step(self, group, p, gindex, pindex):
    method get_state_buffer (line 368) | def get_state_buffer(self, p, dtype=torch.float32):
    method prefetch_state (line 378) | def prefetch_state(self, p):
  class Optimizer2State (line 389) | class Optimizer2State(Optimizer8bit):
    method __init__ (line 390) | def __init__(
    method init_state (line 478) | def init_state(self, group, p, gindex, pindex):
    method update_step (line 521) | def update_step(self, group, p, gindex, pindex):
  class Optimizer1State (line 579) | class Optimizer1State(Optimizer8bit):
    method __init__ (line 580) | def __init__(
    method init_state (line 650) | def init_state(self, group, p, gindex, pindex):
    method update_step (line 687) | def update_step(self, group, p, gindex, pindex):

FILE: bitsandbytes/optim/rmsprop.py
  class RMSprop (line 8) | class RMSprop(Optimizer1State):
    method __init__ (line 9) | def __init__(
  class RMSprop8bit (line 64) | class RMSprop8bit(Optimizer1State):
    method __init__ (line 65) | def __init__(
  class RMSprop32bit (line 117) | class RMSprop32bit(Optimizer1State):
    method __init__ (line 118) | def __init__(

FILE: bitsandbytes/optim/sgd.py
  class SGD (line 8) | class SGD(Optimizer1State):
    method __init__ (line 9) | def __init__(
  class SGD8bit (line 59) | class SGD8bit(Optimizer1State):
    method __init__ (line 60) | def __init__(
  class SGD32bit (line 107) | class SGD32bit(Optimizer1State):
    method __init__ (line 108) | def __init__(

FILE: bitsandbytes/utils.py
  function outlier_hook (line 11) | def outlier_hook(module, input):
  class OutlierTracer (line 44) | class OutlierTracer:
    method __init__ (line 47) | def __init__(self):
    method initialize (line 50) | def initialize(self, model):
    method is_initialized (line 63) | def is_initialized(self):
    method get_hvalue (line 66) | def get_hvalue(self, weight):
    method get_outliers (line 69) | def get_outliers(self, weight):
    method get_instance (line 80) | def get_instance(cls):
  function find_outlier_dims (line 86) | def find_outlier_dims(weight, reduction_dim=0, zscore=4.0, topk=None, rd...
  function execute_and_return (line 104) | def execute_and_return(command_string: str) -> tuple[str, str]:
  function replace_linear (line 121) | def replace_linear(
  function pack_dict_to_tensor (line 166) | def pack_dict_to_tensor(source_dict):
  function unpack_tensor_to_dict (line 183) | def unpack_tensor_to_dict(tensor_data):
  function sync_gpu (line 204) | def sync_gpu(t: torch.Tensor):

FILE: csrc/common.h
  type DataType_t (line 3) | typedef enum DataType_t {

FILE: csrc/cpu_ops.cpp
  function lookup_code_index (line 19) | inline unsigned char lookup_code_index(const float* codebook, float valu...
  function cvt_fp32_to_fp16 (line 42) | inline __m256i cvt_fp32_to_fp16(const __m512 src) {
  function cvt_fp32_to_bf16 (line 46) | inline __m256i cvt_fp32_to_bf16(const __m512 src) {
  function set_nf4_lut (line 70) | static inline __m512 set_nf4_lut() {
  function set_fp4_lut (line 78) | static inline __m512 set_fp4_lut() {
  function dequantizeBlockwise4bitCpu (line 89) | void dequantizeBlockwise4bitCpu(
  function dequantizeBlockwise8bitCpu (line 183) | void dequantizeBlockwise8bitCpu(
  function quantize_cpu (line 207) | void quantize_cpu(float* code, float* A, float* absmax, unsigned char* o...
  type tinygemm_kernel_nn (line 267) | struct tinygemm_kernel_nn {
    method apply (line 268) | static inline void apply(
  type tinygemm_kernel_nn<bf16_t, BLOCK_M, BLOCK_N, DATA_TYPE> (line 276) | struct tinygemm_kernel_nn<bf16_t, BLOCK_M, BLOCK_N, DATA_TYPE> {
    method apply (line 277) | static inline void apply(
  function tinygemm_kernel (line 389) | void tinygemm_kernel(
  function gemv_4bit_inference (line 446) | void gemv_4bit_inference(

FILE: csrc/cpu_ops.h
  function block_size_m (line 24) | constexpr int block_size_m() { return 2 * TILE_M; }
  function block_size_n (line 26) | constexpr int block_size_n() { return 2 * TILE_N; }
  function get_cache_blocks (line 28) | int get_cache_blocks(int chunk_size) {
  function operator() (line 42) | void operator()(const Func& f, Args... args) const {
  type Unroll (line 48) | struct Unroll
  function operator() (line 49) | void operator()(const Func& f, Args... args) const {
  function get_max_threads (line 58) | inline int get_max_threads() {
  function adjust_num_threads (line 67) | inline int adjust_num_threads(int m) {
  function parallel_2d (line 74) | void parallel_2d(int m, int n, const func_t& f) {
  type fp16_t (line 124) | struct fp16_t {
  type bf16_t (line 128) | struct bf16_t {
  function bf16_to_float (line 139) | static float bf16_to_float(uint16_t bf16) {
  function float_to_fp16 (line 146) | static inline fp16_t float_to_fp16(float x) {
  function dDequantizeFP4 (line 188) | inline float dDequantizeFP4(unsigned char val) {
  function dDequantizeNF4 (line 230) | inline float dDequantizeNF4(unsigned char val) {
  function has_avx512f (line 292) | static inline bool has_avx512f() {
  function has_avx512bf16 (line 302) | static inline bool has_avx512bf16() {
  function has_avx512f (line 312) | static inline bool has_avx512f() {
  function has_avx512bf16 (line 318) | static inline bool has_avx512bf16() {

FILE: csrc/pythonInterface.cpp
  function gemm_4bit_inference_naive_fp16 (line 43) | void gemm_4bit_inference_naive_fp16(
  function gemm_4bit_inference_naive_bf16 (line 50) | void gemm_4bit_inference_naive_bf16(
  function gemm_4bit_inference_naive_fp32 (line 59) | void gemm_4bit_inference_naive_fp32(
  function quantizeBlockwise_fp16 (line 133) | void quantizeBlockwise_fp16(float* code, half* A, float* absmax, unsigne...
  function quantizeBlockwise_fp16_fp4 (line 137) | void quantizeBlockwise_fp16_fp4(float* code, half* A, float* absmax, uns...
  function quantizeBlockwise_fp16_nf4 (line 141) | void quantizeBlockwise_fp16_nf4(float* code, half* A, float* absmax, uns...
  function quantizeBlockwise_bf16 (line 145) | void quantizeBlockwise_bf16(
  function quantizeBlockwise_bf16_fp4 (line 151) | void quantizeBlockwise_bf16_fp4(
  function quantizeBlockwise_bf16_nf4 (line 157) | void quantizeBlockwise_bf16_nf4(
  function quantizeBlockwise_fp32 (line 163) | void quantizeBlockwise_fp32(float* code, float* A, float* absmax, unsign...
  function quantizeBlockwise_fp32_fp4 (line 167) | void quantizeBlockwise_fp32_fp4(float* code, float* A, float* absmax, un...
  function quantizeBlockwise_fp32_nf4 (line 171) | void quantizeBlockwise_fp32_nf4(float* code, float* A, float* absmax, un...
  function dequantizeBlockwise_fp16 (line 175) | void dequantizeBlockwise_fp16(
  function dequantizeBlockwise_fp16_fp4 (line 181) | void dequantizeBlockwise_fp16_fp4(
  function dequantizeBlockwise_fp16_nf4 (line 187) | void dequantizeBlockwise_fp16_nf4(
  function dequantizeBlockwise_fp32 (line 193) | void dequantizeBlockwise_fp32(
  function dequantizeBlockwise_fp32_fp4 (line 199) | void dequantizeBlockwise_fp32_fp4(
  function dequantizeBlockwise_fp32_nf4 (line 205) | void dequantizeBlockwise_fp32_nf4(
  function dequantizeBlockwise_bf16 (line 211) | void dequantizeBlockwise_bf16(
  function dequantizeBlockwise_bf16_fp4 (line 217) | void dequantizeBlockwise_bf16_fp4(
  function dequantizeBlockwise_bf16_nf4 (line 223) | void dequantizeBlockwise_bf16_nf4(
  function igemmlt_32 (line 229) | int igemmlt_32(
  function igemmlt_8 (line 236) | int igemmlt_8(
  function igemmlt_8_rowscale (line 243) | int igemmlt_8_rowscale(
  function dequantizeBlockwise_fp16 (line 254) | void dequantizeBlockwise_fp16(
  function dequantizeBlockwise_fp16_fp4 (line 260) | void dequantizeBlockwise_fp16_fp4(
  function dequantizeBlockwise_fp16_nf4 (line 266) | void dequantizeBlockwise_fp16_nf4(
  function dequantizeBlockwise_fp32 (line 272) | void dequantizeBlockwise_fp32(
  function dequantizeBlockwise_fp32_fp4 (line 278) | void dequantizeBlockwise_fp32_fp4(
  function dequantizeBlockwise_fp32_nf4 (line 284) | void dequantizeBlockwise_fp32_nf4(
  function dequantizeBlockwise_bf16 (line 290) | void dequantizeBlockwise_bf16(
  function dequantizeBlockwise_bf16_fp4 (line 297) | void dequantizeBlockwise_bf16_fp4(
  function dequantizeBlockwise_bf16_nf4 (line 304) | void dequantizeBlockwise_bf16_nf4(
  function gemv_4bit_inference_fp16 (line 311) | void gemv_4bit_inference_fp16(
  function gemv_4bit_inference_bf16 (line 318) | void gemv_4bit_inference_bf16(
  function gemv_4bit_inference_fp32 (line 327) | void gemv_4bit_inference_fp32(
  function cdequantize_blockwise_fp16_fp4 (line 352) | void cdequantize_blockwise_fp16_fp4(
  function cdequantize_blockwise_fp16 (line 358) | void cdequantize_blockwise_fp16(
  function cdequantize_blockwise_fp16_nf4 (line 364) | void cdequantize_blockwise_fp16_nf4(
  function cquantize_blockwise_fp16 (line 370) | void cquantize_blockwise_fp16(float* code, half* A, float* absmax, unsig...
  function cquantize_blockwise_fp16_fp4 (line 374) | void cquantize_blockwise_fp16_fp4(float* code, half* A, float* absmax, u...
  function cquantize_blockwise_fp16_nf4 (line 378) | void cquantize_blockwise_fp16_nf4(float* code, half* A, float* absmax, u...
  function cquantize_blockwise_fp32 (line 382) | void cquantize_blockwise_fp32(float* code, float* A, float* absmax, unsi...
  function cquantize_blockwise_fp32_fp4 (line 386) | void cquantize_blockwise_fp32_fp4(
  function cquantize_blockwise_fp32_nf4 (line 392) | void cquantize_blockwise_fp32_nf4(
  function cdequantize_blockwise_fp32 (line 398) | void cdequantize_blockwise_fp32(
  function cdequantize_blockwise_fp32_fp4 (line 404) | void cdequantize_blockwise_fp32_fp4(
  function cdequantize_blockwise_fp32_nf4 (line 410) | void cdequantize_blockwise_fp32_nf4(
  function cquantize_blockwise_bf16 (line 416) | void cquantize_blockwise_bf16(
  function cquantize_blockwise_bf16_fp4 (line 422) | void cquantize_blockwise_bf16_fp4(
  function cquantize_blockwise_bf16_nf4 (line 428) | void cquantize_blockwise_bf16_nf4(
  function cdequantize_blockwise_bf16 (line 434) | void cdequantize_blockwise_bf16(
  function cdequantize_blockwise_bf16_fp4 (line 440) | void cdequantize_blockwise_bf16_fp4(
  function cdequantize_blockwise_bf16_nf4 (line 446) | void cdequantize_blockwise_bf16_nf4(
  function cigemm (line 512) | cigemm(
  function cbatched_igemm (line 519) | void cbatched_igemm(
  function get_context (line 528) | Context* get_context() { return new Context(); }
  function cigemmlt_32 (line 530) | int cigemmlt_32(
  function cigemmlt_8 (line 537) | int cigemmlt_8(
  function cigemmlt_8_rowscale (line 544) | int cigemmlt_8_rowscale(
  function cdequant_mm_int32_fp16 (line 551) | void cdequant_mm_int32_fp16(
  function cint8_vector_quant (line 557) | void cint8_vector_quant(
  function cprefetch (line 571) | void cprefetch(void* ptr, size_t bytes, int device) {
  function cgemm_4bit_inference_naive_fp16 (line 600) | void cgemm_4bit_inference_naive_fp16(
  function cgemm_4bit_inference_naive_bf16 (line 607) | void cgemm_4bit_inference_naive_bf16(
  function cgemm_4bit_inference_naive_fp32 (line 614) | void cgemm_4bit_inference_naive_fp32(
  function cdequantize_blockwise_fp16_fp4 (line 625) | void cdequantize_blockwise_fp16_fp4(
  function cdequantize_blockwise_fp16 (line 631) | void cdequantize_blockwise_fp16(
  function cdequantize_blockwise_fp16_nf4 (line 637) | void cdequantize_blockwise_fp16_nf4(
  function cdequantize_blockwise_fp32 (line 643) | void cdequantize_blockwise_fp32(
  function cdequantize_blockwise_fp32_fp4 (line 649) | void cdequantize_blockwise_fp32_fp4(
  function cdequantize_blockwise_fp32_nf4 (line 655) | void cdequantize_blockwise_fp32_nf4(
  function cdequantize_blockwise_bf16 (line 661) | void cdequantize_blockwise_bf16(
  function cdequantize_blockwise_bf16_fp4 (line 668) | void cdequantize_blockwise_bf16_fp4(
  function cdequantize_blockwise_bf16_nf4 (line 675) | void cdequantize_blockwise_bf16_nf4(
  function cgemv_4bit_inference_fp16 (line 682) | void cgemv_4bit_inference_fp16(
  function cgemv_4bit_inference_bf16 (line 689) | void cgemv_4bit_inference_bf16(
  function cgemv_4bit_inference_fp32 (line 696) | void cgemv_4bit_inference_fp32(
  function cprefetch (line 723) | void cprefetch(void* ptr, size_t bytes, int device) {
  function cfill_fp32 (line 736) | void cfill_fp32(float* A, float* B, float value, long n) {
  function cfill_uint8 (line 745) | void cfill_uint8(unsigned char* A, unsigned char* B, unsigned char value...
  function cquantize_blockwise_cpu_fp32 (line 754) | void cquantize_blockwise_cpu_fp32(
  function cdequantize_blockwise_cpu_fp32 (line 760) | void cdequantize_blockwise_cpu_fp32(
  function cdequantize_blockwise_cpu_bf16 (line 766) | void cdequantize_blockwise_cpu_bf16(
  function cdequantize_blockwise_cpu_fp16 (line 772) | void cdequantize_blockwise_cpu_fp16(
  function cdequantize_blockwise_cpu_fp4_fp32 (line 778) | void cdequantize_blockwise_cpu_fp4_fp32(
  function cdequantize_blockwise_cpu_fp4_bf16 (line 784) | void cdequantize_blockwise_cpu_fp4_bf16(
  function cdequantize_blockwise_cpu_fp4_fp16 (line 790) | void cdequantize_blockwise_cpu_fp4_fp16(
  function cdequantize_blockwise_cpu_nf4_fp32 (line 796) | void cdequantize_blockwise_cpu_nf4_fp32(
  function cdequantize_blockwise_cpu_nf4_bf16 (line 802) | void cdequantize_blockwise_cpu_nf4_bf16(
  function cdequantize_blockwise_cpu_nf4_fp16 (line 808) | void cdequantize_blockwise_cpu_nf4_fp16(
  function gemv_4bit_inference_cpu_fp4_bf16 (line 815) | void gemv_4bit_inference_cpu_fp4_bf16(
  function gemv_4bit_inference_cpu_nf4_bf16 (line 822) | void gemv_4bit_inference_cpu_nf4_bf16(
  function has_avx512f_cpu (line 830) | bool has_avx512f_cpu() { return has_avx512f(); }
  function has_avx512bf16_cpu (line 832) | bool has_avx512bf16_cpu() { return has_avx512bf16(); }

FILE: csrc/xpu_kernels.cpp
  function dDequantizeFP4 (line 8) | inline float dDequantizeFP4(unsigned char val) {
  function dDequantizeNF4 (line 50) | inline float dDequantizeNF4(unsigned char val) {
  function kDequantizeBlockwise (line 97) | SYCL_EXTERNAL void kDequantizeBlockwise<T, TILE_SIZE, NUM_PER_TH, DATA_T...
  function SYCL_EXTERNAL (line 175) | SYCL_EXTERNAL void
  class kDequantizeBlockwise<sycl::half, 512, 4, FP4> (line 268) | class kDequantizeBlockwise<sycl::half, 512, 4, FP4>
  class kDequantizeBlockwise<sycl::half, 512, 4, General8bit> (line 269) | class kDequantizeBlockwise<sycl::half, 512, 4, General8bit>
  class kDequantizeBlockwise<sycl::half, 512, 4, NF4> (line 270) | class kDequantizeBlockwise<sycl::half, 512, 4, NF4>
  class kDequantizeBlockwise<float, 512, 4, FP4> (line 272) | class kDequantizeBlockwise<float, 512, 4, FP4>
  class kDequantizeBlockwise<float, 512, 4, General8bit> (line 273) | class kDequantizeBlockwise<float, 512, 4, General8bit>
  class kDequantizeBlockwise<float, 512, 4, NF4> (line 274) | class kDequantizeBlockwise<float, 512, 4, NF4>
  class kDequantizeBlockwise<sycl::ext::oneapi::bfloat16, 512, 4, FP4> (line 276) | class kDequantizeBlockwise<sycl::ext::oneapi::bfloat16, 512, 4, FP4>
  class kDequantizeBlockwise<sycl::ext::oneapi::bfloat16, 512, 4, General8bit> (line 277) | class kDequantizeBlockwise<sycl::ext::oneapi::bfloat16, 512, 4, General8...
  class kDequantizeBlockwise<sycl::ext::oneapi::bfloat16, 512, 4, NF4> (line 278) | class kDequantizeBlockwise<sycl::ext::oneapi::bfloat16, 512, 4, NF4>
  class kgemv_4bit_inference<sycl::half, 128, 4, 32, 16> (line 280) | class kgemv_4bit_inference<sycl::half, 128, 4, 32, 16>
  class kgemv_4bit_inference<sycl::ext::oneapi::bfloat16, 128, 4, 32, 16> (line 281) | class kgemv_4bit_inference<sycl::ext::oneapi::bfloat16, 128, 4, 32, 16>
  class kgemv_4bit_inference<float, 128, 4, 32, 32> (line 282) | class kgemv_4bit_inference<float, 128, 4, 32, 32>

FILE: csrc/xpu_ops.cpp
  function dequantizeBlockwise (line 5) | void dequantizeBlockwise(
  function gemv_4bit_inference (line 34) | void gemv_4bit_inference(

FILE: csrc/xpu_ops.h
  function sycl_kernel_submit (line 16) | inline void sycl_kernel_submit(sycl::nd_range<dim> range, sycl::queue q,...
  function sycl_comp_kernel_submit (line 23) | inline void sycl_comp_kernel_submit(sycl::nd_range<dim> range, sycl::que...

FILE: examples/xpu/benchmark_paged_memory.py
  function get_args (line 22) | def get_args():
  function get_torch_dtype (line 37) | def get_torch_dtype(name):
  function get_accelerator (line 41) | def get_accelerator(device_type):
  function count_params (line 48) | def count_params(model):
  function create_model (line 52) | def create_model(args):
  function make_batch (line 67) | def make_batch(args):
  function cleanup (line 74) | def cleanup(device_type):
  function measure_training (line 82) | def measure_training(args, optimizer_name, OptClass):
  function fmt_mb (line 142) | def fmt_mb(nbytes):
  function fmt_gb (line 146) | def fmt_gb(nbytes):
  function main (line 150) | def main():

FILE: examples/xpu/paged_xpu_training.py
  function get_args (line 21) | def get_args():
  function format_alpaca (line 59) | def format_alpaca(example):
  function prepare_data (line 65) | def prepare_data(tokenizer, dataset_name, max_length, num_samples=200):
  function collate_fn (line 80) | def collate_fn(batch):
  function create_optimizer (line 84) | def create_optimizer(model, name, lr):
  function train_loop (line 107) | def train_loop(model, optimizer, dataloader, steps, log_interval, device):
  function get_torch_dtype (line 142) | def get_torch_dtype(name):
  function run_single (line 146) | def run_single(args):
  function run_with_trainer (line 184) | def run_with_trainer(args):
  function run_compare (line 250) | def run_compare(args):
  function main (line 290) | def main():

FILE: install_cuda.py
  function install_cuda (line 18) | def install_cuda(version, base_path, download_path):
  function main (line 67) | def main():

FILE: scripts/stale.py
  function main (line 30) | def main():

FILE: setup.py
  class BinaryDistribution (line 15) | class BinaryDistribution(Distribution):
    method has_ext_modules (line 16) | def has_ext_modules(self):
  class ExtBuildPy (line 20) | class ExtBuildPy(build_py):
    method run (line 21) | def run(self):

FILE: tests/conftest.py
  function _set_seed (line 9) | def _set_seed():
  function pytest_runtest_call (line 17) | def pytest_runtest_call(item):
  function pytest_runtest_teardown (line 36) | def pytest_runtest_teardown(item, nextitem):
  function requires_cuda (line 48) | def requires_cuda() -> bool:

FILE: tests/fsdp_state_dict_save.py
  class SimpleQLoRAModel (line 23) | class SimpleQLoRAModel(nn.Module):
    method __init__ (line 26) | def __init__(self, quant_type="nf4"):
    method forward (line 31) | def forward(self, x):
  function main (line 35) | def main():

FILE: tests/helpers.py
  function get_available_devices (line 21) | def get_available_devices(no_cpu=False):
  function torch_save_to_buffer (line 53) | def torch_save_to_buffer(obj):
  function torch_load_from_buffer (line 60) | def torch_load_from_buffer(buffer):
  function get_test_dims (line 67) | def get_test_dims(min: int, max: int, *, n: int) -> list[int]:
  function format_with_label (line 71) | def format_with_label(label: str, value: Any) -> str:
  function id_formatter (line 83) | def id_formatter(label: str):
  function describe_dtype (line 102) | def describe_dtype(dtype: torch.dtype) -> str:
  function is_supported_on_hpu (line 106) | def is_supported_on_hpu(

FILE: tests/test_autograd.py
  function test_matmullt (line 33) | def test_matmullt(
  function test_matmul_4bit (line 166) | def test_matmul_4bit(

FILE: tests/test_cuda_setup_evaluator.py
  function cuda120_spec (line 8) | def cuda120_spec() -> CUDASpecs:
  function test_get_cuda_bnb_library_path (line 17) | def test_get_cuda_bnb_library_path(monkeypatch, cuda120_spec):
  function test_get_cuda_bnb_library_path_override (line 23) | def test_get_cuda_bnb_library_path_override(monkeypatch, cuda120_spec, c...
  function rocm70_spec (line 31) | def rocm70_spec() -> CUDASpecs:
  function test_get_rocm_bnb_library_path (line 40) | def test_get_rocm_bnb_library_path(monkeypatch, rocm70_spec):
  function test_get_rocm_bnb_library_path_override (line 48) | def test_get_rocm_bnb_library_path_override(monkeypatch, rocm70_spec, ca...
  function test_get_rocm_bnb_library_path_rejects_cuda_override (line 57) | def test_get_rocm_bnb_library_path_rejects_cuda_override(monkeypatch, ro...
  function test_get_rocm_bnb_library_path_rocm_override_takes_priority (line 66) | def test_get_rocm_bnb_library_path_rocm_override_takes_priority(monkeypa...

FILE: tests/test_functional.py
  function assert_all_approx_close (line 27) | def assert_all_approx_close(a, b, rtol=1e-3, atol=1e-3, count=0, throw=T...
  class FFN (line 38) | class FFN(torch.nn.Module):
    method __init__ (line 39) | def __init__(self, input_features, hidden_size, bias=True):
    method forward (line 48) | def forward(self, x):
  class Timer (line 54) | class Timer:
    method __init__ (line 55) | def __init__(self):
    method tick (line 60) | def tick(self, name="default"):
    method tock (line 68) | def tock(self, name="default", evict=True, print_ms=True):
    method reset (line 85) | def reset(self):
  class Test8BitBlockwiseQuantizeFunctional (line 92) | class Test8BitBlockwiseQuantizeFunctional:
    method test_dynamic_blockwise_quantization (line 101) | def test_dynamic_blockwise_quantization(self, device, dtype, nested, b...
    method test_dynamic_blockwise_quantization_large (line 163) | def test_dynamic_blockwise_quantization_large(self, device, dtype, blo...
    method test_blockwise_cpu_large (line 190) | def test_blockwise_cpu_large(self, hidden, blocksize):
    method test_few_bit_quant (line 213) | def test_few_bit_quant(self, device, bits, method):
    method test_fp8_quant (line 265) | def test_fp8_quant(self, device):
    method test_bench_dequantization (line 320) | def test_bench_dequantization(self):
  function test_stable_embedding (line 337) | def test_stable_embedding():
  function quant (line 342) | def quant(x):
  function dequant (line 348) | def dequant(c, maxC):
  function mm_dequant (line 352) | def mm_dequant(maxA, maxB, C):
  function quant_multi (line 356) | def quant_multi(x, dim):
  function quant_multi_chunk (line 363) | def quant_multi_chunk(x, dim, chunk_size=32):
  function mean (line 379) | def mean(xx):
  class TestIGEMMFunctional (line 396) | class TestIGEMMFunctional:
    method test_approx_igemm (line 401) | def test_approx_igemm(self, dim1, dim2, quant_methods, batched):
    method test_igemm (line 440) | def test_igemm(self, hidden_dim, batch_dim, transpose, seq_dim):
    method test_dim3_igemm (line 494) | def test_dim3_igemm(self, seq_dim, hidden_dim, batch_dim):
    method test_minmax_igemm (line 511) | def test_minmax_igemm(self, seq_dim, hidden_dim, batch_dim, transpose):
    method test_ibmm (line 588) | def test_ibmm(self, dim1, dim2, dim3, dim4, transpose):
  class TestLLMInt8Functional (line 616) | class TestLLMInt8Functional:
    method vectorwise_mm_dequant (line 618) | def vectorwise_mm_dequant(xq, S1, S2, dtype=torch.half):
    method vectorwise_quant (line 635) | def vectorwise_quant(x, dim=1):
    method test_int8_linear_matmul (line 648) | def test_int8_linear_matmul(self, device, dim1, dim2, dim3, dim4, dims...
    method test_int8_linear_matmul_half (line 666) | def test_int8_linear_matmul_half(self, device, dim1, dim2, dim3, dim4,...
    method test_dequant_mm (line 689) | def test_dequant_mm(self, device, dim1, dim4, dims, has_bias):
    method test_int8_double_quant (line 728) | def test_int8_double_quant(self, dim1, dim2):
    method test_integrated_int8_linear_matmul (line 772) | def test_integrated_int8_linear_matmul(self, device, dim1, dim4, inner):
    method test_coo_double_quant (line 805) | def test_coo_double_quant(self, device, dim1, dim2):
    method test_coo_int8_vectorwise_quant (line 825) | def test_coo_int8_vectorwise_quant(self, device, dim1, dim2):
  class TestQuantize4BitFunctional (line 839) | class TestQuantize4BitFunctional:
    method test_4bit_quant (line 847) | def test_4bit_quant(self, device, dtype, quant_type, blocksize):
    method test_4bit_compressed_stats (line 930) | def test_4bit_compressed_stats(self, device, quant_type, blocksize, dt...
    method test_4bit_quant_large (line 966) | def test_4bit_quant_large(self, device, dtype, quant_type, blocksize):
    method test_bench_4bit_dequant (line 996) | def test_bench_4bit_dequant(self, quant_type):
    method test_gemv_4bit (line 1031) | def test_gemv_4bit(self, device, dim, dtype, storage_type, double_quan...
    method test_gemv_eye_4bit (line 1179) | def test_gemv_eye_4bit(self, device, storage_type, dtype):
  function test_normal_map_tree (line 1211) | def test_normal_map_tree():

FILE: tests/test_generation.py
  function get_4bit_config (line 12) | def get_4bit_config():
  function get_model_and_tokenizer (line 24) | def get_model_and_tokenizer(config):
  function get_prompt_for_generation_eval (line 44) | def get_prompt_for_generation_eval(text, add_roles=True):
  function generate (line 56) | def generate(model, tokenizer, text, generation_config, prompt_func=get_...
  function model_and_tokenizer (line 68) | def model_and_tokenizer(request):
  function test_pi (line 78) | def test_pi(requires_cuda, model_and_tokenizer, inference_kernel, DQ, dt...

FILE: tests/test_linear4bit.py
  function test_linear_serialization (line 39) | def test_linear_serialization(
  function test_copy_param (line 199) | def test_copy_param(device, quant_type, blocksize, compress_statistics):
  function test_params4bit_torch_chunk_split (line 219) | def test_params4bit_torch_chunk_split(device, quant_type):
  function test_quant_storage_shard_roundtrip (line 259) | def test_quant_storage_shard_roundtrip(device, quant_type, quant_storage):
  function test_deepcopy_param (line 290) | def test_deepcopy_param(device, quant_type, blocksize, compress_statisti...
  function test_params4bit_real_serialization (line 319) | def test_params4bit_real_serialization(device, quant_type, blocksize, co...
  function test_linear4bit_torch_compile (line 362) | def test_linear4bit_torch_compile(device, quant_type, compute_dtype, com...
  function test_params4bit_quant_state_attr_access (line 440) | def test_params4bit_quant_state_attr_access(device, quant_type, compress...
  function test_fsdp_state_dict_save_4bit (line 508) | def test_fsdp_state_dict_save_4bit():

FILE: tests/test_linear8bitlt.py
  function test_linear_no_igemmlt (line 26) | def test_linear_no_igemmlt(device):
  function test_linear_serialization (line 74) | def test_linear_serialization(
  function linear8bit (line 176) | def linear8bit(requires_cuda):
  function test_linear8bit_copy_param (line 195) | def test_linear8bit_copy_param(linear8bit):
  function test_linear8bit_deepcopy_param (line 202) | def test_linear8bit_deepcopy_param(linear8bit):
  function test_linear8bit_serialization (line 217) | def test_linear8bit_serialization(linear8bit):
  function test_linear8bitlt_torch_compile (line 240) | def test_linear8bitlt_torch_compile(device, threshold, bias, fullgraph, ...
  function test_linear8bitlt_device_movement (line 305) | def test_linear8bitlt_device_movement(device):

FILE: tests/test_modules.py
  function caplog_at_level (line 14) | def caplog_at_level(caplog, level, logger_name):
  class MockArgs (line 19) | class MockArgs:
    method __init__ (line 20) | def __init__(self, initial_data):
  class MLP8bit (line 25) | class MLP8bit(torch.nn.Module):
    method __init__ (line 26) | def __init__(self, dim1, dim2, has_fp16_weights=True, threshold=0.0):
    method forward (line 41) | def forward(self, x):
  function get_args (line 47) | def get_args():
  function assert_all_approx_close (line 55) | def assert_all_approx_close(a, b, atol=1e-8, rtol=1e-5, count=10):
  function test_linear8bitlt_inference (line 65) | def test_linear8bitlt_inference(device, threshold):
  function test_linear8bitlt_accumulated_gradient (line 80) | def test_linear8bitlt_accumulated_gradient(device):
  function test_linear8bitlt_no_fp16_weights (line 127) | def test_linear8bitlt_no_fp16_weights(device, threshold):
  function test_linear_kbit_fp32_bias (line 252) | def test_linear_kbit_fp32_bias(device, module):
  function test_kbit_backprop (line 291) | def test_kbit_backprop(device, module, dtype):
  function test_embedding_lossless (line 373) | def test_embedding_lossless(device, embedding_class, input_shape, embedd...
  function test_embedding_error (line 424) | def test_embedding_error(device, embedding_class, input_shape, embedding...
  function test_4bit_linear_warnings (line 464) | def test_4bit_linear_warnings(device, caplog):
  function test_4bit_embedding_warnings (line 484) | def test_4bit_embedding_warnings(device, caplog):
  function test_4bit_embedding_weight_fsdp_fix (line 498) | def test_4bit_embedding_weight_fsdp_fix(requires_cuda):
  function test_4bit_linear_weight_fsdp_fix (line 515) | def test_4bit_linear_weight_fsdp_fix(requires_cuda):
  function test_embedding_not_implemented_error (line 532) | def test_embedding_not_implemented_error():

FILE: tests/test_ops.py
  class TestLLMInt8Ops (line 17) | class TestLLMInt8Ops:
    method test_int8_linear_matmul (line 19) | def test_int8_linear_matmul(self, device):
    method test_int8_linear_matmul_out (line 31) | def test_int8_linear_matmul_out(self, device):
    method test_int8_vectorwise_quant (line 46) | def test_int8_vectorwise_quant(self, threshold, device):
    method test_int8_mm_dequant (line 71) | def test_int8_mm_dequant(self, device):
    method test_int8_scaled_mm (line 86) | def test_int8_scaled_mm(self, device, dtype, has_bias):
  class TestInt8BlockwiseQuantOps (line 101) | class TestInt8BlockwiseQuantOps:
    method test_quantize_blockwise (line 105) | def test_quantize_blockwise(self, device, dtype, blocksize):
    method test_dequantize_blockwise (line 129) | def test_dequantize_blockwise(self, device, dtype, blocksize):
  class Test4bitBlockwiseQuantOps (line 149) | class Test4bitBlockwiseQuantOps:
    method test_quantize_4bit (line 155) | def test_quantize_4bit(self, device, dtype, storage_dtype, quant_type,...
    method test_quantize_4bit_not_divisible_by_blocksize (line 178) | def test_quantize_4bit_not_divisible_by_blocksize(self, device, dtype,...
    method test_dequantize_4bit (line 205) | def test_dequantize_4bit(self, device, dtype, storage_dtype, quant_typ...
    method test_gemv_4bit (line 239) | def test_gemv_4bit(self, device, dtype, storage_dtype, quant_type, blo...
  class TestNonContiguousInputs (line 275) | class TestNonContiguousInputs:
    method test_quantize_blockwise_non_contiguous (line 281) | def test_quantize_blockwise_non_contiguous(self, device, dtype, blocks...
    method test_dequantize_blockwise_non_contiguous (line 303) | def test_dequantize_blockwise_non_contiguous(self, device, dtype, bloc...
    method test_quantize_4bit_non_contiguous (line 334) | def test_quantize_4bit_non_contiguous(self, device, dtype, quant_type,...
    method test_quantize_4bit_roundtrip_non_contiguous (line 356) | def test_quantize_4bit_roundtrip_non_contiguous(self, device, dtype, q...

FILE: tests/test_optim.py
  function assert_most_approx_close (line 22) | def assert_most_approx_close(a, b, rtol=1e-3, atol=1e-3, max_error_count...
  function get_temp_dir (line 30) | def get_temp_dir():
  function rm_path (line 36) | def rm_path(path):
  function test_optimizer32bit (line 181) | def test_optimizer32bit(dim1, dim2, gtype, optim_name, device):
  function test_global_config (line 265) | def test_global_config(dim1, dim2, gtype, device):
  function test_override_config_after_register (line 311) | def test_override_config_after_register(device):
  function test_optimizer8bit (line 358) | def test_optimizer8bit(dim1, dim2, gtype, optim_name, device):
  function test_benchmark_blockwise (line 520) | def test_benchmark_blockwise(dim1, dim2, gtype, optim_name, device):
  function test_ademamix_state_dict_no_nan (line 561) | def test_ademamix_state_dict_no_nan(optim_name, optim_factory, device):

FILE: tests/test_parametrize.py
  class ParametrizeTestModule (line 20) | class ParametrizeTestModule(nn.Module):
    method __init__ (line 23) | def __init__(self, device="cpu", dtype=torch.float32):
  function test_replace_parameter_4bit (line 40) | def test_replace_parameter_4bit(device, dtype, quant_type, compress_stat...
  function test_moe_parameter_shape (line 97) | def test_moe_parameter_shape(device, dtype):
  function test_prequantized_replacement (line 143) | def test_prequantized_replacement(device, dtype, quant_type):
  function test_state_dict_functionality (line 174) | def test_state_dict_functionality(device, dtype, quant_type, compress_st...
  function test_moe_realistic_forward (line 206) | def test_moe_realistic_forward(device, dtype):
  function test_error_conditions (line 249) | def test_error_conditions():
  function test_quant_state_preservation (line 272) | def test_quant_state_preservation(device, dtype):
  function test_multiple_parameters (line 306) | def test_multiple_parameters(device, dtype):
  function test_different_blocksizes (line 340) | def test_different_blocksizes(device, dtype, blocksize):
  function test_parametrization_forward_method (line 376) | def test_parametrization_forward_method():
  function test_gradient_behavior (line 415) | def test_gradient_behavior(device, dtype):

Condensed preview: 178 files (1,583K chars in full), each showing path, character count, and a content snippet.
[
  {
    "path": ".clang-format",
    "chars": 943,
    "preview": "---\nBasedOnStyle: LLVM\nAlignAfterOpenBracket: BlockIndent\nBinPackArguments: true\nBinPackParameters: true\nBracedInitializ"
  },
  {
    "path": ".editorconfig",
    "chars": 64,
    "preview": "[*]\ntrim_trailing_whitespace = true\ninsert_final_newline = true\n"
  },
  {
    "path": ".git-blame-ignore-revs",
    "chars": 584,
    "preview": "# ran black and isort for coherent code formatting\nbfa0e33294f2b1dc25e65a33be2397f989824298\n\n# reran black with lineleng"
  },
  {
    "path": ".gitattributes",
    "chars": 20,
    "preview": "*.bat text eol=crlf\n"
  },
  {
    "path": ".github/FUNDING.yml",
    "chars": 30,
    "preview": "open_collective: bitsandbytes\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug-report.yml",
    "chars": 978,
    "preview": "name: \"\\U0001F41B Bug Report\"\ndescription: Submit a bug report to help us improve bitsandbytes\nbody:\n  - type: textarea\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature-request.yml",
    "chars": 788,
    "preview": "name: \"\\U0001F680 Feature request\"\ndescription: Submit a proposal/request for a new feature\nlabels: [\"feature\"]\nbody:\n  "
  },
  {
    "path": ".github/dependabot.yml.disabled",
    "chars": 216,
    "preview": "version: 2\nupdates:\n  - package-ecosystem: pip\n    directory: \"/\"\n    schedule:\n      interval: \"weekly\"\n    groups:\n   "
  },
  {
    "path": ".github/scripts/auditwheel_show.py",
    "chars": 739,
    "preview": "import argparse\nimport subprocess\n\n\ndef main():\n    ap = argparse.ArgumentParser()\n    ap.add_argument(\"wheels\", nargs=\""
  },
  {
    "path": ".github/scripts/build-cpu.sh",
    "chars": 449,
    "preview": "#!/bin/bash\ndeclare build_arch\ndeclare build_os\n\nset -xeuo pipefail\n\npip install cmake==3.28.3\n\nif [ \"${build_os:0:5}\" ="
  },
  {
    "path": ".github/scripts/build-cuda.sh",
    "chars": 2009,
    "preview": "#!/bin/bash\ndeclare build_arch\ndeclare build_os\ndeclare cuda_version\ndeclare cuda_targets\n\nset -xeuo pipefail\n\nif [[ -v "
  },
  {
    "path": ".github/scripts/build-rocm.sh",
    "chars": 1084,
    "preview": "#!/bin/bash\ndeclare build_arch\ndeclare build_os\ndeclare rocm_version\n\nset -xeuo pipefail\nbnb_rocm_arch=\"gfx90a;gfx942;gf"
  },
  {
    "path": ".github/scripts/build-xpu-windows.bat",
    "chars": 1212,
    "preview": "set INTEL_DLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/75d4eb97-914a-4a95-852c-7b9733d80f74/intel"
  },
  {
    "path": ".github/scripts/build-xpu.sh",
    "chars": 773,
    "preview": "#!/bin/bash\ndeclare build_os\n\nset -xeuo pipefail\n\n# We currently only build XPU on Linux.\nif [ \"${build_os:0:6}\" == ubun"
  },
  {
    "path": ".github/scripts/set_platform_tag.py",
    "chars": 834,
    "preview": "import argparse\nimport platform\nimport sys\n\n\ndef get_platform_tag(architecture):\n    system = platform.system()\n\n    if "
  },
  {
    "path": ".github/workflows/build_documentation.yml",
    "chars": 596,
    "preview": "name: Build documentation\n\non:\n  push:\n    branches:\n      - main\n      - doc-builder*\n      - v*-release\n\njobs:\n  build"
  },
  {
    "path": ".github/workflows/build_pr_documentation.yml",
    "chars": 726,
    "preview": "name: Build PR Documentation\n\non:\n  pull_request:\n\nconcurrency:\n  group: ${{ github.workflow }}-${{ github.head_ref || g"
  },
  {
    "path": ".github/workflows/lint.yml",
    "chars": 328,
    "preview": "name: Lint\n\non:\n  push:\n    branches:\n      - main\n  pull_request:\n\njobs:\n  Lint:\n    runs-on: ubuntu-latest\n    steps:\n"
  },
  {
    "path": ".github/workflows/python-package.yml",
    "chars": 14006,
    "preview": "name: Python package\n\non:\n  push: {}\n  pull_request:\n    branches: [main]\n    paths:\n      - \".github/workflows/python-p"
  },
  {
    "path": ".github/workflows/stale.yml.disabled",
    "chars": 553,
    "preview": "name: Stale Bot\n\non:\n  schedule:\n    - cron: \"0 15 * * *\"\n\njobs:\n  close_stale_issues:\n    name: Close Stale Issues\n    "
  },
  {
    "path": ".github/workflows/test-runner.yml",
    "chars": 7640,
    "preview": "name: Test Runner\n\non:\n  workflow_call:\n    inputs:\n      platform:\n        type: string\n        required: true\n        "
  },
  {
    "path": ".github/workflows/tests-nightly.yml",
    "chars": 3299,
    "preview": "name: Nightly Tests\n\non:\n  workflow_dispatch:\n  schedule:\n    # Every day at 02:15 AM UTC\n    - cron: \"15 2 * * *\"\n\nconc"
  },
  {
    "path": ".github/workflows/tests-pr.yml",
    "chars": 2921,
    "preview": "name: PR Tests\n\non:\n  pull_request:\n    types: [opened, synchronize, reopened]\n    branches: [main]\n    paths:\n      - \""
  },
  {
    "path": ".github/workflows/upload_pr_documentation.yml",
    "chars": 477,
    "preview": "name: Upload PR Documentation\n\non:\n  workflow_run:\n    workflows: [\"Build PR Documentation\"]\n    types:\n      - complete"
  },
  {
    "path": ".gitignore",
    "chars": 2132,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n*.so\n*.dll\n*.dylib\n*.o\n*.obj\n*.air\n*.metallib\n"
  },
  {
    "path": ".pre-commit-config.yaml",
    "chars": 743,
    "preview": "repos:\n  - repo: https://github.com/astral-sh/ruff-pre-commit\n    rev: v0.14.3\n    hooks:\n      - id: ruff\n        args:"
  },
  {
    "path": ".vscode/extensions.json",
    "chars": 114,
    "preview": "{\n    \"recommendations\": [\n        \"ms-python.python\",\n        \"charliermarsh.ruff\",\n        \"twxs.cmake\"\n    ]\n}\n"
  },
  {
    "path": ".vscode/settings.json",
    "chars": 134,
    "preview": "{\n    \"ruff.fixAll\": true,\n    \"ruff.lint.run\": \"onType\",\n    \"editor.codeActionsOnSave\": {\n        \"source.fixAll\": \"al"
  },
  {
    "path": "CHANGELOG.md",
    "chars": 26559,
    "preview": "### v0.45.1\n\n#### Improvements:\n\n* Compatibility for `triton>=3.2.0`\n* Moved package configuration to `pyproject.toml`\n*"
  },
  {
    "path": "CLAUDE.md",
    "chars": 3401,
    "preview": "# MANDATORY: Use git worktrees for all branch work\n\nNEVER work on a fix or feature branch inside the main `~/git/bitsand"
  },
  {
    "path": "CMakeLists.txt",
    "chars": 18127,
    "preview": "# This CMake config hopefully makes it easier to compile.\n# Ensure the CUDA Toolkit is available on your path. Then run:"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "chars": 3541,
    "preview": "# Code of Conduct\n\n## Our Pledge\n\nIn the interest of fostering an open and welcoming environment, we as\ncontributors and"
  },
  {
    "path": "COMPILE_H100_L40.md",
    "chars": 3186,
    "preview": "# Compiling bitsandbytes for H100 and L40 GPUs\n\nThis guide shows how to compile bitsandbytes from source specifically op"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 859,
    "preview": "# Contributing to bitsandbytes\nWe want to make contributing to this project as easy and transparent as\npossible.\n\n## Pul"
  },
  {
    "path": "LICENSE",
    "chars": 1086,
    "preview": "MIT License\n\nCopyright (c) Facebook, Inc. and its affiliates.\n\nPermission is hereby granted, free of charge, to any pers"
  },
  {
    "path": "MANIFEST.in",
    "chars": 48,
    "preview": "include CMakeLists.txt\ngraft csrc\ngraft include\n"
  },
  {
    "path": "NOTICE.md",
    "chars": 171,
    "preview": "The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license"
  },
  {
    "path": "README.md",
    "chars": 7681,
    "preview": "<p align=\"center\"><img src=\"https://avatars.githubusercontent.com/u/175231607?s=200&v=4\" alt=\"\"></p>\n<h1 align=\"center\">"
  },
  {
    "path": "SECURITY.md",
    "chars": 1460,
    "preview": "# Security Policy\n\n## Supported Versions\n\nWe provide security updates for the latest stable minor release line.\n\n| Versi"
  },
  {
    "path": "_typos.toml",
    "chars": 562,
    "preview": "[files]\n# Skip these files in typo checks\nextend-exclude = [\n    \"agents/*.md\",\n    \"csrc/xpu_ops.h\",\n    \"csrc/xpu_ops."
  },
  {
    "path": "agents/api_surface.md",
    "chars": 44928,
    "preview": "# bitsandbytes Public API Surface\n\nThis document catalogs every public symbol in the bitsandbytes library, organized by\n"
  },
  {
    "path": "agents/architecture_guide.md",
    "chars": 50865,
    "preview": "# bitsandbytes Architecture Guide\n\nThis document provides a comprehensive architecture reference for agents reviewing pu"
  },
  {
    "path": "agents/code_standards.md",
    "chars": 45225,
    "preview": "# bitsandbytes Code Standards\n\nThis document defines the coding standards, patterns, and conventions for the bitsandbyte"
  },
  {
    "path": "agents/dispatch_guide.md",
    "chars": 15896,
    "preview": "# Agent Dispatch Guide\n\nYou are the Dispatcher. Your job is to analyze open GitHub issues for bitsandbytes, identify iss"
  },
  {
    "path": "agents/downstream_integrations.md",
    "chars": 44378,
    "preview": "# Downstream Integrations Guide\n\nThis document catalogs every major downstream consumer of bitsandbytes, the specific AP"
  },
  {
    "path": "agents/fetch_issues.py",
    "chars": 8870,
    "preview": "#!/usr/bin/env python3\n\"\"\"Fetch all issues (open and closed) from a GitHub repository via GraphQL and store as structure"
  },
  {
    "path": "agents/github_tools_guide.md",
    "chars": 5614,
    "preview": "# Using GitHub Tools for bitsandbytes Issue Analysis\n\nThe `agents/` directory contains scripts for fetching and querying"
  },
  {
    "path": "agents/issue_maintenance_guide.md",
    "chars": 6008,
    "preview": "# Issue Maintenance Guide\n\nYou are an issue maintenance agent. Your job is to review open GitHub issues for bitsandbytes"
  },
  {
    "path": "agents/issue_patterns.md",
    "chars": 22848,
    "preview": "# Common Issue Patterns in bitsandbytes\n\nThis document catalogs recurring issue patterns across the bitsandbytes issue t"
  },
  {
    "path": "agents/issue_triage_workflow.md",
    "chars": 7417,
    "preview": "# Issue Triage Workflow: Human + Agent Collaboration\n\nThis document describes the interactive workflow for triaging GitH"
  },
  {
    "path": "agents/linting_guide.md",
    "chars": 4620,
    "preview": "# Linting Guide\n\nThis project enforces linting and formatting via CI on every pull request. The Lint workflow runs `pre-"
  },
  {
    "path": "agents/pr_review_guide.md",
    "chars": 79636,
    "preview": "# Pull Request Review Guide\n\nThis document defines the complete workflow for reviewing pull requests to bitsandbytes.\nIt"
  },
  {
    "path": "agents/query_issues.py",
    "chars": 20473,
    "preview": "#!/usr/bin/env python3\n\"\"\"Search and query GitHub issues from the local JSON data file.\n\nOptimized for agent consumption"
  },
  {
    "path": "agents/security_guide.md",
    "chars": 57467,
    "preview": "# bitsandbytes Security Review Guide\n\nThis document defines the security review checklist for pull requests to the bitsa"
  },
  {
    "path": "agents/testing_guide.md",
    "chars": 5904,
    "preview": "# Testing Guide for bitsandbytes\n\n## Quick Start\n\nRun the full test suite with optimal parallelization:\n\n```bash\npytest "
  },
  {
    "path": "agents/worktree_guide.md",
    "chars": 1651,
    "preview": "# Worktree conventions for bitsandbytes\n\nFor general worktree concepts, setup, and the worktree registry, see `~/git/lab"
  },
  {
    "path": "benchmarking/README.md",
    "chars": 10854,
    "preview": "# Benchmarking\n\n## Inference\nEnd-to-end inference benchmarking can be performed using the 🤗 [`optimum-benchmark`](https:"
  },
  {
    "path": "benchmarking/inference_benchmark.py",
    "chars": 5982,
    "preview": "\"\"\"\nInference benchmarking tool.\n\nRequirements:\n    transformers\n    accelerate\n    bitsandbytes\n    optimum-benchmark\n\n"
  },
  {
    "path": "benchmarking/int8/int8_benchmark.py",
    "chars": 1659,
    "preview": "\"\"\"\nBasic benchmark for text generation.\n\nUsage: python benchmarking/int8/int8_benchmark.py\n\"\"\"\n\nimport time\n\nimport tor"
  },
  {
    "path": "benchmarking/int8/training_benchmark.py",
    "chars": 6872,
    "preview": "\"\"\"\nExtracted from tests/test_functional.py\n\nUsage: pytest benchmarking/int8/training_benchmark.py\n\"\"\"\n\nimport time\n\nimp"
  },
  {
    "path": "benchmarking/matmul_benchmark.py",
    "chars": 8080,
    "preview": "\"\"\"\nExtracted from tests/test_functional.py\n\nUsage: pytest benchmarking/matmul_benchmark.py\n\"\"\"\n\nimport time\n\nimport pyt"
  },
  {
    "path": "benchmarking/optimizer_benchmark.py",
    "chars": 1701,
    "preview": "\"\"\"\nExtracted from tests/test_optim.py\n\nUsage: pytest benchmarking/optimizer_benchmark.py\n\"\"\"\n\nimport time\n\nimport pytes"
  },
  {
    "path": "benchmarking/xpu/inference_benchmark.py",
    "chars": 7566,
    "preview": "import argparse\nimport time\n\n# import intel_extension_for_pytorch as ipex\nimport numpy as np\nimport torch\nfrom transform"
  },
  {
    "path": "bitsandbytes/__init__.py",
    "chars": 2227,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/__main__.py",
    "chars": 90,
    "preview": "if __name__ == \"__main__\":\n    from bitsandbytes.diagnostics.main import main\n\n    main()\n"
  },
  {
    "path": "bitsandbytes/_ops.py",
    "chars": 15014,
    "preview": "from collections.abc import Sequence\nfrom math import prod\nfrom typing import Optional\n\nimport torch\n\n_IS_TORCH_GTE_24 ="
  },
  {
    "path": "bitsandbytes/autograd/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "bitsandbytes/autograd/_functions.py",
    "chars": 14310,
    "preview": "from dataclasses import dataclass\nimport logging\nfrom math import prod\nfrom typing import Optional\nimport warnings\nfrom "
  },
  {
    "path": "bitsandbytes/backends/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "bitsandbytes/backends/cpu/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "bitsandbytes/backends/cpu/ops.py",
    "chars": 11349,
    "preview": "from collections.abc import Sequence\nimport ctypes as ct\nimport logging\nfrom math import prod\n\nimport torch\n\nfrom bitsan"
  },
  {
    "path": "bitsandbytes/backends/cuda/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "bitsandbytes/backends/cuda/ops.py",
    "chars": 24841,
    "preview": "from collections.abc import Sequence\nimport ctypes as ct\nfrom math import prod\nfrom typing import Optional\n\nimport torch"
  },
  {
    "path": "bitsandbytes/backends/default/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "bitsandbytes/backends/default/ops.py",
    "chars": 19131,
    "preview": "from collections.abc import Sequence\nfrom functools import wraps\nfrom math import prod, sqrt\nfrom typing import Optional"
  },
  {
    "path": "bitsandbytes/backends/hpu/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "bitsandbytes/backends/hpu/ops.py",
    "chars": 1469,
    "preview": "from collections.abc import Sequence\nimport math\n\nimport torch\n\nfrom ..._ops import register_kernel\nfrom ..utils import "
  },
  {
    "path": "bitsandbytes/backends/mps/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "bitsandbytes/backends/mps/ops.py",
    "chars": 3841,
    "preview": "\"\"\"MPS backend for bitsandbytes 4-bit quantization ops.\n\nUses Metal kernels from kernels-community/bitsandbytes-mps via "
  },
  {
    "path": "bitsandbytes/backends/triton/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "bitsandbytes/backends/triton/kernels_4bit.py",
    "chars": 22150,
    "preview": "import torch\n\nimport triton\nimport triton.language as tl\n\n\n# Triton implementation of similar CUDA kernel to avoid loadi"
  },
  {
    "path": "bitsandbytes/backends/triton/kernels_8bit_quant.py",
    "chars": 6542,
    "preview": "import torch\n\nimport triton\nimport triton.language as tl\n\n\n# @triton.autotune(\n#     configs=[\n#         # triton.Config"
  },
  {
    "path": "bitsandbytes/backends/triton/kernels_optim.py",
    "chars": 37439,
    "preview": "import math\nfrom typing import Optional\n\nimport torch\n\nimport triton\nimport triton.language as tl\n\n# from triton.languag"
  },
  {
    "path": "bitsandbytes/backends/triton/ops.py",
    "chars": 10398,
    "preview": "from collections.abc import Sequence\nfrom typing import Optional\n\nimport torch\n\nfrom . import kernels_4bit, kernels_8bit"
  },
  {
    "path": "bitsandbytes/backends/utils.py",
    "chars": 1809,
    "preview": "import subprocess\n\nfrom packaging import version\nimport torch\n\ntry:\n    import triton  # noqa: F401\n    import triton.la"
  },
  {
    "path": "bitsandbytes/backends/xpu/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "bitsandbytes/backends/xpu/ops.py",
    "chars": 8044,
    "preview": "from collections.abc import Sequence\nimport ctypes as ct\nimport logging\n\nfrom packaging import version\nimport torch\n\nfro"
  },
  {
    "path": "bitsandbytes/cextension.py",
    "chars": 14970,
    "preview": "import ctypes as ct\nimport functools\nimport logging\nimport os\nfrom pathlib import Path\nimport re\nfrom typing import Opti"
  },
  {
    "path": "bitsandbytes/consts.py",
    "chars": 380,
    "preview": "from pathlib import Path\nimport platform\n\nDYNAMIC_LIBRARY_SUFFIX = {\n    \"Darwin\": \".dylib\",\n    \"Linux\": \".so\",\n    \"Wi"
  },
  {
    "path": "bitsandbytes/cuda_specs.py",
    "chars": 4509,
    "preview": "import dataclasses\nfrom functools import lru_cache\nimport logging\nimport platform\nimport re\nimport subprocess\nfrom typin"
  },
  {
    "path": "bitsandbytes/diagnostics/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "bitsandbytes/diagnostics/cuda.py",
    "chars": 8685,
    "preview": "from collections.abc import Iterable, Iterator\nimport logging\nimport os\nfrom pathlib import Path\n\nimport torch\n\nfrom bit"
  },
  {
    "path": "bitsandbytes/diagnostics/main.py",
    "chars": 3347,
    "preview": "import importlib\nimport platform\nimport sys\nimport traceback\n\nimport torch\n\nfrom bitsandbytes import __version__ as bnb_"
  },
  {
    "path": "bitsandbytes/diagnostics/utils.py",
    "chars": 284,
    "preview": "import textwrap\n\nHEADER_WIDTH = 60\n\n\ndef print_header(txt: str, width: int = HEADER_WIDTH, filler: str = \"=\") -> None:\n "
  },
  {
    "path": "bitsandbytes/functional.py",
    "chars": 62793,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/nn/__init__.py",
    "chars": 432,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/nn/modules.py",
    "chars": 42903,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/nn/parametrize.py",
    "chars": 7297,
    "preview": "from functools import partial\nfrom typing import Any, Literal, Optional\n\nimport torch\nimport torch.nn as nn\nimport torch"
  },
  {
    "path": "bitsandbytes/optim/__init__.py",
    "chars": 881,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/optim/adagrad.py",
    "chars": 6344,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/optim/adam.py",
    "chars": 12832,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/optim/adamw.py",
    "chars": 12449,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/optim/ademamix.py",
    "chars": 12750,
    "preview": "from collections.abc import Iterable\nimport math\nfrom typing import Literal, Optional\n\nimport torch\n\nimport bitsandbytes"
  },
  {
    "path": "bitsandbytes/optim/lamb.py",
    "chars": 6413,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/optim/lars.py",
    "chars": 8326,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/optim/lion.py",
    "chars": 8309,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/optim/optimizer.py",
    "chars": 27561,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/optim/rmsprop.py",
    "chars": 6134,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/optim/sgd.py",
    "chars": 4916,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "bitsandbytes/py.typed",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "bitsandbytes/utils.py",
    "chars": 6928,
    "preview": "import json\nimport logging\nimport shlex\nimport subprocess\n\nimport torch\n\nlogger = logging.getLogger(__name__)\n\n\ndef outl"
  },
  {
    "path": "check_bnb_install.py",
    "chars": 333,
    "preview": "import torch\n\nimport bitsandbytes as bnb\n\np = torch.nn.Parameter(torch.rand(10, 10).cuda())\na = torch.rand(10, 10).cuda("
  },
  {
    "path": "csrc/common.cuh",
    "chars": 2527,
    "preview": "// common.cuh — Architecture constants and feature detection\n\n#pragma once\n\n#include \"compat.cuh\"\n\n// Warp size\n\n#if BNB"
  },
  {
    "path": "csrc/common.h",
    "chars": 101,
    "preview": "#pragma once\n\ntypedef enum DataType_t {\n    General8bit = 0,\n    FP4 = 1,\n    NF4 = 2,\n} DataType_t;\n"
  },
  {
    "path": "csrc/compat.cuh",
    "chars": 5996,
    "preview": "// compat.cuh — Platform abstraction layer for CUDA/HIP portability\n//\n// This header resolves ALL mechanical difference"
  },
  {
    "path": "csrc/compat_device.cuh",
    "chars": 1227,
    "preview": "// compat_device.cuh — Device-only portability layer (CUB, reduction ops, MMA)\n//\n// Include this from .cu kernel files "
  },
  {
    "path": "csrc/cpu_ops.cpp",
    "chars": 23373,
    "preview": "#include \"cpu_ops.h\"\n#include <algorithm>\n#include <cmath>\n#include <cstdio>\n#include <thread>\n#include <vector>\n\n#ifdef"
  },
  {
    "path": "csrc/cpu_ops.h",
    "chars": 10014,
    "preview": "#ifndef BITSANDBYTES_CPU_OPS_H\n#define BITSANDBYTES_CPU_OPS_H\n\n#include \"common.h\"\n#include <algorithm>\n#include <cmath>"
  },
  {
    "path": "csrc/kernels.cu",
    "chars": 78007,
    "preview": "// Copyright (c) Facebook, Inc. and its affiliates.\n//\n// This source code is licensed under the MIT license found in th"
  },
  {
    "path": "csrc/kernels.cuh",
    "chars": 4153,
    "preview": "// Copyright (c) Facebook, Inc. and its affiliates.\n//\n// This source code is licensed under the MIT license found in th"
  },
  {
    "path": "csrc/mps_kernels.metal",
    "chars": 2796,
    "preview": "#include <metal_stdlib>\nusing namespace metal;\n\n#define HLF_MAX 65504\n#define TH 1024\n#define NUM 4\n#define NUM_BLOCK 40"
  },
  {
    "path": "csrc/mps_ops.mm",
    "chars": 1730,
    "preview": "#import <MetalPerformanceShadersGraph/MetalPerformanceShadersGraph.h>\n\n#define HLF_MAX 65504\n#define TH 1024\n#define NUM"
  },
  {
    "path": "csrc/ops.cu",
    "chars": 26144,
    "preview": "// Copyright (c) Facebook, Inc. and its affiliates.\n//\n// This source code is licensed under the MIT license found in th"
  },
  {
    "path": "csrc/ops.cuh",
    "chars": 4153,
    "preview": "// Copyright (c) Facebook, Inc. and its affiliates.\n//\n// This source code is licensed under the MIT license found in th"
  },
  {
    "path": "csrc/pythonInterface.cpp",
    "chars": 35931,
    "preview": "// Copyright (c) Facebook, Inc. and its affiliates.\n//\n// This source code is licensed under the MIT license found in th"
  },
  {
    "path": "csrc/xpu_kernels.cpp",
    "chars": 11272,
    "preview": "#include \"xpu_kernels.h\"\n#include <bit>\n#include <cmath>\n#include <iostream>\n\n#include <sycl/sycl.hpp>\n\ninline float dDe"
  },
  {
    "path": "csrc/xpu_kernels.h",
    "chars": 1496,
    "preview": "#include <float.h>\n#include <xpu_ops.h>\n\n#ifndef xpu_kernels\n#define xpu_kernels\n\ntemplate <typename T, int TILE_SIZE, i"
  },
  {
    "path": "csrc/xpu_ops.cpp",
    "chars": 4925,
    "preview": "#include <xpu_kernels.h>\n#include <xpu_ops.h>\n\ntemplate <typename T, int DATA_TYPE>\nvoid dequantizeBlockwise(\n    float*"
  },
  {
    "path": "csrc/xpu_ops.h",
    "chars": 1283,
    "preview": "#ifndef xpu_ops_H\n#define xpu_ops_H\n\n#include <assert.h>\n#include <common.h>\n#include <cstdint>\n#include <iostream>\n#inc"
  },
  {
    "path": "docs/source/_toctree.yml",
    "chars": 1535,
    "preview": "- title: Get started\n  sections:\n  - local: index\n    title: bitsandbytes\n  - local: installation\n    title: Installatio"
  },
  {
    "path": "docs/source/contributing.mdx",
    "chars": 1537,
    "preview": "# Contribution Guide\n\n## Setup\n\n### Setup pre-commit hooks\n- Install pre-commit hooks with `pip install pre-commit`.\n- R"
  },
  {
    "path": "docs/source/errors.mdx",
    "chars": 1543,
    "preview": "# Troubleshoot\n\n## No kernel image available\n\nThis problem arises with the cuda version loaded by bitsandbytes is not su"
  },
  {
    "path": "docs/source/explanations/optimizers.mdx",
    "chars": 5350,
    "preview": "# 8-bit optimizers\n\nStateful optimizers maintain gradient statistics over time, for example, the exponentially smoothed "
  },
  {
    "path": "docs/source/explanations/resources.mdx",
    "chars": 3876,
    "preview": "# Papers, related resources & how to cite\n\nThe below academic work is ordered in reverse chronological order.\n\n## [SpQR:"
  },
  {
    "path": "docs/source/faqs.mdx",
    "chars": 440,
    "preview": "# FAQs\n\nPlease submit your questions in [this Github Discussion thread](https://github.com/bitsandbytes-foundation/bitsa"
  },
  {
    "path": "docs/source/fsdp_qlora.md",
    "chars": 6974,
    "preview": "# FSDP-QLoRA\n\nFSDP-QLoRA combines data parallelism (FSDP enables sharding model parameters, optimizer states, and gradie"
  },
  {
    "path": "docs/source/index.mdx",
    "chars": 958,
    "preview": "# bitsandbytes\n\nbitsandbytes enables accessible large language models via k-bit quantization for PyTorch. bitsandbytes p"
  },
  {
    "path": "docs/source/installation.mdx",
    "chars": 12134,
    "preview": "# Installation Guide\n\nWelcome to the installation guide for the `bitsandbytes` library! This document provides step-by-s"
  },
  {
    "path": "docs/source/integrations.mdx",
    "chars": 6037,
    "preview": "# Integrations\n\nbitsandbytes is widely integrated with many of the libraries in the Hugging Face and wider PyTorch ecosy"
  },
  {
    "path": "docs/source/optimizers.mdx",
    "chars": 4471,
    "preview": "# 8-bit optimizers\n\nWith 8-bit optimizers, large models can be finetuned with 75% less GPU memory without losing any acc"
  },
  {
    "path": "docs/source/quickstart.mdx",
    "chars": 4026,
    "preview": "# Quickstart\n\nWelcome to bitsandbytes! This library enables accessible large language models via k-bit quantization for "
  },
  {
    "path": "docs/source/reference/functional.mdx",
    "chars": 1282,
    "preview": "# Overview\nThe `bitsandbytes.functional` API provides the low-level building blocks for the library's features.\n\n## When"
  },
  {
    "path": "docs/source/reference/nn/embeddings.mdx",
    "chars": 650,
    "preview": "# Embedding\n\nThe embedding class is used to store and retrieve word embeddings from their indices. There are two types o"
  },
  {
    "path": "docs/source/reference/nn/linear4bit.mdx",
    "chars": 744,
    "preview": "# 4-bit quantization\n\n[QLoRA](https://hf.co/papers/2305.14314) is a finetuning method that quantizes a model to 4-bits a"
  },
  {
    "path": "docs/source/reference/nn/linear8bit.mdx",
    "chars": 869,
    "preview": "# LLM.int8()\n[LLM.int8()](https://hf.co/papers/2208.07339) is a quantization method that aims to make large language mod"
  },
  {
    "path": "docs/source/reference/optim/adagrad.mdx",
    "chars": 635,
    "preview": "# AdaGrad\n\n[AdaGrad (Adaptive Gradient)](https://jmlr.org/papers/v12/duchi11a.html) is an adaptive learning rate optimiz"
  },
  {
    "path": "docs/source/reference/optim/adam.mdx",
    "chars": 1019,
    "preview": "# Adam\n\n[Adam (Adaptive moment estimation)](https://hf.co/papers/1412.6980) is an adaptive learning rate optimizer, comb"
  },
  {
    "path": "docs/source/reference/optim/adamw.mdx",
    "chars": 871,
    "preview": "# AdamW\n\n[AdamW](https://hf.co/papers/1711.05101) is a variant of the [`Adam`] optimizer that separates weight decay fro"
  },
  {
    "path": "docs/source/reference/optim/ademamix.mdx",
    "chars": 751,
    "preview": "# AdEMAMix\n\n[AdEMAMix](https://hf.co/papers/2409.03137) is a variant of the [`Adam`] optimizer.\n\nbitsandbytes also suppo"
  },
  {
    "path": "docs/source/reference/optim/lamb.mdx",
    "chars": 693,
    "preview": "# LAMB\n\n[LAMB (Layerwise adaptive large batch optimization)](https://hf.co/papers/1904.00962) is an adaptive optimizer d"
  },
  {
    "path": "docs/source/reference/optim/lars.mdx",
    "chars": 605,
    "preview": "# LARS\n\n[LARS (Layer-wise Adaptive Rate Scaling)](https:/hf.co/papers/1708.03888) is an optimizer designed for training "
  },
  {
    "path": "docs/source/reference/optim/lion.mdx",
    "chars": 748,
    "preview": "# Lion\n\n[Lion (Evolved Sign Momentum)](https://hf.co/papers/2302.06675) is a unique optimizer that uses the sign of the "
  },
  {
    "path": "docs/source/reference/optim/optim_overview.mdx",
    "chars": 1066,
    "preview": "# Overview\n\n[8-bit optimizers](https://hf.co/papers/2110.02861) reduce the memory footprint of 32-bit optimizers without"
  },
  {
    "path": "docs/source/reference/optim/rmsprop.mdx",
    "chars": 571,
    "preview": "# RMSprop\n\nRMSprop is an adaptive learning rate optimizer that is very similar to [`Adagrad`]. RMSprop stores a *weighte"
  },
  {
    "path": "docs/source/reference/optim/sgd.mdx",
    "chars": 638,
    "preview": "# SGD\n\nStochastic gradient descent (SGD) is a basic gradient descent optimizer to minimize loss given a set of model par"
  },
  {
    "path": "examples/compile_inference.py",
    "chars": 928,
    "preview": "import torch\nimport torch._dynamo\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n\n# to"
  },
  {
    "path": "examples/int8_inference_huggingface.py",
    "chars": 698,
    "preview": "import torch\nfrom transformers import LlamaForCausalLM, LlamaTokenizer\n\nMAX_NEW_TOKENS = 128\nmodel_name = \"meta-llama/Ll"
  },
  {
    "path": "examples/xpu/benchmark_paged_memory.py",
    "chars": 9222,
    "preview": "\"\"\"\nBenchmark: Paged vs Non-Paged Optimizer GPU Memory Usage.\n\nDemonstrates that paged optimizers significantly reduce G"
  },
  {
    "path": "examples/xpu/paged_xpu_training.py",
    "chars": 14358,
    "preview": "\"\"\"\nReal training case for XPU Paged Optimizer using JackFram/llama-68m + Alpaca Clean.\n\nUsage:\n    python paged_xpu_tra"
  },
  {
    "path": "install_cuda.py",
    "chars": 3704,
    "preview": "import os\nimport subprocess\nimport sys\nfrom urllib.request import urlretrieve\n\ncuda_versions = {\n    \"118\": \"https://dev"
  },
  {
    "path": "install_cuda.sh",
    "chars": 2290,
    "preview": "URL118=https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run\nURL120"
  },
  {
    "path": "pyproject.toml",
    "chars": 4903,
    "preview": "[build-system]\nrequires = [\"scikit-build-core\", \"setuptools >= 77.0.3\", \"trove-classifiers>=2025.8.6.13\"]\nbuild-backend "
  },
  {
    "path": "scripts/stale.py",
    "chars": 2261,
    "preview": "# Copyright 2023 The HuggingFace Team, the AllenNLP library authors. All rights reserved.\n#\n# Licensed under the Apache "
  },
  {
    "path": "setup.py",
    "chars": 1558,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# This source code is licensed under the MIT license found in the\n#"
  },
  {
    "path": "tests/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tests/conftest.py",
    "chars": 1320,
    "preview": "import gc\nimport random\n\nimport numpy as np\nimport pytest\nimport torch\n\n\ndef _set_seed():\n    torch.manual_seed(0)\n    t"
  },
  {
    "path": "tests/fsdp_state_dict_save.py",
    "chars": 2830,
    "preview": "\"\"\"FSDP state_dict save integration test for 4-bit quantized models (#1405).\n\nThis script must be launched via torchrun "
  },
  {
    "path": "tests/helpers.py",
    "chars": 3383,
    "preview": "import functools\nfrom io import BytesIO\nfrom itertools import product\nimport os\nimport random\nfrom typing import Any\n\nim"
  },
  {
    "path": "tests/test_autograd.py",
    "chars": 10600,
    "preview": "import pytest\nimport torch\n\nimport bitsandbytes as bnb\nfrom tests.helpers import (\n    BOOLEAN_TRIPLES,\n    TRUE_FALSE,\n"
  },
  {
    "path": "tests/test_cuda_setup_evaluator.py",
    "chars": 3280,
    "preview": "import pytest\n\nfrom bitsandbytes.cextension import HIP_ENVIRONMENT, get_cuda_bnb_library_path\nfrom bitsandbytes.cuda_spe"
  },
  {
    "path": "tests/test_functional.py",
    "chars": 51516,
    "preview": "import math\nimport platform\nimport random\nimport time\n\nimport einops\nfrom packaging import version\nimport pytest\nimport "
  },
  {
    "path": "tests/test_generation.py",
    "chars": 4182,
    "preview": "from itertools import product\nimport math\n\nimport pytest\nimport torch\n\nfrom tests.helpers import TRUE_FALSE, describe_dt"
  },
  {
    "path": "tests/test_linear4bit.py",
    "chars": 19974,
    "preview": "import copy\nimport os\nimport pathlib\nimport pickle\nimport platform\nimport subprocess\nimport sys\nfrom tempfile import Tem"
  },
  {
    "path": "tests/test_linear8bitlt.py",
    "chars": 12588,
    "preview": "from contextlib import nullcontext\nimport copy\nimport os\nimport pickle\nimport platform\nimport sys\nfrom tempfile import T"
  },
  {
    "path": "tests/test_modules.py",
    "chars": 18616,
    "preview": "import contextlib\nimport inspect\nimport logging\n\nimport pytest\nimport torch\nfrom torch import nn\n\nimport bitsandbytes as"
  },
  {
    "path": "tests/test_ops.py",
    "chars": 17719,
    "preview": "from math import prod\n\nimport pytest\nimport torch\n\nimport bitsandbytes\nfrom tests.helpers import TRUE_FALSE, get_availab"
  },
  {
    "path": "tests/test_optim.py",
    "chars": 25277,
    "preview": "import os\nfrom os.path import join\nimport shutil\nimport sys\nimport time\nimport uuid\n\nfrom lion_pytorch import Lion\nimpor"
  },
  {
    "path": "tests/test_parametrize.py",
    "chars": 20971,
    "preview": "import pytest\nimport torch\nimport torch.nn as nn\n\nfrom bitsandbytes import functional as F\nfrom bitsandbytes.nn.parametr"
  }
]

About this extraction

This page contains the full source code of the bitsandbytes-foundation/bitsandbytes GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 178 files (1.5 MB), approximately 408.3k tokens, and a symbol index with 830 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
