Repository: huggingface/text-generation-inference
Branch: main
Commit: db931fcf42da
Files: 864
Total size: 6.3 MB

Directory structure:
gitextract_uvlvpncm/
├── .dockerignore
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug-report.yml
│   │   ├── config.yml
│   │   ├── feature-request.yml
│   │   └── new-model-addition.yml
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── workflows/
│       ├── autodocs.yaml
│       ├── build.yaml
│       ├── build_documentation.yaml
│       ├── build_pr_documentation.yaml
│       ├── ci_build.yaml
│       ├── client-tests.yaml
│       ├── codeql.yml
│       ├── integration_tests.yaml
│       ├── load_test.yaml
│       ├── nix_build.yaml
│       ├── nix_cache.yaml
│       ├── nix_tests.yaml
│       ├── stale.yaml
│       ├── tests.yaml
│       ├── trufflehog.yaml
│       └── upload_pr_documentation.yaml
├── .gitignore
├── .pre-commit-config.yaml
├── .redocly.lint-ignore.yaml
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Cargo.toml
├── Dockerfile
├── Dockerfile.neuron
├── Dockerfile.nix
├── Dockerfile_amd
├── Dockerfile_gaudi
├── Dockerfile_intel
├── Dockerfile_llamacpp
├── Dockerfile_trtllm
├── LICENSE
├── Makefile
├── README.md
├── assets/
│   └── tgi_grafana.json
├── backends/
│   ├── client/
│   │   ├── Cargo.toml
│   │   ├── build.rs
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── v2/
│   │       │   ├── client.rs
│   │       │   ├── mod.rs
│   │       │   └── sharded_client.rs
│   │       └── v3/
│   │           ├── client.rs
│   │           ├── mod.rs
│   │           └── sharded_client.rs
│   ├── gaudi/
│   │   ├── Makefile
│   │   ├── README.md
│   │   ├── examples/
│   │   │   └── docker_commands/
│   │   │       └── docker_commands.md
│   │   ├── server/
│   │   │   ├── .gitignore
│   │   │   ├── Makefile
│   │   │   ├── Makefile-awq
│   │   │   ├── Makefile-eetq
│   │   │   ├── Makefile-fbgemm
│   │   │   ├── Makefile-flash-att
│   │   │   ├── Makefile-flash-att-v2
│   │   │   ├── Makefile-selective-scan
│   │   │   ├── Makefile-vllm
│   │   │   ├── README.md
│   │   │   ├── dill-0.3.7-patch.sh
│   │   │   ├── dill-0.3.8-patch.sh
│   │   │   ├── pyproject.toml
│   │   │   ├── requirements.txt
│   │   │   └── text_generation_server/
│   │   │       ├── __init__.py
│   │   │       ├── adapters/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── config.py
│   │   │       │   ├── lora.py
│   │   │       │   └── weights.py
│   │   │       ├── cache.py
│   │   │       ├── cli.py
│   │   │       ├── interceptor.py
│   │   │       ├── layers/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── attention/
│   │   │       │   │   ├── __init__.py
│   │   │       │   │   ├── common.py
│   │   │       │   │   ├── hpu.py
│   │   │       │   │   └── kv_cache.py
│   │   │       │   ├── awq/
│   │   │       │   │   ├── conversion_utils.py
│   │   │       │   │   └── quantize/
│   │   │       │   │       ├── __init__.py
│   │   │       │   │       └── hpu.py
│   │   │       │   ├── bnb.py
│   │   │       │   ├── compressed_tensors/
│   │   │       │   │   ├── __init__.py
│   │   │       │   │   ├── loader.py
│   │   │       │   │   └── w8an_fp.py
│   │   │       │   ├── conv.py
│   │   │       │   ├── exl2.py
│   │   │       │   ├── fp8.py
│   │   │       │   ├── gptq/
│   │   │       │   │   ├── __init__.py
│   │   │       │   │   ├── hpu.py
│   │   │       │   │   ├── quantize.py
│   │   │       │   │   └── utils.py
│   │   │       │   ├── layernorm.py
│   │   │       │   ├── linear.py
│   │   │       │   ├── lora.py
│   │   │       │   ├── medusa.py
│   │   │       │   ├── mlp.py
│   │   │       │   ├── moe/
│   │   │       │   │   ├── __init__.py
│   │   │       │   │   ├── fp8.py
│   │   │       │   │   ├── fused_moe.py
│   │   │       │   │   └── unquantized.py
│   │   │       │   ├── rotary.py
│   │   │       │   ├── speculative.py
│   │   │       │   └── tensor_parallel.py
│   │   │       ├── models/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── custom_modeling/
│   │   │       │   │   ├── __init__.py
│   │   │       │   │   ├── bloom_modeling.py
│   │   │       │   │   ├── clip.py
│   │   │       │   │   ├── flash_cohere_modeling.py
│   │   │       │   │   ├── flash_dbrx_modeling.py
│   │   │       │   │   ├── flash_deepseek_v2_modeling.py
│   │   │       │   │   ├── flash_deepseek_v3_modeling.py
│   │   │       │   │   ├── flash_gemma2_modeling.py
│   │   │       │   │   ├── flash_gemma3_modeling.py
│   │   │       │   │   ├── flash_gemma_modeling.py
│   │   │       │   │   ├── flash_gpt2_modeling.py
│   │   │       │   │   ├── flash_gptj_modeling.py
│   │   │       │   │   ├── flash_llama4_modeling.py
│   │   │       │   │   ├── flash_llama_modeling.py
│   │   │       │   │   ├── flash_llava_next.py
│   │   │       │   │   ├── flash_mistral_modeling.py
│   │   │       │   │   ├── flash_mixtral_modeling.py
│   │   │       │   │   ├── flash_mllama.py
│   │   │       │   │   ├── flash_neox_modeling.py
│   │   │       │   │   ├── flash_pali_gemma_modeling.py
│   │   │       │   │   ├── flash_phi_modeling.py
│   │   │       │   │   ├── flash_phi_moe_modeling.py
│   │   │       │   │   ├── flash_qwen2_modeling.py
│   │   │       │   │   ├── flash_qwen3_modeling.py
│   │   │       │   │   ├── flash_qwen3_moe_modeling.py
│   │   │       │   │   ├── flash_rw_modeling.py
│   │   │       │   │   ├── flash_santacoder_modeling.py
│   │   │       │   │   ├── flash_starcoder2_modeling.py
│   │   │       │   │   ├── idefics2.py
│   │   │       │   │   ├── idefics3.py
│   │   │       │   │   ├── mamba_modeling.py
│   │   │       │   │   ├── qwen2_5_vl.py
│   │   │       │   │   ├── qwen2_vl.py
│   │   │       │   │   ├── siglip.py
│   │   │       │   │   └── vlm.py
│   │   │       │   ├── flash_causal_lm.py
│   │   │       │   ├── flash_vlm_causal_lm.py
│   │   │       │   ├── globals.py
│   │   │       │   ├── mllama_causal_lm.py
│   │   │       │   ├── model.py
│   │   │       │   ├── seq2seq_lm.py
│   │   │       │   └── types.py
│   │   │       ├── pb/
│   │   │       │   └── .gitignore
│   │   │       ├── server.py
│   │   │       ├── tracing.py
│   │   │       └── utils/
│   │   │           ├── __init__.py
│   │   │           ├── adapter.py
│   │   │           ├── chunks.py
│   │   │           ├── convert.py
│   │   │           ├── debug.py
│   │   │           ├── dist.py
│   │   │           ├── hub.py
│   │   │           ├── import_utils.py
│   │   │           ├── kernels.py
│   │   │           ├── log.py
│   │   │           ├── logits_process.py
│   │   │           ├── merges/
│   │   │           │   ├── strategies.py
│   │   │           │   └── utils.py
│   │   │           ├── peft.py
│   │   │           ├── prefill_chunking.py
│   │   │           ├── quantization.py
│   │   │           ├── segments.py
│   │   │           ├── sgmv.py
│   │   │           ├── speculate.py
│   │   │           ├── tokens.py
│   │   │           ├── version.py
│   │   │           ├── watermark.py
│   │   │           └── weights.py
│   │   └── tgi-entrypoint.sh
│   ├── grpc-metadata/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       └── lib.rs
│   ├── llamacpp/
│   │   ├── Cargo.toml
│   │   ├── README.md
│   │   ├── build.rs
│   │   ├── requirements.txt
│   │   └── src/
│   │       ├── backend.rs
│   │       ├── llamacpp.rs
│   │       ├── main.rs
│   │       └── quantize.rs
│   ├── neuron/
│   │   ├── Cargo.toml
│   │   ├── Makefile
│   │   ├── README.md
│   │   ├── server/
│   │   │   ├── .gitignore
│   │   │   ├── Makefile
│   │   │   ├── build-requirements.txt
│   │   │   ├── pyproject.toml
│   │   │   └── text_generation_server/
│   │   │       ├── cli.py
│   │   │       ├── generator.py
│   │   │       ├── interceptor.py
│   │   │       ├── model.py
│   │   │       ├── server.py
│   │   │       └── tgi_env.py
│   │   ├── tests/
│   │   │   ├── conftest.py
│   │   │   ├── fixtures/
│   │   │   │   └── model.py
│   │   │   ├── prune_test_models.py
│   │   │   ├── pytest.ini
│   │   │   ├── requirements.txt
│   │   │   ├── server/
│   │   │   │   ├── helpers.py
│   │   │   │   ├── test_cached_model.py
│   │   │   │   ├── test_continuous_batching.py
│   │   │   │   ├── test_decode.py
│   │   │   │   ├── test_generator_slot.py
│   │   │   │   ├── test_info.py
│   │   │   │   └── test_prefill.py
│   │   │   └── test_entry_point.py
│   │   ├── tgi-entrypoint.sh
│   │   └── tgi_entry_point.py
│   ├── trtllm/
│   │   ├── CMakeLists.txt
│   │   ├── Cargo.toml
│   │   ├── README.md
│   │   ├── build.rs
│   │   ├── cmake/
│   │   │   ├── json.cmake
│   │   │   ├── spdlog.cmake
│   │   │   ├── trtllm.cmake
│   │   │   └── utils/
│   │   │       └── detect_cuda_arch.cu
│   │   ├── csrc/
│   │   │   ├── backend.cpp
│   │   │   ├── backend.hpp
│   │   │   ├── ffi.hpp
│   │   │   └── hardware.hpp
│   │   ├── scripts/
│   │   │   ├── install_tensorrt.sh
│   │   │   └── setup_sccache.py
│   │   ├── src/
│   │   │   ├── errors.rs
│   │   │   ├── lib.rs
│   │   │   ├── looper.rs
│   │   │   ├── main.rs
│   │   │   └── utils.rs
│   │   └── tests/
│   │       ├── test_backend.cpp
│   │       └── test_hardware.cpp
│   ├── v2/
│   │   ├── Cargo.toml
│   │   ├── build.rs
│   │   └── src/
│   │       ├── backend.rs
│   │       ├── client/
│   │       │   ├── grpc_client.rs
│   │       │   ├── mod.rs
│   │       │   └── sharded_client.rs
│   │       ├── lib.rs
│   │       ├── main.rs
│   │       └── queue.rs
│   └── v3/
│       ├── Cargo.toml
│       ├── benches/
│       │   └── prefix_cache.rs
│       ├── build.rs
│       └── src/
│           ├── backend.rs
│           ├── block_allocator.rs
│           ├── client/
│           │   ├── grpc_client.rs
│           │   ├── mod.rs
│           │   └── sharded_client.rs
│           ├── lib.rs
│           ├── main.rs
│           ├── queue.rs
│           └── radix.rs
├── benchmark/
│   ├── Cargo.toml
│   ├── README.md
│   └── src/
│       ├── app.rs
│       ├── event.rs
│       ├── generation.rs
│       ├── lib.rs
│       ├── main.rs
│       ├── table.rs
│       └── utils.rs
├── clients/
│   └── python/
│       ├── .gitignore
│       ├── Makefile
│       ├── README.md
│       ├── pyproject.toml
│       ├── tests/
│       │   ├── conftest.py
│       │   ├── test_client.py
│       │   ├── test_errors.py
│       │   ├── test_inference_api.py
│       │   └── test_types.py
│       └── text_generation/
│           ├── __init__.py
│           ├── client.py
│           ├── errors.py
│           ├── inference_api.py
│           └── types.py
├── crate-hashes.json
├── docs/
│   ├── README.md
│   ├── index.html
│   ├── openapi.json
│   └── source/
│       ├── _toctree.yml
│       ├── architecture.md
│       ├── backends/
│       │   ├── gaudi.mdx
│       │   ├── llamacpp.md
│       │   ├── neuron.md
│       │   └── trtllm.md
│       ├── basic_tutorials/
│       │   ├── consuming_tgi.md
│       │   ├── gated_model_access.md
│       │   ├── monitoring.md
│       │   ├── non_core_models.md
│       │   ├── preparing_model.md
│       │   ├── safety.md
│       │   ├── train_medusa.md
│       │   ├── using_cli.md
│       │   ├── using_guidance.md
│       │   └── visual_language_models.md
│       ├── conceptual/
│       │   ├── chunking.md
│       │   ├── external.md
│       │   ├── flash_attention.md
│       │   ├── guidance.md
│       │   ├── lora.md
│       │   ├── paged_attention.md
│       │   ├── quantization.md
│       │   ├── safetensors.md
│       │   ├── speculation.md
│       │   ├── streaming.md
│       │   └── tensor_parallelism.md
│       ├── index.md
│       ├── installation.md
│       ├── installation_amd.md
│       ├── installation_gaudi.md
│       ├── installation_inferentia.md
│       ├── installation_intel.md
│       ├── installation_nvidia.md
│       ├── installation_tpu.md
│       ├── multi_backend_support.md
│       ├── quicktour.md
│       ├── reference/
│       │   ├── api_reference.md
│       │   ├── launcher.md
│       │   └── metrics.md
│       ├── supported_models.md
│       └── usage_statistics.md
├── flake.nix
├── integration-tests/
│   ├── conftest.py
│   ├── fixtures/
│   │   ├── gaudi/
│   │   │   └── service.py
│   │   └── neuron/
│   │       ├── export_models.py
│   │       └── service.py
│   ├── gaudi/
│   │   ├── capture_expected_outputs.py
│   │   └── test_gaudi_generate.py
│   ├── models/
│   │   ├── __snapshots__/
│   │   │   ├── test.py
│   │   │   ├── test_bloom_560m/
│   │   │   │   ├── test_bloom_560m.json
│   │   │   │   ├── test_bloom_560m_all_params.json
│   │   │   │   └── test_bloom_560m_load.json
│   │   │   ├── test_bloom_560m_sharded/
│   │   │   │   ├── test_bloom_560m_sharded.json
│   │   │   │   └── test_bloom_560m_sharded_load.json
│   │   │   ├── test_chat_llama/
│   │   │   │   └── test_flash_llama_simple.json
│   │   │   ├── test_completion_prompts/
│   │   │   │   ├── test_chat_hfhub_nousage.json
│   │   │   │   ├── test_chat_hfhub_usage.json
│   │   │   │   ├── test_chat_openai_nousage.json
│   │   │   │   ├── test_chat_openai_usage.json
│   │   │   │   ├── test_flash_llama_completion_many_prompts.json
│   │   │   │   ├── test_flash_llama_completion_many_prompts_stream.json
│   │   │   │   ├── test_flash_llama_completion_single_prompt.json
│   │   │   │   └── test_flash_llama_completion_stream_usage.json
│   │   │   ├── test_compressed_tensors_w8a8_int/
│   │   │   │   ├── test_compressed_tensors_w8a8_int.json
│   │   │   │   ├── test_compressed_tensors_w8a8_int_all_params.json
│   │   │   │   └── test_compressed_tensors_w8a8_int_load.json
│   │   │   ├── test_compressed_tensors_w8a8_int_dynamic_weight/
│   │   │   │   ├── test_compressed_tensors_w8a8_int_dynamic_weight.json
│   │   │   │   ├── test_compressed_tensors_w8a8_int_dynamic_weight_all_params.json
│   │   │   │   └── test_compressed_tensors_w8a8_int_dynamic_weight_load.json
│   │   │   ├── test_compressed_tensors_w8an_fp/
│   │   │   │   ├── test_compressed_tensors_w8an.json
│   │   │   │   ├── test_compressed_tensors_w8an_all_params.json
│   │   │   │   └── test_compressed_tensors_w8an_load.json
│   │   │   ├── test_compressed_tensors_wna16_int/
│   │   │   │   ├── test_compressed_tensors_wna16.json
│   │   │   │   ├── test_compressed_tensors_wna16_all_params.json
│   │   │   │   └── test_compressed_tensors_wna16_load.json
│   │   │   ├── test_compressed_tensors_wna16_int_24/
│   │   │   │   ├── test_compressed_tensors_wna16_int_24.json
│   │   │   │   ├── test_compressed_tensors_wna16_int_24_all_params.json
│   │   │   │   └── test_compressed_tensors_wna16_int_24_load.json
│   │   │   ├── test_continue_final_message/
│   │   │   │   ├── test_llama_completion_single_prompt.json
│   │   │   │   └── test_llama_completion_single_prompt_continue.json
│   │   │   ├── test_flash_awq/
│   │   │   │   ├── test_flash_llama_awq.json
│   │   │   │   ├── test_flash_llama_awq_all_params.json
│   │   │   │   └── test_flash_llama_awq_load.json
│   │   │   ├── test_flash_awq_sharded/
│   │   │   │   ├── test_flash_llama_awq_load_sharded.json
│   │   │   │   └── test_flash_llama_awq_sharded.json
│   │   │   ├── test_flash_deepseek_v2/
│   │   │   │   ├── test_flash_deepseek_v2.json
│   │   │   │   ├── test_flash_deepseek_v2_all_params.json
│   │   │   │   └── test_flash_deepseek_v2_load.json
│   │   │   ├── test_flash_falcon/
│   │   │   │   ├── test_flash_falcon.json
│   │   │   │   ├── test_flash_falcon_all_params.json
│   │   │   │   └── test_flash_falcon_load.json
│   │   │   ├── test_flash_gemma/
│   │   │   │   ├── test_flash_gemma_all_params.json
│   │   │   │   ├── test_flash_gemma_load.json
│   │   │   │   └── test_flash_gemma_simple.json
│   │   │   ├── test_flash_gemma2/
│   │   │   │   ├── test_flash_gemma2.json
│   │   │   │   └── test_flash_gemma2_load.json
│   │   │   ├── test_flash_gemma3/
│   │   │   │   ├── test_exceed_window.json
│   │   │   │   ├── test_flash_gemma3.json
│   │   │   │   ├── test_flash_gemma3_image_base64_rgb_jpg.json
│   │   │   │   ├── test_flash_gemma3_image_base64_rgb_png.json
│   │   │   │   ├── test_flash_gemma3_image_base64_rgba.json
│   │   │   │   ├── test_flash_gemma3_image_cow.json
│   │   │   │   └── test_flash_gemma3_image_cow_dog.json
│   │   │   ├── test_flash_gemma_gptq/
│   │   │   │   ├── test_flash_gemma_gptq.json
│   │   │   │   ├── test_flash_gemma_gptq_all_params.json
│   │   │   │   └── test_flash_gemma_gptq_load.json
│   │   │   ├── test_flash_gpt2/
│   │   │   │   ├── test_flash_gpt2.json
│   │   │   │   └── test_flash_gpt2_load.json
│   │   │   ├── test_flash_grammar_llama/
│   │   │   │   ├── test_flash_llama_grammar.json
│   │   │   │   ├── test_flash_llama_grammar_json.json
│   │   │   │   ├── test_flash_llama_grammar_load.json
│   │   │   │   ├── test_flash_llama_grammar_regex.json
│   │   │   │   └── test_flash_llama_grammar_single_load_instance.json
│   │   │   ├── test_flash_llama/
│   │   │   │   ├── test_flash_llama_all_params.json
│   │   │   │   ├── test_flash_llama_load.json
│   │   │   │   └── test_flash_llama_simple.json
│   │   │   ├── test_flash_llama_exl2/
│   │   │   │   ├── test_flash_llama_exl2.json
│   │   │   │   ├── test_flash_llama_exl2_all_params.json
│   │   │   │   └── test_flash_llama_exl2_load.json
│   │   │   ├── test_flash_llama_fp8/
│   │   │   │   ├── test_flash_llama_fp8.json
│   │   │   │   ├── test_flash_llama_fp8_all_params.json
│   │   │   │   └── test_flash_llama_fp8_load.json
│   │   │   ├── test_flash_llama_fp8_kv_cache/
│   │   │   │   ├── test_flash_llama_fp8_kv_cache.json
│   │   │   │   ├── test_flash_llama_fp8_kv_cache_all_params.json
│   │   │   │   └── test_flash_llama_fp8_kv_cache_load.json
│   │   │   ├── test_flash_llama_gptq/
│   │   │   │   ├── test_flash_llama_gptq.json
│   │   │   │   ├── test_flash_llama_gptq_all_params.json
│   │   │   │   └── test_flash_llama_gptq_load.json
│   │   │   ├── test_flash_llama_marlin/
│   │   │   │   ├── test_flash_llama_marlin.json
│   │   │   │   ├── test_flash_llama_marlin_all_params.json
│   │   │   │   └── test_flash_llama_marlin_load.json
│   │   │   ├── test_flash_llama_marlin_24/
│   │   │   │   ├── test_flash_llama_marlin.json
│   │   │   │   ├── test_flash_llama_marlin24_all_params.json
│   │   │   │   └── test_flash_llama_marlin24_load.json
│   │   │   ├── test_flash_llama_prefix/
│   │   │   │   └── test_flash_llama_load.json
│   │   │   ├── test_flash_llama_prefix_flashdecoding/
│   │   │   │   └── test_flash_llama_flashdecoding.json
│   │   │   ├── test_flash_medusa/
│   │   │   │   ├── test_flash_medusa_all_params.json
│   │   │   │   ├── test_flash_medusa_load.json
│   │   │   │   └── test_flash_medusa_simple.json
│   │   │   ├── test_flash_mistral/
│   │   │   │   ├── test_flash_mistral.json
│   │   │   │   ├── test_flash_mistral_all_params.json
│   │   │   │   └── test_flash_mistral_load.json
│   │   │   ├── test_flash_mixtral/
│   │   │   │   ├── test_flash_mixtral.json
│   │   │   │   ├── test_flash_mixtral_all_params.json
│   │   │   │   └── test_flash_mixtral_load.json
│   │   │   ├── test_flash_mixtral_awq/
│   │   │   │   ├── test_flash_mixtral_awq.json
│   │   │   │   ├── test_flash_mixtral_awq_all_params.json
│   │   │   │   └── test_flash_mixtral_awq_load.json
│   │   │   ├── test_flash_mixtral_gptq/
│   │   │   │   ├── test_flash_mixtral_gptq.json
│   │   │   │   ├── test_flash_mixtral_gptq_all_params.json
│   │   │   │   └── test_flash_mixtral_gptq_load.json
│   │   │   ├── test_flash_neox/
│   │   │   │   ├── test_flash_neox.json
│   │   │   │   └── test_flash_neox_load.json
│   │   │   ├── test_flash_neox_sharded/
│   │   │   │   ├── test_flash_neox.json
│   │   │   │   └── test_flash_neox_load.json
│   │   │   ├── test_flash_pali_gemma/
│   │   │   │   ├── test_flash_pali_gemma.json
│   │   │   │   └── test_flash_pali_gemma_two_images.json
│   │   │   ├── test_flash_pali_gemma2/
│   │   │   │   └── test_flash_pali_gemma_image.json
│   │   │   ├── test_flash_phi/
│   │   │   │   ├── test_flash_phi.json
│   │   │   │   ├── test_flash_phi_all_params.json
│   │   │   │   └── test_flash_phi_load.json
│   │   │   ├── test_flash_phi35_moe/
│   │   │   │   ├── test_flash_phi35_moe.json
│   │   │   │   ├── test_flash_phi35_moe_all_params.json
│   │   │   │   └── test_flash_phi35_moe_load.json
│   │   │   ├── test_flash_qwen2/
│   │   │   │   ├── test_flash_qwen2.json
│   │   │   │   ├── test_flash_qwen2_all_params.json
│   │   │   │   └── test_flash_qwen2_load.json
│   │   │   ├── test_flash_qwen2_5_vl/
│   │   │   │   ├── test_flash_qwen2_5_vl_bay.json
│   │   │   │   ├── test_flash_qwen2_5_vl_inpaint.json
│   │   │   │   ├── test_flash_qwen2_5_vl_simple.json
│   │   │   │   └── test_flash_qwen2_5_vl_simple_streaming.json
│   │   │   ├── test_flash_qwen2_vl/
│   │   │   │   ├── test_flash_qwen2_vl_bay.json
│   │   │   │   ├── test_flash_qwen2_vl_inpaint.json
│   │   │   │   ├── test_flash_qwen2_vl_simple.json
│   │   │   │   └── test_flash_qwen2_vl_simple_streaming.json
│   │   │   ├── test_flash_santacoder/
│   │   │   │   ├── test_flash_santacoder.json
│   │   │   │   └── test_flash_santacoder_load.json
│   │   │   ├── test_flash_starcoder/
│   │   │   │   ├── test_flash_starcoder.json
│   │   │   │   ├── test_flash_starcoder_default_params.json
│   │   │   │   └── test_flash_starcoder_load.json
│   │   │   ├── test_flash_starcoder2/
│   │   │   │   ├── test_flash_starcoder2.json
│   │   │   │   ├── test_flash_starcoder2_default_params.json
│   │   │   │   └── test_flash_starcoder2_load.json
│   │   │   ├── test_flash_starcoder2_lora/
│   │   │   │   ├── test_flash_starcoder2.json
│   │   │   │   ├── test_flash_starcoder2_default_params.json
│   │   │   │   ├── test_flash_starcoder2_load.json
│   │   │   │   └── test_flash_starcoder2_with_hugcode_adapter.json
│   │   │   ├── test_flash_starcoder_gptq/
│   │   │   │   ├── test_flash_starcoder_gptq.json
│   │   │   │   ├── test_flash_starcoder_gptq_default_params.json
│   │   │   │   └── test_flash_starcoder_gptq_load.json
│   │   │   ├── test_grammar_llama/
│   │   │   │   └── test_non_flash_llama_grammar_json.json
│   │   │   ├── test_grammar_response_format_llama/
│   │   │   │   ├── test_grammar_response_format_llama_json.1.json
│   │   │   │   ├── test_grammar_response_format_llama_json.2.json
│   │   │   │   └── test_grammar_response_format_llama_json.json
│   │   │   ├── test_idefics/
│   │   │   │   ├── test_idefics.json
│   │   │   │   ├── test_idefics_load.json
│   │   │   │   └── test_idefics_two_images.json
│   │   │   ├── test_idefics2/
│   │   │   │   ├── test_flash_idefics2_next_all_params.json
│   │   │   │   ├── test_flash_idefics2_next_load.json
│   │   │   │   ├── test_flash_idefics2_next_simple.json
│   │   │   │   └── test_flash_idefics2_two_images.json
│   │   │   ├── test_idefics3/
│   │   │   │   └── test_flash_idefics3_next_simple_url.json
│   │   │   ├── test_json_schema_constrain/
│   │   │   │   ├── test_json_schema_basic.json
│   │   │   │   ├── test_json_schema_complex.json
│   │   │   │   └── test_json_schema_stream.json
│   │   │   ├── test_llava_next/
│   │   │   │   ├── test_flash_llava_next_all_params.json
│   │   │   │   ├── test_flash_llava_next_load.json
│   │   │   │   └── test_flash_llava_next_simple.json
│   │   │   ├── test_lora_mistral/
│   │   │   │   ├── test_lora_mistral_with_customer_support_adapter.json
│   │   │   │   ├── test_lora_mistral_with_dbpedia_adapter.json
│   │   │   │   ├── test_lora_mistral_without_adapter.json
│   │   │   │   └── test_lora_mistral_without_customer_support_adapter.json
│   │   │   ├── test_mamba/
│   │   │   │   ├── test_mamba.json
│   │   │   │   ├── test_mamba_all_params.json
│   │   │   │   └── test_mamba_load.json
│   │   │   ├── test_mllama/
│   │   │   │   ├── test_mllama_load.json
│   │   │   │   └── test_mllama_simpl.json
│   │   │   ├── test_mpt/
│   │   │   │   ├── test_mpt.json
│   │   │   │   └── test_mpt_load.json
│   │   │   ├── test_mt0_base/
│   │   │   │   ├── test_mt0_base.json
│   │   │   │   ├── test_mt0_base_all_params.json
│   │   │   │   └── test_mt0_base_load.json
│   │   │   ├── test_neox/
│   │   │   │   ├── test_neox.json
│   │   │   │   └── test_neox_load.json
│   │   │   ├── test_neox_sharded/
│   │   │   │   ├── test_neox.json
│   │   │   │   └── test_neox_load.json
│   │   │   ├── test_server_gptq_quantized/
│   │   │   │   ├── test_server_gptq_quantized.json
│   │   │   │   ├── test_server_gptq_quantized_all_params.json
│   │   │   │   └── test_server_gptq_quantized_load.json
│   │   │   ├── test_smolvlm/
│   │   │   │   └── test_flash_smolvlm_next_simple_url.json
│   │   │   ├── test_t5_sharded/
│   │   │   │   ├── test_t5_sharded.json
│   │   │   │   └── test_t5_sharded_load.json
│   │   │   ├── test_tools_llama/
│   │   │   │   ├── test_flash_llama_grammar_tools_auto_nostream.json
│   │   │   │   ├── test_flash_llama_grammar_tools_choice_nostream.json
│   │   │   │   ├── test_flash_llama_grammar_tools_choice_stream.json
│   │   │   │   ├── test_flash_llama_grammar_tools_insufficient_information_nostream.json
│   │   │   │   ├── test_flash_llama_grammar_tools_insufficient_information_stream.json
│   │   │   │   ├── test_flash_llama_grammar_tools_nostream.json
│   │   │   │   ├── test_flash_llama_grammar_tools_openai.json
│   │   │   │   ├── test_flash_llama_grammar_tools_sea_creatures_stream_auto.json
│   │   │   │   ├── test_flash_llama_grammar_tools_sea_creatures_stream_function_object.json
│   │   │   │   ├── test_flash_llama_grammar_tools_sea_creatures_stream_none.json
│   │   │   │   ├── test_flash_llama_grammar_tools_sea_creatures_stream_required.json
│   │   │   │   └── test_flash_llama_tool_reply_response.json
│   │   │   ├── test_transformers_llama4/
│   │   │   │   ├── test_flash_llama4.json
│   │   │   │   ├── test_flash_llama4_image_base64_rgb_jpg.json
│   │   │   │   ├── test_flash_llama4_image_base64_rgb_png.json
│   │   │   │   ├── test_flash_llama4_image_base64_rgba.json
│   │   │   │   ├── test_flash_llama4_image_cow.json
│   │   │   │   └── test_flash_llama4_image_cow_dog.json
│   │   │   └── test_transformers_olmo/
│   │   │       ├── test_flash_llama_load.json
│   │   │       └── test_flash_llama_simple.json
│   │   ├── test_bloom_560m.py
│   │   ├── test_bloom_560m_sharded.py
│   │   ├── test_chat_llama.py
│   │   ├── test_chat_stream_options.py
│   │   ├── test_completion_prompts.py
│   │   ├── test_compressed_tensors_w8a8_int.py
│   │   ├── test_compressed_tensors_w8a8_int_dynamic_weight.py
│   │   ├── test_compressed_tensors_w8an_fp.py
│   │   ├── test_compressed_tensors_wna16_int.py
│   │   ├── test_compressed_tensors_wna16_int_24.py
│   │   ├── test_continue_final_message.py
│   │   ├── test_flash_awq.py
│   │   ├── test_flash_awq_sharded.py
│   │   ├── test_flash_deepseek_v2.py
│   │   ├── test_flash_falcon.py
│   │   ├── test_flash_gemma.py
│   │   ├── test_flash_gemma2.py
│   │   ├── test_flash_gemma3.py
│   │   ├── test_flash_gemma_gptq.py
│   │   ├── test_flash_gpt2.py
│   │   ├── test_flash_grammar_llama.py
│   │   ├── test_flash_llama.py
│   │   ├── test_flash_llama_exl2.py
│   │   ├── test_flash_llama_fp8.py
│   │   ├── test_flash_llama_fp8_kv_cache.py
│   │   ├── test_flash_llama_gptq.py
│   │   ├── test_flash_llama_marlin.py
│   │   ├── test_flash_llama_marlin_24.py
│   │   ├── test_flash_llama_prefix.py
│   │   ├── test_flash_llama_prefix_flashdecoding.py
│   │   ├── test_flash_medusa.py
│   │   ├── test_flash_mistral.py
│   │   ├── test_flash_mixtral.py
│   │   ├── test_flash_mixtral_awq.py
│   │   ├── test_flash_mixtral_gptq.py
│   │   ├── test_flash_neox.py
│   │   ├── test_flash_neox_sharded.py
│   │   ├── test_flash_pali_gemma.py
│   │   ├── test_flash_pali_gemma2.py
│   │   ├── test_flash_phi.py
│   │   ├── test_flash_phi35_moe.py
│   │   ├── test_flash_qwen2.py
│   │   ├── test_flash_qwen2_5_vl.py
│   │   ├── test_flash_qwen2_vl.py
│   │   ├── test_flash_santacoder.py
│   │   ├── test_flash_starcoder.py
│   │   ├── test_flash_starcoder2.py
│   │   ├── test_flash_starcoder2_lora.py
│   │   ├── test_flash_starcoder_gptq.py
│   │   ├── test_grammar_llama.py
│   │   ├── test_grammar_response_format_llama.py
│   │   ├── test_idefics.py
│   │   ├── test_idefics2.py
│   │   ├── test_idefics3.py
│   │   ├── test_json_schema_constrain.py
│   │   ├── test_llava_next.py
│   │   ├── test_lora_mistral.py
│   │   ├── test_mamba.py
│   │   ├── test_mllama.py
│   │   ├── test_mpt.py
│   │   ├── test_mt0_base.py
│   │   ├── test_neox.py
│   │   ├── test_neox_sharded.py
│   │   ├── test_opt.py
│   │   ├── test_smolvlm.py
│   │   ├── test_t5_sharded.py
│   │   ├── test_tools_llama.py
│   │   ├── test_transformers_llama4.py
│   │   └── test_transformers_olmo.py
│   ├── neuron/
│   │   ├── test_generate.py
│   │   └── test_implicit_env.py
│   ├── pyproject.toml
│   ├── pytest.ini
│   └── requirements.txt
├── launcher/
│   ├── Cargo.toml
│   ├── build.rs
│   └── src/
│       ├── env_runtime.rs
│       ├── gpu.rs
│       └── main.rs
├── load_tests/
│   ├── Makefile
│   ├── benchmarks.py
│   ├── common.js
│   ├── filter.py
│   ├── long.js
│   ├── long.py
│   ├── long_prompt2.py
│   ├── orca.py
│   └── pyproject.toml
├── nix/
│   ├── client.nix
│   ├── crate-overrides.nix
│   ├── docker.nix
│   ├── impure-shell.nix
│   ├── overlay.nix
│   └── server.nix
├── proto/
│   ├── generate.proto
│   └── v3/
│       └── generate.proto
├── router/
│   ├── Cargo.toml
│   ├── README.md
│   ├── build.rs
│   └── src/
│       ├── chat.rs
│       ├── config.rs
│       ├── infer/
│       │   ├── chat_template.rs
│       │   ├── mod.rs
│       │   └── tool_grammar.rs
│       ├── kserve.rs
│       ├── lib.rs
│       ├── logging.rs
│       ├── sagemaker.rs
│       ├── server.rs
│       ├── usage_stats.rs
│       ├── validation.rs
│       └── vertex.rs
├── rust-toolchain.toml
├── sagemaker-entrypoint.sh
├── server/
│   ├── .gitignore
│   ├── Makefile
│   ├── Makefile-awq
│   ├── Makefile-eetq
│   ├── Makefile-exllamav2
│   ├── Makefile-flash-att
│   ├── Makefile-flash-att-v2
│   ├── Makefile-flashinfer
│   ├── Makefile-selective-scan
│   ├── Makefile-vllm
│   ├── README.md
│   ├── bounds-from-nix.py
│   ├── custom_kernels/
│   │   ├── custom_kernels/
│   │   │   ├── fused_attention_cuda.cu
│   │   │   └── fused_bloom_attention_cuda.cu
│   │   └── setup.py
│   ├── exllama_kernels/
│   │   ├── exllama_kernels/
│   │   │   ├── cu_compat.cuh
│   │   │   ├── cuda_buffers.cu
│   │   │   ├── cuda_buffers.cuh
│   │   │   ├── cuda_func/
│   │   │   │   ├── column_remap.cu
│   │   │   │   ├── column_remap.cuh
│   │   │   │   ├── q4_matmul.cu
│   │   │   │   ├── q4_matmul.cuh
│   │   │   │   ├── q4_matrix.cu
│   │   │   │   └── q4_matrix.cuh
│   │   │   ├── exllama_ext.cpp
│   │   │   ├── hip_compat.cuh
│   │   │   ├── matrix.cuh
│   │   │   ├── tuning.h
│   │   │   └── util.cuh
│   │   └── setup.py
│   ├── exllamav2_kernels/
│   │   ├── exllamav2_kernels/
│   │   │   ├── config.h
│   │   │   ├── cpp/
│   │   │   │   └── util.h
│   │   │   ├── cuda/
│   │   │   │   ├── compat.cuh
│   │   │   │   ├── matrix_view.cuh
│   │   │   │   ├── q_gemm.cu
│   │   │   │   ├── q_gemm.cuh
│   │   │   │   ├── q_gemm_kernel.cuh
│   │   │   │   ├── q_gemm_kernel_gptq.cuh
│   │   │   │   ├── q_matrix.cu
│   │   │   │   ├── q_matrix.cuh
│   │   │   │   ├── quant/
│   │   │   │   │   ├── qdq_2.cuh
│   │   │   │   │   ├── qdq_3.cuh
│   │   │   │   │   ├── qdq_4.cuh
│   │   │   │   │   ├── qdq_5.cuh
│   │   │   │   │   ├── qdq_6.cuh
│   │   │   │   │   ├── qdq_8.cuh
│   │   │   │   │   └── qdq_util.cuh
│   │   │   │   └── util.cuh
│   │   │   └── ext.cpp
│   │   └── setup.py
│   ├── pyproject.toml
│   ├── req.txt
│   ├── requirements_cuda.txt
│   ├── requirements_gen.txt
│   ├── requirements_intel.txt
│   ├── requirements_rocm.txt
│   ├── tests/
│   │   ├── conftest.py
│   │   ├── models/
│   │   │   ├── test_bloom.py
│   │   │   ├── test_causal_lm.py
│   │   │   ├── test_model.py
│   │   │   ├── test_santacoder.py
│   │   │   └── test_seq2seq_lm.py
│   │   └── utils/
│   │       ├── test_adapter.py
│   │       ├── test_convert.py
│   │       ├── test_hub.py
│   │       ├── test_layers.py
│   │       ├── test_tokens.py
│   │       ├── test_watermark.py
│   │       └── test_weights.py
│   └── text_generation_server/
│       ├── __init__.py
│       ├── adapters/
│       │   ├── __init__.py
│       │   ├── config.py
│       │   ├── lora.py
│       │   └── weights.py
│       ├── cache.py
│       ├── cli.py
│       ├── interceptor.py
│       ├── layers/
│       │   ├── __init__.py
│       │   ├── attention/
│       │   │   ├── __init__.py
│       │   │   ├── common.py
│       │   │   ├── cuda.py
│       │   │   ├── flash_attn_triton.py
│       │   │   ├── flashinfer.py
│       │   │   ├── ipex.py
│       │   │   ├── kv_cache.py
│       │   │   └── rocm.py
│       │   ├── awq/
│       │   │   ├── conversion_utils.py
│       │   │   └── quantize/
│       │   │       ├── __init__.py
│       │   │       ├── cuda.py
│       │   │       └── ipex.py
│       │   ├── bnb.py
│       │   ├── compressed_tensors/
│       │   │   ├── __init__.py
│       │   │   ├── loader.py
│       │   │   ├── w8a8_int.py
│       │   │   ├── w8an_fp.py
│       │   │   ├── wna16_int.py
│       │   │   └── wna16_int_24.py
│       │   ├── conv.py
│       │   ├── eetq.py
│       │   ├── exl2.py
│       │   ├── fp8.py
│       │   ├── gptq/
│       │   │   ├── __init__.py
│       │   │   ├── custom_autotune.py
│       │   │   ├── exllama.py
│       │   │   ├── exllamav2.py
│       │   │   ├── ipex.py
│       │   │   ├── quantize.py
│       │   │   ├── triton.py
│       │   │   └── utils.py
│       │   ├── layernorm.py
│       │   ├── linear.py
│       │   ├── lora.py
│       │   ├── marlin/
│       │   │   ├── __init__.py
│       │   │   ├── fp8.py
│       │   │   ├── gptq.py
│       │   │   ├── marlin.py
│       │   │   └── util.py
│       │   ├── medusa.py
│       │   ├── mlp.py
│       │   ├── moe/
│       │   │   ├── __init__.py
│       │   │   ├── fp8.py
│       │   │   ├── fused_moe_ipex.py
│       │   │   ├── gptq_marlin.py
│       │   │   └── unquantized.py
│       │   ├── rotary.py
│       │   ├── speculative.py
│       │   └── tensor_parallel.py
│       ├── models/
│       │   ├── __init__.py
│       │   ├── bloom.py
│       │   ├── causal_lm.py
│       │   ├── custom_modeling/
│       │   │   ├── __init__.py
│       │   │   ├── bloom_modeling.py
│       │   │   ├── clip.py
│       │   │   ├── flash_cohere_modeling.py
│       │   │   ├── flash_dbrx_modeling.py
│       │   │   ├── flash_deepseek_v2_modeling.py
│       │   │   ├── flash_deepseek_v3_modeling.py
│       │   │   ├── flash_gemma2_modeling.py
│       │   │   ├── flash_gemma3_modeling.py
│       │   │   ├── flash_gemma_modeling.py
│       │   │   ├── flash_gpt2_modeling.py
│       │   │   ├── flash_gptj_modeling.py
│       │   │   ├── flash_llama_modeling.py
│       │   │   ├── flash_mistral_modeling.py
│       │   │   ├── flash_mixtral_modeling.py
│       │   │   ├── flash_neox_modeling.py
│       │   │   ├── flash_pali_gemma_modeling.py
│       │   │   ├── flash_phi_modeling.py
│       │   │   ├── flash_phi_moe_modeling.py
│       │   │   ├── flash_qwen2_modeling.py
│       │   │   ├── flash_rw_modeling.py
│       │   │   ├── flash_santacoder_modeling.py
│       │   │   ├── flash_starcoder2_modeling.py
│       │   │   ├── gemma3/
│       │   │   │   ├── configuration_gemma3.py
│       │   │   │   ├── image_processing_gemma3.py
│       │   │   │   ├── processing_gemma3.py
│       │   │   │   └── utils.py
│       │   │   ├── idefics2.py
│       │   │   ├── idefics3.py
│       │   │   ├── idefics_config.py
│       │   │   ├── idefics_image_processing.py
│       │   │   ├── idefics_modeling.py
│       │   │   ├── idefics_perceiver.py
│       │   │   ├── idefics_processing.py
│       │   │   ├── idefics_vision.py
│       │   │   ├── llava_next.py
│       │   │   ├── mamba_modeling.py
│       │   │   ├── mllama.py
│       │   │   ├── mpt_modeling.py
│       │   │   ├── neox_modeling.py
│       │   │   ├── opt_modeling.py
│       │   │   ├── phi_modeling.py
│       │   │   ├── qwen2_5_vl.py
│       │   │   ├── qwen2_vl.py
│       │   │   ├── siglip.py
│       │   │   ├── t5_modeling.py
│       │   │   └── vlm.py
│       │   ├── flash_causal_lm.py
│       │   ├── galactica.py
│       │   ├── globals.py
│       │   ├── idefics_causal_lm.py
│       │   ├── mamba.py
│       │   ├── metadata_kernels.py
│       │   ├── mllama_causal_lm.py
│       │   ├── model.py
│       │   ├── seq2seq_lm.py
│       │   ├── transformers_flash_causal_lm.py
│       │   ├── transformers_flash_vlm.py
│       │   ├── types.py
│       │   └── vlm_causal_lm.py
│       ├── pb/
│       │   └── .gitignore
│       ├── server.py
│       ├── tracing.py
│       └── utils/
│           ├── __init__.py
│           ├── adapter.py
│           ├── chunks.py
│           ├── convert.py
│           ├── dist.py
│           ├── hub.py
│           ├── import_utils.py
│           ├── kernels.py
│           ├── log.py
│           ├── logits_process.py
│           ├── merges/
│           │   ├── strategies.py
│           │   └── utils.py
│           ├── peft.py
│           ├── prefill_chunking.py
│           ├── quantization.py
│           ├── segments.py
│           ├── speculate.py
│           ├── tokens.py
│           ├── watermark.py
│           └── weights.py
├── tgi-entrypoint.sh
└── update_doc.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .dockerignore
================================================
aml
target
server/transformers
server/flash-attention
cmake-build-debug/
cmake-build-release/
Dockerfile*


================================================
FILE: .github/ISSUE_TEMPLATE/bug-report.yml
================================================
name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve text-generation-inference
body:
  - type: textarea
    id: system-info
    attributes:
      label: System Info
      description: |
        Please share your system info with us (`text-generation-launcher --env` if installed locally).
        The full command line used that causes issues:
        OS version:
        Rust version (if self-compiling, `cargo version`):
        Model being used (`curl 127.0.0.1:8080/info | jq`):
          If using a local model, please specify the kind of model and/or equivalents.
        Hardware used (GPUs, how many, on which cloud) (`nvidia-smi`):
        Deployment specifics (Kubernetes, EKS, AKS, any particular deployments):
        The current version being used:

      placeholder: text-generation-inference version, platform, python version, ...
    validations:
      required: true

  - type: checkboxes
    id: information-scripts-examples
    attributes:
      label: Information
      description: 'The problem arises when using:'
      options:
        - label: "Docker"
        - label: "The CLI directly"

  - type: checkboxes
    id: information-tasks
    attributes:
      label: Tasks
      description: "The thing I am working on is:"
      options:
        - label: "An officially supported command"
        - label: "My own modifications"

  - type: textarea
    id: reproduction
    validations:
      required: true
    attributes:
      label: Reproduction
      description: |
        Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
        If you have code snippets, error messages, or stack traces, please provide them here as well.
        Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
        Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.

      placeholder: |
        Steps to reproduce the behavior:

          1.
          2.
          3.


  - type: textarea
    id: expected-behavior
    validations:
      required: true
    attributes:
      label: Expected behavior
      description: "A clear and concise description of what you would expect to happen."


================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
blank_issues_enabled: true
version: 2.1


================================================
FILE: .github/ISSUE_TEMPLATE/feature-request.yml
================================================
name: "\U0001F680 Feature request"
description: Submit a proposal/request for a new text-generation-inference feature
labels: [ "feature" ]
body:
  - type: textarea
    id: feature-request
    validations:
      required: true
    attributes:
      label: Feature request
      description: |
        A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist.

  - type: textarea
    id: motivation
    validations:
      required: true
    attributes:
      label: Motivation
      description: |
        Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too.


  - type: textarea
    id: contribution
    validations:
      required: true
    attributes:
      label: Your contribution
      description: |
        Is there any way that you could help, e.g. by submitting a PR? Make sure to read the CONTRIBUTING.MD [readme](https://github.com/huggingface/text-generation-inference/blob/main/CONTRIBUTING.md)


================================================
FILE: .github/ISSUE_TEMPLATE/new-model-addition.yml
================================================
name: "\U0001F31F New model addition"
description: Submit a proposal/request to implement a new model
labels: [ "New model" ]

body:
  - type: textarea
    id: description-request
    validations:
      required: true
    attributes:
      label: Model description
      description: |
        Put any and all important information related to the model

  - type: checkboxes
    id: information-tasks
    attributes:
      label: Open source status
      description: |
          Please note that if the model implementation isn't available or if the weights aren't open-source, we are less likely to implement it in `text-generation-inference`.
      options:
        - label: "The model implementation is available"
        - label: "The model weights are available"

  - type: textarea
    id: additional-info
    attributes:
      label: Provide useful links for the implementation
      description: |
        Please provide information regarding the implementation, the weights, and the authors.
        Please mention the authors by @gh-username if you're aware of their usernames.


================================================
FILE: .github/PULL_REQUEST_TEMPLATE.md
================================================
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet though.

Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution.

Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change.

Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one has reviewed your PR after a week, don't hesitate to post a new comment @-mentioning the same people; sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the [forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the
      [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and
      [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @


@OlivierDehaene OR @Narsil

 -->


================================================
FILE: .github/workflows/autodocs.yaml
================================================
name: Automatic Documentation for Launcher

on:
  pull_request:

jobs:
  update_docs:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v2

    - name: Set up Rust
      uses: actions-rs/toolchain@v1
      with:
        profile: minimal
        toolchain: stable

    - name: Install Protocol Buffers compiler
      run: |
        sudo apt-get update
        sudo apt-get install -y protobuf-compiler libprotobuf-dev

    - name: Install Launcher
      id: install-launcher
      run: cargo install --path launcher/

    - name: Install router
      id: install-router
      run: cargo install --path backends/v3/

    - uses: actions/setup-node@v4
      with:
        node-version: 22

    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'

    - name: Check that documentation is up-to-date
      run: |
        npm install -g @redocly/cli@1.34.2
        python update_doc.py --check


================================================
FILE: .github/workflows/build.yaml
================================================
name: Build and push docker image to internal registry

on:
  workflow_call:
    inputs:
      hardware:
        type: string
        description: Hardware
        # options:
        # - cuda
        # - cuda-trtllm
        # - rocm
        # - intel
        required: true
      release-tests:
        description: "Run release integration tests"
        required: true
        default: false
        type: boolean

jobs:
  build-and-push:
    outputs:
      docker_image: ${{ steps.final.outputs.docker_image }}
      docker_volume: ${{ steps.final.outputs.docker_volume }}
      docker_devices: ${{ steps.final.outputs.docker_devices }}
      runs_on: ${{ steps.final.outputs.runs_on }}
      label_extension: ${{ steps.final.outputs.label_extension }}
      extra_pytest: ${{ steps.final.outputs.extra_pytest }}
    concurrency:
      group: ${{ github.workflow }}-build-and-push-image-${{ inputs.hardware }}-${{ github.head_ref || github.run_id }}
      cancel-in-progress: true
    runs-on:
      group: aws-highmemory-64-plus-priv
    permissions:
      contents: write
      packages: write
      id-token: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Inject slug/short variables
        uses: rlespinasse/github-slug-action@v4.4.1
      - name: Inject required variables for sccache to interact with GitHub Actions Cache
        uses: actions/github-script@v7
        with:
          script: |
            core.exportVariable('ACTIONS_RESULTS_URL', process.env.ACTIONS_RESULTS_URL || '');
            core.exportVariable('ACTIONS_RUNTIME_TOKEN', process.env.ACTIONS_RUNTIME_TOKEN || '');

      - name: Extract TensorRT-LLM version
        run: |
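          # Values appended to $GITHUB_ENV are only visible to later steps, so keep a
          # shell variable as well in order to log the version from within this step.
          TENSORRT_LLM_VERSION="$(grep -oP '([a-z,0-9]{40})' "$GITHUB_WORKSPACE/backends/trtllm/cmake/trtllm.cmake")"
          echo "TENSORRT_LLM_VERSION=${TENSORRT_LLM_VERSION}" >> "$GITHUB_ENV"
          echo "TensorRT-LLM version: ${TENSORRT_LLM_VERSION}"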
      - name: Construct hardware variables
        shell: bash
        run: |
          case ${{ inputs.hardware }} in
            cuda)
                export dockerfile="Dockerfile"
                export label_extension=""
                export docker_volume="/mnt/cache"
                export docker_devices=""
                export runs_on="aws-g6-12xl-plus-priv-cache"
                export platform=""
                export extra_pytest=""
                export target=""
                ;;
            cuda-trtllm)
                export dockerfile="Dockerfile_trtllm"
                export label_extension="-trtllm"
                export docker_volume="/mnt/cache"
                export docker_devices=""
                export runs_on="ubuntu-latest"
                export platform=""
                export extra_pytest=""
                if [[ "${GITHUB_REF}" == refs/tags/* ]]; then
                  export build_type="release";
                  export target="";
                else
                  export build_type="dev";
                  export target="ci-runtime";
                fi
                ;;
            rocm)
                export dockerfile="Dockerfile_amd"
                export label_extension="-rocm"
                export docker_devices="/dev/kfd,/dev/dri"
                export docker_volume="/mnt"
                # This runner was deactivated.
                export runs_on="ubuntu-latest"
                export platform=""
                export extra_pytest="-k test_flash_gemma_gptq_load"
                export target=""
                ;;
            intel-xpu)
                export dockerfile="Dockerfile_intel"
                export label_extension="-intel-xpu"
                export docker_devices=""
                export docker_volume="/mnt/cache"
                export runs_on="ubuntu-latest"
                export platform="xpu"
                export extra_pytest=""
                export target=""
                ;;
            intel-cpu)
                export dockerfile="Dockerfile_intel"
                export label_extension="-intel-cpu"
                export docker_devices="none"
                export docker_volume="/mnt/cache"
                # export runs_on="ubuntu-latest"
                export runs_on="aws-highmemory-32-plus-priv"
                export platform="cpu"
                export extra_pytest="-k test_flash_gemma_simple"
                export target=""
                ;;
            neuron)
                export dockerfile="Dockerfile.neuron"
                export label_extension="-neuron"
                export docker_devices="/dev/neuron0"
                export docker_volume="/mnt/cache"
                export runs_on="aws-inf2-8xlarge"
                export platform="cpu"
                export extra_pytest="--neuron"
                export target=""
                ;;
            gaudi)
                export dockerfile="Dockerfile_gaudi"
                export label_extension="-gaudi"
                export docker_volume="/mnt/cache"
                export docker_devices=""
                export runs_on="itac-bm-emr-gaudi3-dell-2gaudi"
                export platform=""
                export extra_pytest="--gaudi"
                export target=""
                ;;
          esac
          echo $dockerfile
          echo "Dockerfile=${dockerfile}"
          echo $label_extension
          echo $docker_devices
          echo $runs_on
          echo $platform
          echo "DOCKERFILE=${dockerfile}" >> $GITHUB_ENV
          echo "LABEL_EXTENSION=${label_extension}" >> $GITHUB_ENV
          echo "PLATFORM=${platform}" >> $GITHUB_ENV
          echo "DOCKER_VOLUME=${docker_volume}" >> $GITHUB_ENV
          echo "DOCKER_DEVICES=${docker_devices}" >> $GITHUB_ENV
          echo "RUNS_ON=${runs_on}" >> $GITHUB_ENV
          echo "EXTRA_PYTEST=${extra_pytest}" >> $GITHUB_ENV
          echo REGISTRY_MIRROR=$REGISTRY_MIRROR >> $GITHUB_ENV
          echo "TARGET=${target}" >> $GITHUB_ENV
          echo "BUILD_TYPE=${build_type}" >> $GITHUB_ENV
      - name: Initialize Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          install: true
          buildkitd-config: /tmp/buildkitd.toml
      - name: Login to internal Container Registry
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.REGISTRY_USERNAME }}
          password: ${{ secrets.REGISTRY_PASSWORD }}
          registry: registry.internal.huggingface.tech
      - name: Login to GitHub Container Registry
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Login to Docker Hub Container Registry
        uses: docker/login-action@v3
        with:
          registry: docker.io
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}
      - name: Configure AWS credentials
        id: aws-creds
        uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_GITHUB_BUILDX_CACHE }}
          role-duration-seconds: 18000
          aws-region: us-east-1
          output-credentials: true
      # If pull request
      - name: Extract metadata (tags, labels) for Docker
        if: ${{ github.event_name == 'pull_request' }}
        id: meta-pr
        uses: docker/metadata-action@v5
        with:
          images: |
            docker.io/huggingface/text-generation-inference-ci
          tags: |
            type=raw,value=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL_EXTENSION }}
      # If main, release or tag
      - name: Extract metadata (tags, labels) for Docker
        if: ${{ github.event_name != 'pull_request' }}
        id: meta
        uses: docker/metadata-action@v4.3.0
        with:
          flavor: |
            latest=false
          images: |
            registry.internal.huggingface.tech/api-inference/community/text-generation-inference
            ghcr.io/huggingface/text-generation-inference
          tags: |
            type=semver,pattern={{version}}${{ env.LABEL_EXTENSION }}
            type=semver,pattern={{major}}.{{minor}}${{ env.LABEL_EXTENSION }}
            type=raw,value=latest${{ env.LABEL_EXTENSION }},enable=${{ github.ref == format('refs/heads/{0}', github.event.repository.default_branch) }}
            type=raw,value=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL_EXTENSION }}
      - name: Build and push Docker image
        id: build-and-push
        uses: docker/build-push-action@v4
        env:
          DOCKER_BUILD_SUMMARY: false
        with:
          context: .
          file: ${{ env.DOCKERFILE }}
          push: true
          platforms: 'linux/amd64'
          build-args: |
            GIT_SHA=${{ env.GITHUB_SHA }}
            DOCKER_LABEL=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL_EXTENSION }}
            PLATFORM=${{ env.PLATFORM }}
            build_type=${{ env.BUILD_TYPE }}
            sccache_gha_enabled=on
          secrets: |
            actions_results_url=${{ env.ACTIONS_RESULTS_URL }}
            actions_runtime_token=${{ env.ACTIONS_RUNTIME_TOKEN }}
          target: ${{ env.TARGET }}
          tags: ${{ steps.meta.outputs.tags || steps.meta-pr.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels || steps.meta-pr.outputs.labels }}
          cache-from: type=s3,region=us-east-1,bucket=${{ vars.AWS_S3BUCKET_GITHUB_BUILDX_CACHE }},name=text-generation-inference-cache${{ env.LABEL }},access_key_id=${{ steps.aws-creds.outputs.aws-access-key-id }},secret_access_key=${{ steps.aws-creds.outputs.aws-secret-access-key }},session_token=${{ steps.aws-creds.outputs.aws-session-token }},mode=max
          cache-to: type=s3,region=us-east-1,bucket=${{ vars.AWS_S3BUCKET_GITHUB_BUILDX_CACHE }},name=text-generation-inference-cache${{ env.LABEL }},access_key_id=${{ steps.aws-creds.outputs.aws-access-key-id }},secret_access_key=${{ steps.aws-creds.outputs.aws-secret-access-key }},session_token=${{ steps.aws-creds.outputs.aws-session-token }},mode=max
      - name: Final
        id: final
        run: |

          if [ "${{ github.event_name }}" = "pull_request" ]; then
            echo "docker_image=docker.io/huggingface/text-generation-inference-ci:sha-${{ env.GITHUB_SHA_SHORT}}${{ env.LABEL_EXTENSION }}" >> "$GITHUB_OUTPUT"
          else
            echo "docker_image=ghcr.io/huggingface/text-generation-inference:sha-${{ env.GITHUB_SHA_SHORT}}${{ env.LABEL_EXTENSION }}" >> "$GITHUB_OUTPUT"
          fi
          echo "docker_devices=${{ env.DOCKER_DEVICES }}" >> "$GITHUB_OUTPUT"
          echo "docker_volume=${{ env.DOCKER_VOLUME }}" >> "$GITHUB_OUTPUT"
          echo "runs_on=${{ env.RUNS_ON }}" >> "$GITHUB_OUTPUT"
          echo "label_extension=${{ env.LABEL_EXTENSION }}" >> "$GITHUB_OUTPUT"
          echo "extra_pytest=${{ env.EXTRA_PYTEST }}" >> "$GITHUB_OUTPUT"
  precompile_neuron_models:
    concurrency:
      group: ${{ github.workflow }}-${{ github.job }}-${{ needs.build-and-push.outputs.label_extension }}-${{ github.head_ref || github.run_id }}
      cancel-in-progress: true
    needs: build-and-push
    if: needs.build-and-push.outputs.label_extension == '-neuron'
    runs-on:
      group: ${{ needs.build-and-push.outputs.runs_on }}
    env:
      PYTEST_FLAGS: ${{ (startsWith(github.ref, 'refs/tags/') || github.ref == 'refs/heads/main' || inputs.release-tests == true) && '--release' || '--release' }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Inject slug/short variables
        uses: rlespinasse/github-slug-action@v4.4.1
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install
        run: |
          make install-integration-tests
      - name: Export neuron models
        run: |
          export DOCKER_IMAGE=${{ needs.build-and-push.outputs.docker_image }}
          echo $DOCKER_IMAGE
          docker pull $DOCKER_IMAGE
          export HF_TOKEN=${{ secrets.HF_TOKEN_NEURON }}
          python integration-tests/fixtures/neuron/export_models.py
  integration_tests:
    concurrency:
      group: ${{ github.workflow }}-${{ github.job }}-${{ needs.build-and-push.outputs.label_extension }}-${{ github.head_ref || github.run_id }}
      cancel-in-progress: true
    needs: [precompile_neuron_models, build-and-push]
    if: ${{ always() && !contains(needs.*.result, 'failure') && !contains(needs.*.result, 'cancelled') && needs.build-and-push.outputs.runs_on != 'ubuntu-latest' }}
    runs-on:
      group: ${{ needs.build-and-push.outputs.runs_on }}
    env:
      PYTEST_FLAGS: ${{ (startsWith(github.ref, 'refs/tags/') || github.ref == 'refs/heads/main' || inputs.release-tests == true) && '--release' || '--release' }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Inject slug/short variables
        uses: rlespinasse/github-slug-action@v4.4.1
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install
        run: |
          make install-integration-tests
      - name: Run tests
        run: |
          export DOCKER_VOLUME=${{ needs.build-and-push.outputs.docker_volume }}
          export DOCKER_IMAGE=${{ needs.build-and-push.outputs.docker_image }}
          export DOCKER_DEVICES=${{ needs.build-and-push.outputs.docker_devices }}
          export EXTRA_PYTEST="${{ needs.build-and-push.outputs.extra_pytest }}"
          export HF_TOKEN=${{ secrets.HF_TOKEN }}
          echo $DOCKER_IMAGE
          docker pull $DOCKER_IMAGE
          pytest -s -vv integration-tests ${PYTEST_FLAGS} ${EXTRA_PYTEST}

  backend_trtllm_cxx_tests:
    needs: build-and-push
    if: needs.build-and-push.outputs.label_extension == '-trtllm'
    concurrency:
      group: ${{ github.workflow }}-${{ github.job }}-trtllm-${{ github.head_ref || github.run_id }}
      cancel-in-progress: true
    runs-on:
      group: aws-g6-12xl-plus-priv-cache
    container:
      image: ${{ needs.build-and-push.outputs.docker_image }}
      credentials:
        username: ${{ secrets.DOCKERHUB_USERNAME }}
        password: ${{ secrets.DOCKERHUB_PASSWORD }}
      options: --gpus all --shm-size=8g

    steps:
      - name: Run C++/CUDA tests
        if: ${{ env.LABEL_EXTENSION == 'ci-runtime' }}
        run: /usr/local/tgi/bin/tgi_trtllm_backend_tests


================================================
FILE: .github/workflows/build_documentation.yaml
================================================
name: Build documentation

on:
  push:
    paths:
      - "docs/source/**"
    branches:
      - main
      - doc-builder*
      - v*-release

jobs:
   build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: text-generation-inference
      additional_args: --not_python_module
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}


================================================
FILE: .github/workflows/build_pr_documentation.yaml
================================================
name: Build PR Documentation

on:
  pull_request:
    paths:
      - "docs/source/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: text-generation-inference
      additional_args: --not_python_module


================================================
FILE: .github/workflows/ci_build.yaml
================================================
name: CI build

on:
  push:
    branches:
      - 'main'
    tags:
      - 'v*'
  pull_request:
    paths:
      - ".github/workflows/build.yaml"
      - "integration-tests/**"
      - "backends/**"
      - "server/**"
      - "proto/**"
      - "router/**"
      - "launcher/**"
      - "Cargo.lock"
      - "rust-toolchain.toml"
      - "Dockerfile"
      - "Dockerfile_amd"
      - "Dockerfile_intel"
      - "Dockerfile.neuron"
      - "Dockerfile_gaudi"
    branches:
      - "main"
  workflow_dispatch:
    inputs:
      release-tests:
        description: "Run release integration tests"
        required: true
        default: false
        type: boolean

jobs:
  build:
    strategy:
      # super important if you want to see all results, even if one fails
      # fail-fast is true by default
      fail-fast: false
      matrix:
        hardware: ["cuda", "cuda-trtllm", "rocm", "intel-xpu", "intel-cpu", "neuron", "gaudi"]
    uses: ./.github/workflows/build.yaml # calls the one above ^
    permissions:
      contents: write
      packages: write
      id-token: write
    with:
      hardware: ${{ matrix.hardware }}
      # https://github.com/actions/runner/issues/2206
      release-tests: ${{ inputs.release-tests == true }}
    secrets: inherit


================================================
FILE: .github/workflows/client-tests.yaml
================================================
name: Python Client Tests

on:
  pull_request:
    paths:
      - ".github/workflows/client-tests.yaml"
      - "clients/python/**"

jobs:
  run_tests:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v1
        with:
          python-version: 3.9
      - name: Install
        run: |
          cd clients/python && pip install .
      - name: Run tests
        run: |
          pip install pytest pytest-asyncio
          export HF_TOKEN=${{ secrets.HF_TOKEN }}
          make python-client-tests


================================================
FILE: .github/workflows/codeql.yml
================================================
---
name: CodeQL Security Analysis For Github Actions

on:
  push:
    branches: ["main"]
  workflow_dispatch:
  # pull_request:

jobs:
  codeql:
    name: CodeQL Analysis
    uses: huggingface/security-workflows/.github/workflows/codeql-reusable.yml@v1.2.0
    permissions:
      security-events: write
      packages: read
      actions: read
      contents: read
    with:
      languages: '["actions"]'
      queries: 'security-extended,security-and-quality'
      runner: 'ubuntu-latest' #optional if need custom runner
      use-runner-group: false #optional

      # if need to use runner group:
      # runner: 'cpu-low'
      # use-runner-group: true


================================================
FILE: .github/workflows/integration_tests.yaml
================================================
name: Integration tests

on:
  workflow_call:
    inputs:
      docker_image:
        type: string
        description: Docker image to run the integration tests against
        required: true
      docker_devices:
        type: string
        description: Devices passed to the integration tests as DOCKER_DEVICES
      runs_on:
        type: string
        required: true
        description: Hardware to run integration tests
jobs:
  integration_tests:
    concurrency:
      group: ${{ github.workflow }}-${{ github.job }}-${{ github.head_ref || github.run_id }}
      cancel-in-progress: true
    runs-on: ${{ inputs.runs_on }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Inject slug/short variables
        uses: rlespinasse/github-slug-action@v4.4.1
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9
      - name: Install
        run: |
          make install-integration-tests
      - name: Run tests
        run: |
          export DOCKER_VOLUME=/mnt/cache
          export DOCKER_IMAGE=${{ inputs.docker_image }}
          export DOCKER_DEVICES=${{ inputs.docker_devices }}
          export HF_TOKEN=${{ secrets.HF_TOKEN }}
          pytest -s -vv integration-tests


================================================
FILE: .github/workflows/load_test.yaml
================================================
name: Nightly load test

on:
  schedule:
    - cron: '0 0 * * 1-5'
  workflow_call:
  workflow_dispatch:

  pull_request:
    paths:
      - ".github/workflows/load_test.yaml"

env:
  AWS_DEFAULT_REGION: us-east-1
  AWS_ACCESS_KEY_ID: ${{ secrets.S3_AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.S3_AWS_SECRET_ACCESS_KEY }}

jobs:
  load-tests:
    concurrency:
      group: ${{ github.workflow }}-${{ github.job }}-${{ github.head_ref || github.run_id }}
      cancel-in-progress: true
    runs-on:
      group: aws-g6-12xl-plus-priv-cache
    env:
      DOCKER_VOLUME: /cache
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Install Python 3.11
        uses: actions/setup-python@v2
        with:
          python-version: 3.11

      - name: Install poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          export PATH="$HOME/.local/bin:$PATH"
          poetry --version

      - name: Run bench test
        run: |
          export PATH="$HOME/.local/bin:$PATH"
          cd load_tests
          poetry install
          poetry run python benchmarks.py --sha ${{ github.sha }} --results-file "s3://text-generation-inference-ci/benchmarks/ci/${{ github.sha }}.parquet"
        shell: bash
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN_BENCHMARK }}


================================================
FILE: .github/workflows/nix_build.yaml
================================================
name: "Nix Build Docker image"
on:
  pull_request:
  push:
    branches:
      - 'main'
    tags:
      - 'v*'
concurrency:
  group: nix-image-${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build_nix_image:
    runs-on:
      group: aws-highmemory-32-plus-priv
    steps:
    - uses: actions/checkout@v4
    - uses: cachix/install-nix-action@v27
      with:
        nix_path: nixpkgs=channel:nixos-unstable
    - uses: cachix/cachix-action@v14
      with:
        name: huggingface
        # If you chose signing key for write access
        # authToken: '${{ secrets.CACHIX_AUTH_TOKEN }}'
      env:
        USER: github_runner
    - name: Build
      run: nix build .#dockerImage
    - name: Initialize Docker Buildx
      uses: docker/setup-buildx-action@v3
      with:
        install: true
        buildkitd-config: /tmp/buildkitd.toml
    - name: Inject slug/short variables
      uses: rlespinasse/github-slug-action@v4.4.1
    - name: Login to internal Container Registry
      # if: github.event_name != 'pull_request'
      uses: docker/login-action@v3
      with:
        username: ${{ secrets.REGISTRY_USERNAME }}
        password: ${{ secrets.REGISTRY_PASSWORD }}
        registry: registry.internal.huggingface.tech
    - name: Push to docker
      run: |
        if [ "${{ github.event_name }}" = "pull_request" ]; then
          export TAG=nix-sha-${{ env.GITHUB_SHA_SHORT }}
        else
          export TAG=${{ github.ref_name }}-nix
        fi
        export IMAGE=registry.internal.huggingface.tech/api-inference/community/text-generation-inference:$TAG
        nix-shell -p skopeo --command "skopeo --insecure-policy copy docker-archive:$(readlink -f ./result) docker://$IMAGE --dest-compress-format zstd"


================================================
FILE: .github/workflows/nix_cache.yaml
================================================
name: "Cache devshells"
on:
  pull_request:
    paths:
      - "flake.nix"
      - "flake.lock"
      - "nix/**"
concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  tests:
    runs-on:
      group: aws-highmemory-32-plus-priv
    steps:
      - uses: actions/checkout@v4
      - uses: cachix/install-nix-action@v27
        with:
          nix_path: nixpkgs=channel:nixos-unstable
      - uses: cachix/cachix-action@v14
        with:
          name: huggingface
          # If you chose signing key for write access
          #authToken: "${{ secrets.CACHIX_AUTH_TOKEN }}"
        env:
          USER: github_runner
      - name: Build impure devshell
        run: nix build .\#devShells.x86_64-linux.impure
      - name: Build impure devshell (CUDA dev)
        run: nix build .\#devShells.x86_64-linux.impureWithCuda
      # Pure shell dependencies are covered by Nix tests.
      # - name: Build pure devshell
      #   run: nix build .\#devShells.x86_64-linux.pure


================================================
FILE: .github/workflows/nix_tests.yaml
================================================
name: "Nix Tests"
on:
  pull_request:
    paths:
      - ".github/workflows/nix_tests.yaml"
      - "server/**"
      - "proto/**"
      - "router/**"
      - "launcher/**"
      - "backends/**"
      - "Cargo.lock"
      - "rust-toolchain.toml"
concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  tests:
    runs-on:
      group: aws-highmemory-32-plus-priv
    steps:
    - uses: actions/checkout@v4
    - uses: cachix/install-nix-action@v27
      with:
        nix_path: nixpkgs=channel:nixos-unstable
    - uses: cachix/cachix-action@v14
      with:
        name: huggingface
        # If you chose signing key for write access
        #authToken: '${{ secrets.CACHIX_AUTH_TOKEN }}'
      env:
        USER: github_runner
    - name: Nix info
      run: nix-shell -p nix-info --run "nix-info -m"
    - name: Build
      run: nix develop .#test --command echo "Ok"
    - name: Pre-commit tests.
      run: nix develop .#test --command pre-commit run --all-files
    - name: Python tests.
      run: nix develop .#test --command python -m pytest server/tests/
      env:
        HF_TOKEN: ${{ secrets.HF_TOKEN }}
    - name: Rust tests.
      run: nix develop .#test --command cargo test


================================================
FILE: .github/workflows/stale.yaml
================================================
name: 'Close stale issues and PRs'
on:
  schedule:
    - cron: '30 1 * * *'

jobs:
  stale:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/stale@v8
        with:
          stale-issue-message: 'This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.'
          days-before-stale: 30
          days-before-close: 5


================================================
FILE: .github/workflows/tests.yaml
================================================
name: Server Tests

on:
  pull_request:
    paths:
      - ".github/workflows/tests.yaml"
      - "server/**"
      - "proto/**"
      - "router/**"
      - "launcher/**"
      - "backends/**"
      - "Cargo.lock"
      - "rust-toolchain.toml"

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  run_tests:
    runs-on:
      group: aws-highmemory-32-plus-priv
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        id: python
        with:
          python-version: 3.11
      - uses: dtolnay/rust-toolchain@1.85.0
        with:
          components: rustfmt, clippy
      - name: Install Protoc
        uses: arduino/setup-protoc@v1
      - name: Clean unused files
        run: |
          sudo rm -rf /usr/local/lib/android # will release about 10 GB if you don't need Android
          sudo rm -rf /usr/share/dotnet # will release about 20GB if you don't need .NET
      - name: Install
        run: |
          sudo apt update
          sudo apt install python3.11-dev -y
          pip install -U pip uv
          uv venv
          source ./.venv/bin/activate
          make install-cpu
      - name: Download locked kernels
        run: |
          source ./.venv/bin/activate
          kernels download server
      - name: Run server tests
        run: |
          source ./.venv/bin/activate
          uv pip install pytest
          export HF_TOKEN=${{ secrets.HF_TOKEN }}
          pytest -s -vv server/tests
      - name: Pre-commit checks
        run: |
          pip install pre-commit
          pre-commit install
          pre-commit run --all-files
      - name: Run Rust tests
        run: |
          cargo test
      - name: Run Rust tests with google feature
        run: |
          cargo test --features google


================================================
FILE: .github/workflows/trufflehog.yaml
================================================
on:
  push:

name: Secret Leaks

permissions:
  contents: read

jobs:
  trufflehog:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Secret Scanning
        uses: trufflesecurity/trufflehog@853e1e8d249fd1e29d0fcc7280d29b03df3d643d
        with:
          # exclude buggy postgres detector that is causing false positives and not relevant to our codebase
          extra_args: --results=verified,unknown --exclude-detectors=postgres


================================================
FILE: .github/workflows/upload_pr_documentation.yaml
================================================
name: Upload PR Documentation

on:
  workflow_run:
    workflows: ["Build PR Documentation"]
    types:
      - completed

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
    with:
      package_name: text-generation-inference
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}


================================================
FILE: .gitignore
================================================
.idea
target
router/tokenizer.json
*__pycache__*

backends/v2/src/client/pb
backends/v3/src/client/pb
backends/client/src/v2/pb
backends/client/src/v3/pb

# ROCm auto-generated files
*.hip
server/exllamav2
server/exllama_kernels/exllama_kernels/hip/
server/exllama_kernels/exllama_kernels/hip_func/
*_hip.cuh
server/exllama_kernels/exllama_kernels/hip_buffers.cuh
server/exllama_kernels/exllama_kernels/exllama_ext_hip.cpp

data/
load_tests/*.json
server/fbgemm

.direnv/
.venv/

# Gaudi auto-generated files
hl-smi_log*.txt
.graph_dumps
out
hqt_output


================================================
FILE: .pre-commit-config.yaml
================================================
repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
    -   id: check-yaml
    -   id: end-of-file-fixer
        exclude: crate-hashes.json
    -   id: trailing-whitespace
        exclude: docs/source/reference/launcher.md
-   repo: https://github.com/psf/black
    rev: 24.2.0
    hooks:
    -   id: black
-   repo: https://github.com/doublify/pre-commit-rust
    rev: v1.0
    hooks:
    -   id: cargo-check
    -   id: fmt
    -   id: clippy
-   repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.3.0
    hooks:
    -   id: ruff
        args: [--fix, --exit-non-zero-on-fix]


================================================
FILE: .redocly.lint-ignore.yaml
================================================
# This file instructs Redocly's linter to ignore the rules contained for specific parts of your API.
# See https://redoc.ly/docs/cli/ for more information.
docs/openapi.json:
  no-empty-servers:
    - '#/openapi'
  spec:
    - >-
      #/components/schemas/GenerateParameters/properties/best_of/exclusiveMinimum
    - >-
      #/components/schemas/GenerateParameters/properties/frequency_penalty/exclusiveMinimum
    - '#/components/schemas/GenerateParameters/properties/grammar/nullable'
    - >-
      #/components/schemas/GenerateParameters/properties/repetition_penalty/exclusiveMinimum
    - '#/components/schemas/GenerateParameters/properties/seed/exclusiveMinimum'
    - >-
      #/components/schemas/GenerateParameters/properties/temperature/exclusiveMinimum
    - '#/components/schemas/GenerateParameters/properties/top_k/exclusiveMinimum'
    - >-
      #/components/schemas/GenerateParameters/properties/top_n_tokens/exclusiveMinimum
    - '#/components/schemas/GenerateParameters/properties/top_p/exclusiveMinimum'
    - >-
      #/components/schemas/GenerateParameters/properties/typical_p/exclusiveMinimum
    - '#/components/schemas/GenerateResponse/properties/details/nullable'
    - '#/components/schemas/StreamResponse/properties/details/nullable'
    - '#/components/schemas/ChatRequest/properties/response_format/nullable'
    - '#/components/schemas/ChatRequest/properties/stream_options/nullable'
    - '#/components/schemas/ChatRequest/properties/tool_choice/nullable'
    - '#/components/schemas/ToolChoice/nullable'
    - '#/components/schemas/ChatCompletionComplete/properties/logprobs/nullable'
    - '#/components/schemas/ChatCompletionChunk/properties/usage/nullable'
    - '#/components/schemas/ChatCompletionChoice/properties/logprobs/nullable'
  no-invalid-media-type-examples:
    - '#/paths/~1/post/responses/422/content/application~1json/example'
    - '#/paths/~1/post/responses/424/content/application~1json/example'
    - '#/paths/~1/post/responses/429/content/application~1json/example'
    - '#/paths/~1/post/responses/500/content/application~1json/example'
    - '#/paths/~1generate/post/responses/422/content/application~1json/example'
    - '#/paths/~1generate/post/responses/424/content/application~1json/example'
    - '#/paths/~1generate/post/responses/429/content/application~1json/example'
    - '#/paths/~1generate/post/responses/500/content/application~1json/example'
    - >-
      #/paths/~1generate_stream/post/responses/422/content/text~1event-stream/example
    - >-
      #/paths/~1generate_stream/post/responses/424/content/text~1event-stream/example
    - >-
      #/paths/~1generate_stream/post/responses/429/content/text~1event-stream/example
    - >-
      #/paths/~1generate_stream/post/responses/500/content/text~1event-stream/example
    - '#/paths/~1tokenize/post/responses/404/content/application~1json/example'
    - >-
      #/paths/~1v1~1chat~1completions/post/responses/422/content/application~1json/example
    - >-
      #/paths/~1v1~1chat~1completions/post/responses/424/content/application~1json/example
    - >-
      #/paths/~1v1~1chat~1completions/post/responses/429/content/application~1json/example
    - >-
      #/paths/~1v1~1chat~1completions/post/responses/500/content/application~1json/example
    - >-
      #/paths/~1v1~1completions/post/responses/422/content/application~1json/example
    - >-
      #/paths/~1v1~1completions/post/responses/424/content/application~1json/example
    - >-
      #/paths/~1v1~1completions/post/responses/429/content/application~1json/example
    - >-
      #/paths/~1v1~1completions/post/responses/500/content/application~1json/example
  operation-4xx-response:
    - '#/paths/~1health/get/responses'
    - '#/paths/~1info/get/responses'
    - '#/paths/~1metrics/get/responses'
  no-unused-components:
    - '#/components/schemas/Completion'
  security-defined:
    - '#/paths/~1/post'
    - '#/paths/~1generate/post'
    - '#/paths/~1generate_stream/post'
    - '#/paths/~1health/get'
    - '#/paths/~1info/get'
    - '#/paths/~1metrics/get'
    - '#/paths/~1tokenize/post'
    - '#/paths/~1v1~1chat~1completions/post'
    - '#/paths/~1v1~1completions/post'
    - '#/paths/~1v1~1models/get'


================================================
FILE: CODE_OF_CONDUCT.md
================================================

# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, caste, color, religion, or sexual
identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
  and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall
  community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or advances of
  any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address,
  without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
feedback@huggingface.co.
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series of
actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or permanent
ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within the
community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.1, available at
[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].

Community Impact Guidelines were inspired by
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].

For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
[https://www.contributor-covenant.org/translations][translations].

[homepage]: https://www.contributor-covenant.org
[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
[Mozilla CoC]: https://github.com/mozilla/diversity
[FAQ]: https://www.contributor-covenant.org/faq
[translations]: https://www.contributor-covenant.org/translations


================================================
FILE: CONTRIBUTING.md
================================================
<!---
Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Contribute to text-generation-inference

Everyone is welcome to contribute, and we value everybody's contribution. Code
contributions are not the only way to help the community. Answering questions, helping
others, and improving the documentation are also immensely valuable.

It also helps us if you spread the word! Reference the library in blog posts
about the awesome projects it made possible, shout out on Twitter every time it has
helped you, or simply ⭐️ the repository to say thank you.

However you choose to contribute, please be mindful and respect our
[code of conduct](https://github.com/huggingface/text-generation-inference/blob/main/CODE_OF_CONDUCT.md).

**This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md).**

## Ways to contribute

There are several ways you can contribute to text-generation-inference.

* Fix outstanding issues with the existing code.
* Submit issues related to bugs or desired new features.
* Contribute to the examples or to the documentation.

> All contributions are equally valuable to the community. 🥰

## Fixing outstanding issues

If you notice an issue with the existing code and have a fix in mind, feel free to [start contributing](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) and open
a Pull Request!

## Submitting a bug-related issue or feature request

Do your best to follow these guidelines when submitting a bug-related issue or a feature
request. It will make it easier for us to come back to you quickly and with good
feedback.

### Did you find a bug?

The text-generation-inference library is robust and reliable thanks to users who report the problems they encounter.

Before you report an issue, we would really appreciate it if you could **make sure the bug was not
already reported** (use the search bar on GitHub under Issues). Your issue should also be related to bugs in the
library itself, and not your code.

Once you've confirmed the bug hasn't already been reported, please include the following information in your issue so
we can quickly resolve it:

* Your **OS type and version**, as well as your environment versions (versions of Rust, Python, and dependencies).
* A short, self-contained code snippet that allows us to reproduce the bug.
* The *full* traceback if an exception is raised.
* Attach any other additional information, like screenshots, you think may help.

To get the OS and software versions automatically, you can re-run the launcher with the `--env` flag:

```bash
text-generation-launcher --env
```

This prints information about your environment before the model launches. We recommend pasting
that output into your issue report.
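
A reproduction snippet of the kind requested in the list above could look like the following sketch. It uses only the Python standard library and targets the `/generate` route. The address `http://127.0.0.1:8080`, the helper names `build_generate_request` and `reproduce`, and the example prompt are assumptions for illustration; adjust them to match your deployment.

```python
import json
import urllib.request

# Assumption: a TGI server is already running locally on this address.
BASE_URL = "http://127.0.0.1:8080"


def build_generate_request(prompt: str, max_new_tokens: int = 20) -> urllib.request.Request:
    """Build a POST request for the `/generate` route without sending it."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reproduce(prompt: str) -> str:
    """Send the request and return the raw response body for the bug report."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

With a server running, calling `print(reproduce("What is deep learning?"))` prints the raw response, which can be pasted into the issue together with the exact request parameters used.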

### Do you want a new feature?

If there is a new feature you'd like to see in text-generation-inference, please open an issue and describe:

1. What is the *motivation* behind this feature? Is it related to a problem or frustration with the library? Is it
   a feature related to something you need for a project? Is it something you worked on and think it could benefit
   the community?

   Whatever it is, we'd love to hear about it!

2. Describe your requested feature in as much detail as possible. The more you can tell us about it, the better
   we'll be able to help you.
3. Provide a *code snippet* that demonstrates the feature's usage.
4. If the feature is related to a paper, please include a link.

If your issue is well written, we're already 80% of the way there by the time you create it.

We have added [templates](https://github.com/huggingface/text-generation-inference/tree/main/.github/ISSUE_TEMPLATE)
to help you get started with your issue.

## Do you want to implement a new model?

New models are released constantly. If you want to implement one, please provide the following information:

* A short description of the model and a link to the paper.
* Link to the implementation if it is open-sourced.
* Link to the model weights if they are available.

If you are willing to contribute the model yourself, let us know so we can help you add it to text-generation-inference!

## Do you want to add documentation?

We're always looking for improvements that make the documentation clearer and more accurate. Please let us know
how it can be improved, for example by pointing out typos or content that is missing, unclear, or inaccurate. We'll be
happy to make the changes or help you make a contribution if you're interested!

## I want to become a maintainer of the project. How do I get there?

TGI is a project led and managed by Hugging Face as it powers our internal services. However, we are happy to have
motivated individuals from other organizations join us as maintainers with the goal of making TGI the best inference
service.

If you are such an individual (or organization), please reach out to us and let's collaborate.


================================================
FILE: Cargo.toml
================================================
[workspace]
members = [
    "benchmark",
    "backends/v2",
    "backends/v3",
    "backends/grpc-metadata",
    "backends/trtllm",
    "backends/llamacpp",
    "launcher",
    "router"
]
default-members = [
    "benchmark",
    "backends/v2",
    "backends/v3",
    "backends/grpc-metadata",
    # "backends/trtllm",
    "launcher",
    "router"
]
resolver = "2"

[workspace.package]
version = "3.3.6-dev0"
edition = "2021"
authors = ["Olivier Dehaene"]
homepage = "https://github.com/huggingface/text-generation-inference"

[workspace.dependencies]
base64 = "0.22.0"
tokenizers = { version = "0.20.0", features = ["http"] }
hf-hub = { version = "0.4.2", features = ["tokio"] }
metrics = { version = "0.23.0" }
metrics-exporter-prometheus = { version = "0.15.1", features = [] }
minijinja = { version = "2.2.0", features = ["json"] }
minijinja-contrib = { version = "2.0.2", features = ["pycompat"] }
pyo3 = { version = "0.22.2", features = ["auto-initialize"] }

[profile.release]
incremental = true

[profile.release-binary]
inherits = "release"
debug = 1
incremental = true
panic = "abort"

[profile.release-opt]
inherits = "release"
debug = 0
incremental = false
lto = "fat"
opt-level = 3
codegen-units = 1


================================================
FILE: Dockerfile
================================================
# Rust builder
FROM lukemathwalker/cargo-chef:latest-rust-1.85.1 AS chef
WORKDIR /usr/src

ARG CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse

FROM chef AS planner
COPY Cargo.lock Cargo.lock
COPY Cargo.toml Cargo.toml
COPY rust-toolchain.toml rust-toolchain.toml
COPY proto proto
COPY benchmark benchmark
COPY router router
COPY backends backends
COPY launcher launcher

RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    python3.11-dev
RUN PROTOC_ZIP=protoc-21.12-linux-x86_64.zip && \
    curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP && \
    unzip -o $PROTOC_ZIP -d /usr/local bin/protoc && \
    unzip -o $PROTOC_ZIP -d /usr/local 'include/*' && \
    rm -f $PROTOC_ZIP

COPY --from=planner /usr/src/recipe.json recipe.json
RUN cargo chef cook --profile release-opt --recipe-path recipe.json

ARG GIT_SHA
ARG DOCKER_LABEL

COPY Cargo.lock Cargo.lock
COPY Cargo.toml Cargo.toml
COPY rust-toolchain.toml rust-toolchain.toml
COPY proto proto
COPY benchmark benchmark
COPY router router
COPY backends backends
COPY launcher launcher
RUN cargo build --profile release-opt --frozen

# Python builder
# Adapted from: https://github.com/pytorch/pytorch/blob/master/Dockerfile
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS pytorch-install
WORKDIR /usr/src/

# NOTE: When updating the PyTorch version, remember to remove the `uv pip install nvidia-nccl-cu12==2.25.1` step further down in this Dockerfile. Context: https://github.com/huggingface/text-generation-inference/pull/2099
ARG PYTORCH_VERSION=2.7
ARG PYTHON_VERSION=3.11

# Keep in sync with `server/pyproject.toml`
# Automatically set by buildx
ARG TARGETPLATFORM

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        ccache \
        curl \
        git && \
        rm -rf /var/lib/apt/lists/*
COPY --from=ghcr.io/astral-sh/uv:0.5.31 /uv /uvx /bin/
ENV PATH="$PATH:/root/.local/bin"
RUN uv python install ${PYTHON_VERSION}
RUN uv venv --python ${PYTHON_VERSION} && uv pip install torch==${PYTORCH_VERSION} torchvision pip setuptools packaging
ENV VIRTUAL_ENV=/usr/src/.venv/
ENV PATH="$PATH:/usr/src/.venv/bin/"

# CUDA kernels builder image
FROM pytorch-install AS kernel-builder

ARG MAX_JOBS=8
ENV TORCH_CUDA_ARCH_LIST="8.0;8.6;9.0+PTX"

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        ninja-build cmake \
        && rm -rf /var/lib/apt/lists/*

# Build Flash Attention CUDA kernels
FROM kernel-builder AS flash-att-builder

WORKDIR /usr/src

COPY server/Makefile-flash-att Makefile

# Build specific version of flash attention
RUN . .venv/bin/activate && make build-flash-attention

# Build Flash Attention v2 CUDA kernels
FROM kernel-builder AS flash-att-v2-builder

WORKDIR /usr/src

COPY server/Makefile-flash-att-v2 Makefile

# Build specific version of flash attention v2
RUN . .venv/bin/activate && make build-flash-attention-v2-cuda

# Build Transformers exllama kernels
FROM kernel-builder AS exllama-kernels-builder
WORKDIR /usr/src
COPY server/exllama_kernels/ .

RUN . .venv/bin/activate && python setup.py build

# Build Transformers exllamav2 kernels
FROM kernel-builder AS exllamav2-kernels-builder
WORKDIR /usr/src
COPY server/Makefile-exllamav2/ Makefile

# Build specific version of exllamav2
RUN . .venv/bin/activate && make build-exllamav2

# Build Transformers awq kernels
FROM kernel-builder AS awq-kernels-builder
WORKDIR /usr/src
COPY server/Makefile-awq Makefile
# Build specific version of awq
RUN . .venv/bin/activate && make build-awq

# Build Transformers CUDA kernels
FROM kernel-builder AS custom-kernels-builder
WORKDIR /usr/src
COPY server/custom_kernels/ .
# Build the custom kernels
RUN . .venv/bin/activate && python setup.py build

# Build mamba kernels
FROM kernel-builder AS mamba-builder
WORKDIR /usr/src
COPY server/Makefile-selective-scan Makefile
RUN . .venv/bin/activate && make build-all

# Build flashinfer
FROM kernel-builder AS flashinfer-builder
WORKDIR /usr/src
COPY server/Makefile-flashinfer Makefile
RUN . .venv/bin/activate && make install-flashinfer

# Text Generation Inference base image
FROM nvidia/cuda:12.4.0-base-ubuntu22.04 AS base

# Text Generation Inference base env
ENV HF_HOME=/data \
    HF_HUB_ENABLE_HF_TRANSFER=1 \
    PORT=80

WORKDIR /usr/src

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        libssl-dev \
        ca-certificates \
        make \
        curl \
        git \
        && rm -rf /var/lib/apt/lists/*

# RUN curl -LsSf https://astral.sh/uv/install.sh | sh
# ENV PATH="$PATH:/root/.local/bin"
COPY --from=ghcr.io/astral-sh/uv:0.5.31 /uv /uvx /bin/
# Install flash-attention dependencies
# RUN pip install einops --no-cache-dir

# Copy env with PyTorch installed
COPY --from=pytorch-install /usr/src/.venv /usr/src/.venv
ENV PYTHON_VERSION=3.11
RUN uv python install ${PYTHON_VERSION}
ENV VIRTUAL_ENV=/usr/src/.venv/
ENV PATH="$PATH:/usr/src/.venv/bin/"

# Install server
COPY proto proto
COPY server server
COPY server/Makefile server/Makefile
ENV HF_KERNELS_CACHE=/kernels
RUN cd server && \
	uv sync --frozen --extra gen --extra bnb --extra accelerate --extra compressed-tensors --extra quantize --extra peft --extra outlines --extra torch --no-install-project --active && \
    make gen-server-raw && \
    kernels download .

RUN cd server && \
    uv sync --frozen --extra gen --extra bnb --extra accelerate --extra compressed-tensors --extra quantize --extra peft --extra outlines --extra torch --active --python=${PYTHON_VERSION} && \
    uv pip install nvidia-nccl-cu12==2.25.1 && \
    pwd && \
    text-generation-server --help

# Copy build artifacts from flash attention builder
COPY --from=flash-att-builder /usr/src/flash-attention/build/lib.linux-x86_64-cpython-311 /usr/src/.venv/lib/python3.11/site-packages
COPY --from=flash-att-builder /usr/src/flash-attention/csrc/layer_norm/build/lib.linux-x86_64-cpython-311 /usr/src/.venv/lib/python3.11/site-packages
COPY --from=flash-att-builder /usr/src/flash-attention/csrc/rotary/build/lib.linux-x86_64-cpython-311 /usr/src/.venv/lib/python3.11/site-packages

# Copy build artifacts from flash attention v2 builder
COPY --from=flash-att-v2-builder /usr/src/.venv/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so /usr/src/.venv/lib/python3.11/site-packages

# Copy build artifacts from custom kernels builder
COPY --from=custom-kernels-builder /usr/src/build/lib.linux-x86_64-cpython-311 /usr/src/.venv/lib/python3.11/site-packages
# Copy build artifacts from exllama kernels builder
COPY --from=exllama-kernels-builder /usr/src/build/lib.linux-x86_64-cpython-311 /usr/src/.venv/lib/python3.11/site-packages
# Copy build artifacts from exllamav2 kernels builder
COPY --from=exllamav2-kernels-builder /usr/src/exllamav2/build/lib.linux-x86_64-cpython-311 /usr/src/.venv/lib/python3.11/site-packages
# Copy build artifacts from awq kernels builder
COPY --from=awq-kernels-builder /usr/src/llm-awq/awq/kernels/build/lib.linux-x86_64-cpython-311 /usr/src/.venv/lib/python3.11/site-packages
# Copy build artifacts from mamba builder
COPY --from=mamba-builder /usr/src/mamba/build/lib.linux-x86_64-cpython-311/ /usr/src/.venv/lib/python3.11/site-packages
COPY --from=mamba-builder /usr/src/causal-conv1d/build/lib.linux-x86_64-cpython-311/ /usr/src/.venv/lib/python3.11/site-packages
COPY --from=flashinfer-builder /usr/src/.venv/lib/python3.11/site-packages/flashinfer/ /usr/src/.venv/lib/python3.11/site-packages/flashinfer/


# ENV LD_PRELOAD=/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/libnccl.so.2
# Required to find libpython within the rust binaries
# exl2 tries to load flash-attn, which fails with our builds,
# so flash-attn is disabled for exllama.
ENV EXLLAMA_NO_FLASH_ATTN=1

# Deps before the binaries
# The binaries change on every build given we burn the SHA into them
# The deps change less often.
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        build-essential \
        g++ \
        && rm -rf /var/lib/apt/lists/*

# Install benchmarker
COPY --from=builder /usr/src/target/release-opt/text-generation-benchmark /usr/local/bin/text-generation-benchmark
# Install router
COPY --from=builder /usr/src/target/release-opt/text-generation-router /usr/local/bin/text-generation-router
# Install launcher
COPY --from=builder /usr/src/target/release-opt/text-generation-launcher /usr/local/bin/text-generation-launcher


# AWS Sagemaker compatible image
FROM base AS sagemaker

COPY sagemaker-entrypoint.sh entrypoint.sh
RUN chmod +x entrypoint.sh

ENTRYPOINT ["./entrypoint.sh"]

# Final image
FROM base

COPY ./tgi-entrypoint.sh /tgi-entrypoint.sh
RUN chmod +x /tgi-entrypoint.sh

ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/"
ENTRYPOINT ["/tgi-entrypoint.sh"]
# CMD ["--json-output"]
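
# Usage sketch, assuming the build runs from the repository root. The image
# tags `tgi` and `tgi-sagemaker` are placeholders, not names this file defines.
#   docker build -t tgi --build-arg GIT_SHA=$(git rev-parse HEAD) .
#   docker build -t tgi-sagemaker --target sagemaker .
# Without --target, the last stage (the one using /tgi-entrypoint.sh) is built.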


================================================
FILE: Dockerfile.neuron
================================================
# Fetch and extract the TGI sources
FROM alpine AS tgi
RUN mkdir -p /tgi

# Fetch the optimum-neuron sources directly to avoid relying on pypi deployments
FROM alpine AS optimum-neuron
RUN mkdir -p /optimum-neuron
ADD https://github.com/huggingface/optimum-neuron/archive/refs/tags/v0.3.0.tar.gz /optimum-neuron/sources.tar.gz
RUN tar -C /optimum-neuron -xf /optimum-neuron/sources.tar.gz --strip-components=1

# Build cargo components (adapted from TGI original Dockerfile)
# Note: we cannot use the cargo-chef base image as it uses python 3.11
FROM ubuntu:22.04 AS chef

RUN apt-get update -y \
 && apt-get install -y --no-install-recommends \
    curl ca-certificates build-essential \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain 1.85.1 --profile minimal -y
ENV PATH="/root/.cargo/bin:${PATH}"
RUN cargo install cargo-chef --locked

WORKDIR /usr/src

FROM chef AS planner
COPY backends/neuron/Cargo.toml Cargo.toml
COPY Cargo.lock Cargo.lock
COPY rust-toolchain.toml rust-toolchain.toml
COPY proto proto
COPY router router
COPY backends backends
COPY launcher launcher
RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder

RUN apt-get update -y \
 && apt-get install -y --no-install-recommends \
    unzip python3-dev libssl-dev pkg-config \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

RUN PROTOC_ZIP=protoc-21.12-linux-x86_64.zip && \
    curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP && \
    unzip -o $PROTOC_ZIP -d /usr/local bin/protoc && \
    unzip -o $PROTOC_ZIP -d /usr/local 'include/*' && \
    rm -f $PROTOC_ZIP

COPY backends/neuron/Cargo.toml Cargo.toml
COPY --from=planner /usr/src/recipe.json recipe.json
RUN cargo chef cook --release --recipe-path recipe.json

COPY Cargo.lock Cargo.lock
COPY rust-toolchain.toml rust-toolchain.toml
COPY proto proto
COPY router router
COPY backends backends
COPY launcher launcher
RUN cargo build --release

# Python base image
FROM ubuntu:22.04 AS base

RUN apt-get update -y \
    && apt-get install -y --no-install-recommends \
    python3-pip \
    python3-setuptools \
    python-is-python3 \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean
RUN pip3 --no-cache-dir install --upgrade pip

# Python server build image
FROM base AS pyserver

RUN apt-get update -y \
    && apt-get install -y --no-install-recommends \
    make \
    python3-venv \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

RUN install -d /pyserver
WORKDIR /pyserver
COPY backends/neuron/server server
COPY proto proto
RUN pip3 install -r server/build-requirements.txt
RUN VERBOSE=1 BUILDDIR=/pyserver/build PROTODIR=/pyserver/proto make -C server package

# Neuron base image (used for deployment)
FROM base AS neuron

# Install system prerequisites
RUN apt-get update -y \
    && apt-get install -y --no-install-recommends \
    gnupg2 \
    wget \
    python3-dev \
    libexpat1 \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

RUN echo "deb https://apt.repos.neuron.amazonaws.com jammy main" > /etc/apt/sources.list.d/neuron.list
RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -

# Install neuronx packages
RUN apt-get update -y \
    && apt-get install -y --no-install-recommends \
    aws-neuronx-dkms=2.22.2.0 \
    aws-neuronx-collectives=2.26.43.0-47cc904ea \
    aws-neuronx-runtime-lib=2.26.42.0-2ff3b5c7d  \
    aws-neuronx-tools=2.24.54.0 \
    libxml2 \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}"

# Manually install the CPU version of torch to avoid pulling in CUDA
RUN pip3 install \
    torch==2.7.0 \
    torchvision==0.22.0 \
    --index-url https://download.pytorch.org/whl/cpu

RUN pip3 install \
    neuronx-cc==2.19.8089.0+8ab9f450 \
    torch-neuronx==2.7.0.2.8.6734+ac864f72 \
    neuronx-distributed==0.13.14393+b8569585 \
    libneuronxla==2.2.4410.0+835a67fb \
    --extra-index-url=https://pip.repos.neuron.amazonaws.com

# Install HuggingFace packages
RUN pip3 install \
    hf_transfer huggingface_hub

# Install optimum-neuron
COPY --from=optimum-neuron /optimum-neuron optimum-neuron
RUN pip3 install ./optimum-neuron

# TGI base env
ENV HUGGINGFACE_HUB_CACHE=/tmp \
    HF_HUB_ENABLE_HF_TRANSFER=1 \
    PORT=80

# Disable color logs as they are not supported by CloudWatch
ENV LOGURU_COLORIZE=NO
ENV LOG_COLORIZE=0

# Install router
COPY --from=builder /usr/src/target/release/text-generation-router-v2 /usr/local/bin/text-generation-router
# Install launcher
COPY --from=builder /usr/src/target/release/text-generation-launcher /usr/local/bin/text-generation-launcher
# Install python server
COPY --from=pyserver /pyserver/build/dist dist
RUN pip install dist/text_generation_server*.tar.gz

# Final image
FROM neuron

COPY backends/neuron/tgi_entry_point.py /tgi_entry_point.py
COPY backends/neuron/tgi-entrypoint.sh /tgi-entrypoint.sh
RUN chmod +x /tgi-entrypoint.sh

ENTRYPOINT ["/tgi-entrypoint.sh"]


================================================
FILE: Dockerfile.nix
================================================
# Build the image and extract the Docker image:
#
# docker build -t tgi-nix-builder -f Dockerfile.nix .
# docker run --log-driver=none tgi-nix-builder | docker load

FROM nixos/nix:2.18.8 AS builder
RUN echo "experimental-features = nix-command flakes" >> /etc/nix/nix.conf
RUN nix profile install nixpkgs#cachix
RUN cachix use huggingface
WORKDIR /root
ADD . .
RUN nix build .
RUN mkdir /tmp/nix-store-closure
RUN cp -R $(nix-store -qR result/) /tmp/nix-store-closure

FROM ubuntu:24.04

WORKDIR /app

# Copy /nix/store
COPY --from=builder /tmp/nix-store-closure /nix/store
COPY --from=builder /root/result /app
RUN ldconfig
CMD ["ldconfig", "/app/bin/text-generation-launcher"]


================================================
FILE: Dockerfile_amd
================================================
# Rust builder
FROM lukemathwalker/cargo-chef:latest-rust-1.85.1 AS chef
WORKDIR /usr/src

ARG CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse

FROM chef AS planner
COPY Cargo.lock Cargo.lock
COPY Cargo.toml Cargo.toml
COPY rust-toolchain.toml rust-toolchain.toml
COPY proto proto
COPY benchmark benchmark
COPY router router
COPY backends backends
COPY launcher launcher
RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    python3.11-dev
RUN PROTOC_ZIP=protoc-21.12-linux-x86_64.zip && \
    curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP && \
    unzip -o $PROTOC_ZIP -d /usr/local bin/protoc && \
    unzip -o $PROTOC_ZIP -d /usr/local 'include/*' && \
    rm -f $PROTOC_ZIP

COPY --from=planner /usr/src/recipe.json recipe.json
RUN cargo chef cook --profile release-opt --recipe-path recipe.json

ARG GIT_SHA
ARG DOCKER_LABEL

COPY Cargo.lock Cargo.lock
COPY Cargo.toml Cargo.toml
COPY rust-toolchain.toml rust-toolchain.toml
COPY proto proto
COPY benchmark benchmark
COPY router router
COPY backends backends
COPY launcher launcher
RUN cargo build --profile release-opt --frozen

FROM rocm/dev-ubuntu-22.04:6.3.1-complete AS base

ARG HIPBLASLT_BRANCH="4d40e36"
ARG HIPBLAS_COMMON_BRANCH="7c1566b"
ARG LEGACY_HIPBLASLT_OPTION=
ARG RCCL_BRANCH="648a58d"
ARG RCCL_REPO="https://github.com/ROCm/rccl"
ARG TRITON_BRANCH="e5be006"
ARG TRITON_REPO="https://github.com/triton-lang/triton.git"
ARG PYTORCH_BRANCH="3a585126"
ARG PYTORCH_VISION_BRANCH="v0.19.1"
ARG PYTORCH_REPO="https://github.com/pytorch/pytorch.git"
ARG PYTORCH_VISION_REPO="https://github.com/pytorch/vision.git"
ARG FA_BRANCH="b7d29fb"
ARG FA_REPO="https://github.com/ROCm/flash-attention.git"
ARG AITER_BRANCH="21d47a9"
ARG AITER_REPO="https://github.com/ROCm/aiter.git"

ENV PATH=/opt/rocm/llvm/bin:$PATH
ENV ROCM_PATH=/opt/rocm
ENV LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
ARG PYTORCH_ROCM_ARCH=gfx90a;gfx942
ENV PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}
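# Usage sketch: the target GPU architectures can be narrowed at build time, e.g.
#   docker build -f Dockerfile_amd --build-arg PYTORCH_ROCM_ARCH=gfx942 -t tgi-rocm .
# (`tgi-rocm` is a placeholder tag.) Note that the aiter step further down
# sets GPU_ARCHS=gfx942 directly and does not read this argument.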

ARG PYTHON_VERSION=3.11

RUN mkdir -p /app
WORKDIR /app
ENV DEBIAN_FRONTEND=noninteractive

# Install Python and other dependencies
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        ccache \
        curl \
        git \
        ninja-build \
        cmake \
        software-properties-common \
        python3.11-dev \
        python3.11-venv && \
        rm -rf /var/lib/apt/lists/*

COPY --from=ghcr.io/astral-sh/uv:0.5.31 /uv /uvx /bin/
ENV PATH="$PATH:/root/.local/bin"
RUN uv python install ${PYTHON_VERSION}
RUN uv venv --python ${PYTHON_VERSION} && uv pip install pip setuptools packaging
ENV VIRTUAL_ENV=/usr/src/.venv/
ENV PATH="$PATH:/usr/src/.venv/bin/"

RUN . .venv/bin/activate && pip install -U packaging cmake ninja wheel setuptools pybind11 Cython

FROM base AS build_hipblaslt
ARG HIPBLASLT_BRANCH
ARG HIPBLAS_COMMON_BRANCH
# Set to "--legacy_hipblas_direct" for ROCm<=6.2
ARG LEGACY_HIPBLASLT_OPTION
RUN git clone https://github.com/ROCm/hipBLAS-common.git
RUN . .venv/bin/activate && cd hipBLAS-common \
    && git checkout ${HIPBLAS_COMMON_BRANCH} \
    && mkdir build \
    && cd build \
    && cmake .. \
    && make package \
    && dpkg -i ./*.deb
RUN git clone https://github.com/ROCm/hipBLASLt
RUN . .venv/bin/activate && cd hipBLASLt \
    && git checkout ${HIPBLASLT_BRANCH} \
    && ./install.sh -d --architecture ${PYTORCH_ROCM_ARCH} ${LEGACY_HIPBLASLT_OPTION} \
    && cd build/release \
    && make package
RUN mkdir -p /app/install && cp /app/hipBLASLt/build/release/*.deb /app/hipBLAS-common/build/*.deb /app/install

FROM base AS build_rccl
ARG RCCL_BRANCH
ARG RCCL_REPO
RUN git clone ${RCCL_REPO}
RUN . .venv/bin/activate && cd rccl \
    && git checkout ${RCCL_BRANCH} \
    && ./install.sh -p --amdgpu_targets ${PYTORCH_ROCM_ARCH}
RUN mkdir -p /app/install && cp /app/rccl/build/release/*.deb /app/install

FROM base AS build_triton
ARG TRITON_BRANCH
ARG TRITON_REPO
RUN git clone ${TRITON_REPO}
RUN . .venv/bin/activate && cd triton \
    && git checkout ${TRITON_BRANCH} \
    && cd python \
    && python3 setup.py bdist_wheel --dist-dir=dist
RUN mkdir -p /app/install && cp /app/triton/python/dist/*.whl /app/install

FROM base AS build_amdsmi
RUN . .venv/bin/activate && cd /opt/rocm/share/amd_smi \
    && pip wheel . --wheel-dir=dist
RUN mkdir -p /app/install && cp /opt/rocm/share/amd_smi/dist/*.whl /app/install

FROM base AS build_pytorch
ARG PYTORCH_BRANCH
ARG PYTORCH_VISION_BRANCH
ARG PYTORCH_REPO
ARG PYTORCH_VISION_REPO
ARG FA_BRANCH
ARG FA_REPO
RUN git clone ${PYTORCH_REPO} pytorch
RUN . .venv/bin/activate && cd pytorch && git checkout ${PYTORCH_BRANCH} && \
    pip install -r requirements.txt && git submodule update --init --recursive \
    && python3 tools/amd_build/build_amd.py \
    && CMAKE_PREFIX_PATH=$(python3 -c 'import sys; print(sys.prefix)') python3 setup.py bdist_wheel --dist-dir=dist \
    && pip install dist/*.whl
RUN git clone ${PYTORCH_VISION_REPO} vision
RUN . .venv/bin/activate && cd vision && git checkout ${PYTORCH_VISION_BRANCH} \
    && python3 setup.py bdist_wheel --dist-dir=dist \
    && pip install dist/*.whl
RUN git clone ${FA_REPO}
RUN . .venv/bin/activate && cd flash-attention \
    && git checkout ${FA_BRANCH} \
    && git submodule update --init \
    && MAX_JOBS=64 GPU_ARCHS=${PYTORCH_ROCM_ARCH} python3 setup.py bdist_wheel --dist-dir=dist
RUN mkdir -p /app/install && cp /app/pytorch/dist/*.whl /app/install \
    && cp /app/vision/dist/*.whl /app/install \
    && cp /app/flash-attention/dist/*.whl /app/install

FROM base AS final
RUN --mount=type=bind,from=build_hipblaslt,src=/app/install/,target=/install \
    dpkg -i /install/*deb \
    && sed -i 's/, hipblaslt-dev \(.*\), hipcub-dev/, hipcub-dev/g' /var/lib/dpkg/status \
    && sed -i 's/, hipblaslt \(.*\), hipfft/, hipfft/g' /var/lib/dpkg/status
RUN --mount=type=bind,from=build_rccl,src=/app/install/,target=/install \
    dpkg -i /install/*deb \
    && sed -i 's/, rccl-dev \(.*\), rocalution/, rocalution/g' /var/lib/dpkg/status \
    && sed -i 's/, rccl \(.*\), rocalution/, rocalution/g' /var/lib/dpkg/status
RUN --mount=type=bind,from=build_triton,src=/app/install/,target=/install \
    . .venv/bin/activate && \
    pip install /install/*.whl
RUN --mount=type=bind,from=build_amdsmi,src=/app/install/,target=/install \
    . .venv/bin/activate && \
    pip install /install/*.whl
RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \
    . .venv/bin/activate && \
    pip install /install/*.whl

ARG AITER_REPO
ARG AITER_BRANCH
RUN git clone --recursive ${AITER_REPO}
RUN . .venv/bin/activate && cd aiter \
    && git checkout ${AITER_BRANCH} \
    && git submodule update --init --recursive \
    && pip install -r requirements.txt \
    && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop && pip show aiter

RUN rm -rf /var/lib/apt/lists/*

FROM final AS kernel-builder
# Build vllm kernels
FROM kernel-builder AS vllm-builder

COPY server/Makefile-vllm Makefile
RUN . .venv/bin/activate && pip install setuptools_scm

# Build specific version of vllm
RUN . .venv/bin/activate && make build-vllm-rocm

# Build Transformers CUDA kernels (gpt-neox and bloom)
FROM kernel-builder AS custom-kernels-builder
COPY server/custom_kernels/ .
RUN . .venv/bin/activate && python3 setup.py bdist_wheel --dist-dir=dist

# Build exllama kernels
FROM kernel-builder AS exllama-kernels-builder
COPY server/exllama_kernels/ .
RUN . .venv/bin/activate && python3 setup.py bdist_wheel --dist-dir=dist

# Build exllama v2 kernels
FROM kernel-builder AS exllamav2-kernels-builder
COPY server/exllamav2_kernels/ .
RUN . .venv/bin/activate && python3 setup.py bdist_wheel --dist-dir=dist

FROM kernel-builder AS marlin-kernels
ENV MARLIN_KERNELS_BRANCH=v0.3.6
ENV VLLM_TARGET_DEVICE=rocm
RUN . .venv/bin/activate && git clone https://github.com/danieldk/marlin-kernels.git && \
    cd marlin-kernels && \
    git checkout ${MARLIN_KERNELS_BRANCH} && \
    python3 setup.py bdist_wheel --dist-dir=dist

FROM kernel-builder AS moe-kernels
ENV MOE_KERNELS_BRANCH=v0.8.2
ENV VLLM_TARGET_DEVICE=rocm
RUN . .venv/bin/activate && git clone https://github.com/danieldk/moe-kernels.git && \
    cd moe-kernels && \
    git checkout ${MOE_KERNELS_BRANCH} && \
    python3 setup.py bdist_wheel --dist-dir=dist

FROM final AS base-copy

# Text Generation Inference base env
ENV HF_HOME=/data \
    HF_HUB_ENABLE_HF_TRANSFER=1 \
    PORT=80

ENV VIRTUAL_ENV=/app/.venv/
ENV PATH="$PATH:/app/.venv/bin/"

# Install server
COPY proto proto
COPY server server
COPY server/Makefile server/Makefile
RUN cd server && \
    uv pip install grpcio-tools mypy-protobuf && \
    uv pip install -e ".[accelerate, compressed-tensors, peft, outlines]" --no-cache-dir && \
    make gen-server-raw
RUN cd server && \
    pwd && \
    text-generation-server --help

RUN --mount=type=bind,from=vllm-builder,src=/app/vllm/dist,target=/install \
    uv pip install /install/*.whl
RUN --mount=type=bind,from=custom-kernels-builder,src=/app/dist,target=/install \
    uv pip install /install/*.whl
RUN --mount=type=bind,from=exllama-kernels-builder,src=/app/dist,target=/install \
    uv pip install /install/*.whl
RUN --mount=type=bind,from=exllamav2-kernels-builder,src=/app/dist,target=/install \
    uv pip install /install/*.whl
RUN --mount=type=bind,from=marlin-kernels,src=/app/marlin-kernels/dist,target=/install \
    uv pip install /install/*.whl
RUN --mount=type=bind,from=moe-kernels,src=/app/moe-kernels/dist,target=/install \
    uv pip install /install/*.whl

# Install benchmarker
COPY --from=builder /usr/src/target/release-opt/text-generation-benchmark /usr/local/bin/text-generation-benchmark
# Install router
COPY --from=builder /usr/src/target/release-opt/text-generation-router /usr/local/bin/text-generation-router
# Install launcher
COPY --from=builder /usr/src/target/release-opt/text-generation-launcher /usr/local/bin/text-generation-launcher

# AWS Sagemaker compatible image
FROM base AS sagemaker

COPY sagemaker-entrypoint.sh entrypoint.sh
RUN chmod +x entrypoint.sh

ENTRYPOINT ["./entrypoint.sh"]

# Final image
FROM base-copy

# Set as recommended: https://github.com/ROCm/triton/wiki/A-script-to-set-program-execution-environment-in-ROCm
ENV HIP_FORCE_DEV_KERNARG=1

# On MI250 and MI300, flash attention performance with Triton FA is slightly better than with CK.
# However, Triton requires tuning for each prompt length, which is prohibitive.
ENV ROCM_USE_FLASH_ATTN_V2_TRITON=0
ENV ROCM_USE_CUSTOM_PAGED_ATTN=1
ENV PYTORCH_TUNABLEOP_TUNING_AFTER_WARMUP=0
ENV VLLM_MOE_PADDING=0
ENV ATTENTION=paged
ENV PREFIX_CACHING=0
ENV PREFILL_CHUNKING=0
ENV ROCM_USE_SKINNY_GEMM=1

COPY ./tgi-entrypoint.sh /tgi-entrypoint.sh
RUN chmod +x /tgi-entrypoint.sh

ENTRYPOINT ["/tgi-entrypoint.sh"]
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib"
ENV PYTHONPATH=/app/.venv/lib/python3.11/site-packages
# CMD ["--json-output"]


================================================
FILE: Dockerfile_gaudi
================================================
# These arguments are required to build the image
ARG HABANA_VERSION=1.21.0
ARG PYTORCH_VERSION=2.6.0
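# Usage sketch: both defaults can be overridden at build time, e.g.
#   docker build -f Dockerfile_gaudi --build-arg HABANA_VERSION=1.21.0 --build-arg PYTORCH_VERSION=2.6.0 -t tgi-gaudi .
# (`tgi-gaudi` is a placeholder tag; the values shown are the defaults above.)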

# Rust builder
FROM lukemathwalker/cargo-chef:latest-rust-1.85.1 AS chef
WORKDIR /usr/src

ARG CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse

FROM chef AS planner
COPY Cargo.lock Cargo.lock
COPY Cargo.toml Cargo.toml
COPY rust-toolchain.toml rust-toolchain.toml
COPY proto proto
COPY benchmark benchmark
COPY router router
COPY backends backends
COPY launcher launcher
RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder

ENV PYO3_PYTHON="/root/.local/bin/python" \
    PYTHON_SYS_EXECUTABLE="/root/.local/bin/python" \
    PYO3_PYTHON_VERSION="3.10"

RUN curl -LsSf https://astral.sh/uv/install.sh | sh \
    && . $HOME/.local/bin/env \
    && uv python install 3.10 --default --preview \
    && test -f /root/.local/bin/python || (echo "Python 3.10 not found at /root/.local/bin/python" && exit 1)

RUN PROTOC_ZIP=protoc-21.12-linux-x86_64.zip && \
    curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP && \
    unzip -o $PROTOC_ZIP -d /usr/local bin/protoc && \
    unzip -o $PROTOC_ZIP -d /usr/local 'include/*' && \
    rm -f $PROTOC_ZIP

COPY --from=planner /usr/src/recipe.json recipe.json
RUN cargo chef cook --profile release-opt --recipe-path recipe.json

ARG GIT_SHA
ARG DOCKER_LABEL

COPY Cargo.toml Cargo.toml
COPY rust-toolchain.toml rust-toolchain.toml
COPY proto proto
COPY benchmark benchmark
COPY router router
COPY backends backends
COPY launcher launcher
RUN cargo build --profile release-opt

# Text Generation Inference base image
ARG HABANA_VERSION
ARG PYTORCH_VERSION

FROM vault.habana.ai/gaudi-docker/${HABANA_VERSION}/ubuntu22.04/habanalabs/pytorch-installer-${PYTORCH_VERSION}:latest AS base

ENV ATTENTION=paged
ENV PREFIX_CACHING=0
ENV PREFILL_CHUNKING=0
ENV PT_HPU_LAZY_MODE=1
ENV PT_HPU_WEIGHT_SHARING=0
ENV VLLM_EXPONENTIAL_BUCKETING=true

# Text Generation Inference base env
ENV HF_HOME=/data \
    HF_HUB_ENABLE_HF_TRANSFER=1 \
    PORT=80

# Assert that Python 3.10 is installed as the launcher is compiled with Python 3.10
RUN python3.10 --version || (echo "Python 3.10 is not installed" && exit 1)

# libssl.so.1.1 is not installed on Ubuntu 22.04 by default, install it
RUN wget http://nz2.archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2_amd64.deb && \
    dpkg -i ./libssl1.1_1.1.1f-1ubuntu2_amd64.deb

WORKDIR /usr/src

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        libssl-dev \
        ca-certificates \
        make \
        curl \
        git \
        && rm -rf /var/lib/apt/lists/*

# Install server
COPY proto proto
COPY backends/gaudi/server server
COPY backends/gaudi/server/Makefile server/Makefile
ARG HABANA_VERSION
RUN cd server && \
    make gen-server && \
    pip install --no-deps -r requirements.txt && \
    bash ./dill-0.3.8-patch.sh && \
    pip install . --no-cache-dir
RUN pip install git+https://github.com/sywangyi/vllm-hpu-extension.git@bmax_fix
RUN pip install compressed-tensors==0.9.1

# Install benchmarker
COPY --from=builder /usr/src/target/release-opt/text-generation-benchmark /usr/local/bin/text-generation-benchmark
# Install router
COPY --from=builder /usr/src/target/release-opt/text-generation-router /usr/local/bin/text-generation-router
# Install launcher
COPY --from=builder /usr/src/target/release-opt/text-generation-launcher /usr/local/bin/text-generation-launcher


# AWS Sagemaker compatible image
FROM base AS sagemaker

COPY sagemaker-entrypoint.sh entrypoint.sh
RUN chmod +x entrypoint.sh

ENTRYPOINT ["./entrypoint.sh"]

# Final image
FROM base

ENV HF_HUB_ENABLE_HF_TRANSFER=1
ENV HABANA_VISIBLE_DEVICES=all
ENV OMPI_MCA_btl_vader_single_copy_mechanism=NONE

COPY backends/gaudi/tgi-entrypoint.sh /tgi-entrypoint.sh
RUN chmod +x /tgi-entrypoint.sh

ENTRYPOINT ["/tgi-entrypoint.sh"]
CMD ["--json-output"]


================================================
FILE: Dockerfile_intel
================================================
ARG PLATFORM=xpu

FROM lukemathwalker/cargo-chef:latest-rust-1.85.1 AS chef
WORKDIR /usr/src

ARG CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse

FROM chef AS planner
COPY Cargo.lock Cargo.lock
COPY Cargo.toml Cargo.toml
COPY rust-toolchain.toml rust-toolchain.toml
COPY proto proto
COPY benchmark benchmark
COPY router router
COPY backends backends
COPY launcher launcher
RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    python3.11-dev
RUN PROTOC_ZIP=protoc-21.12-linux-x86_64.zip && \
    curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP && \
    unzip -o $PROTOC_ZIP -d /usr/local bin/protoc && \
    unzip -o $PROTOC_ZIP -d /usr/local 'include/*' && \
    rm -f $PROTOC_ZIP

COPY --from=planner /usr/src/recipe.json recipe.json
RUN cargo chef cook --profile release-opt --recipe-path recipe.json

ARG GIT_SHA
ARG DOCKER_LABEL

COPY Cargo.lock Cargo.lock
COPY Cargo.toml Cargo.toml
COPY rust-toolchain.toml rust-toolchain.toml
COPY proto proto
COPY benchmark benchmark
COPY router router
COPY backends backends
COPY launcher launcher
RUN cargo build --profile release-opt --frozen


# Text Generation Inference base image for Intel

FROM intel/oneapi-basekit:2025.1.3-0-devel-ubuntu22.04 AS xpu

USER root

ARG MAMBA_VERSION=23.1.0-1
ARG PYTHON_VERSION='3.11.10'
# Automatically set by buildx
ARG TARGETPLATFORM
ENV PATH=/opt/conda/bin:$PATH

# TGI seems to require libssl.so.1.1 instead of libssl.so.3, so we can't use Ubuntu 22.04. Ubuntu 20.04 has python==3.8, and TGI requires python>=3.9, hence the need for miniconda.
# Install mamba
# translating Docker's TARGETPLATFORM into mamba arches
RUN case ${TARGETPLATFORM} in \
         "linux/arm64")  MAMBA_ARCH=aarch64  ;; \
         *)              MAMBA_ARCH=x86_64   ;; \
    esac && \
    curl -fsSL -v -o ~/mambaforge.sh -O  "https://github.com/conda-forge/miniforge/releases/download/${MAMBA_VERSION}/Mambaforge-${MAMBA_VERSION}-Linux-${MAMBA_ARCH}.sh"
RUN chmod +x ~/mambaforge.sh && \
    bash ~/mambaforge.sh -b -p /opt/conda && \
    rm ~/mambaforge.sh

RUN case ${TARGETPLATFORM} in \
         "linux/arm64")  exit 1 ;; \
         *)              /opt/conda/bin/conda update -y conda &&  \
                         /opt/conda/bin/conda install -y "python=${PYTHON_VERSION}" ;; \
    esac && \
    /opt/conda/bin/conda clean -ya

# libssl.so.1.1 is not installed on Ubuntu 22.04 by default, install it
RUN wget http://nz2.archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2_amd64.deb && \
    dpkg -i ./libssl1.1_1.1.1f-1ubuntu2_amd64.deb

RUN wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | gpg --dearmor | tee /usr/share/keyrings/intel-graphics.gpg > /dev/null

RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null && echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | tee /etc/apt/sources.list.d/oneAPI.list

RUN echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/intel-for-pytorch-gpu-dev all main" > /tmp/intel-for-pytorch-gpu-dev.list

RUN mv /tmp/intel-for-pytorch-gpu-dev.list /etc/apt/sources.list.d

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt install -y xpu-smi cmake ninja-build pciutils intel-ocloc libnl-genl-3-200

# Text Generation Inference base env
ENV HF_HOME=/data \
    HF_HUB_ENABLE_HF_TRANSFER=1 \
    PORT=80




WORKDIR /usr/src

RUN pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/xpu

# Install server
COPY proto proto
COPY server server
COPY server/Makefile server/Makefile
ENV UV_SYSTEM_PYTHON=1
RUN cd server && \
    make gen-server && \
    pip install -U pip uv && \
    uv pip install -e ".[accelerate, compressed-tensors, peft, outlines]" --no-cache-dir

ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/lib
ENV CCL_ZE_IPC_EXCHANGE=sockets
ENV TORCH_LLM_ALLREDUCE=1
ENV CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0
ENV TORCH_DEVICE_BACKEND_AUTOLOAD=0

RUN pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.8.10%2Bxpu-cp311-cp311-linux_x86_64.whl
# Install benchmarker
COPY --from=builder /usr/src/target/release-opt/text-generation-benchmark /usr/local/bin/text-generation-benchmark
# Install router
COPY --from=builder /usr/src/target/release-opt/text-generation-router /usr/local/bin/text-generation-router
# Install launcher
COPY --from=builder /usr/src/target/release-opt/text-generation-launcher /usr/local/bin/text-generation-launcher


# Text Generation Inference base image for Intel-cpu
FROM ubuntu:22.04 AS cpu

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    curl \
    ca-certificates \
    make \
    g++-12 \
    gcc-12 \
    git \
    wget \
    cmake \
    libnuma-dev

RUN update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 12
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12
RUN update-alternatives --install /usr/bin/cc cc /usr/bin/gcc 30
RUN update-alternatives --set cc /usr/bin/gcc

RUN update-alternatives --install /usr/bin/c++ c++ /usr/bin/g++ 30
RUN update-alternatives --set c++ /usr/bin/g++


ENV HUGGINGFACE_HUB_CACHE=/data \
    HF_HUB_ENABLE_HF_TRANSFER=1 \
    PORT=80

ARG MAMBA_VERSION=23.1.0-1
ARG PYTHON_VERSION='3.11.10'
# Automatically set by buildx
ARG TARGETPLATFORM
ENV PATH=/opt/conda/bin:$PATH

# TGI seems to require libssl.so.1.1 instead of libssl.so.3, so we can't use Ubuntu 22.04. Ubuntu 20.04 has python==3.8, and TGI requires python>=3.9, hence the need for miniconda.
# Install mamba
# translating Docker's TARGETPLATFORM into mamba arches
RUN case ${TARGETPLATFORM} in \
         "linux/arm64")  MAMBA_ARCH=aarch64  ;; \
         *)              MAMBA_ARCH=x86_64   ;; \
    esac && \
    curl -fsSL -v -o ~/mambaforge.sh -O  "https://github.com/conda-forge/miniforge/releases/download/${MAMBA_VERSION}/Mambaforge-${MAMBA_VERSION}-Linux-${MAMBA_ARCH}.sh"
RUN chmod +x ~/mambaforge.sh && \
    bash ~/mambaforge.sh -b -p /opt/conda && \
    rm ~/mambaforge.sh

RUN case ${TARGETPLATFORM} in \
         "linux/arm64")  exit 1 ;; \
         *)              /opt/conda/bin/conda update -y conda &&  \
                         /opt/conda/bin/conda install -y "python=${PYTHON_VERSION}" ;; \
    esac && \
    /opt/conda/bin/conda clean -ya

RUN conda install -c conda-forge gperftools mkl

RUN pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cpu
RUN pip install triton==3.2.0 py-libnuma

WORKDIR /usr/src

RUN pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/cpu/intel_extension_for_pytorch-2.7.0%2Bcpu-cp311-cp311-linux_x86_64.whl
RUN pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/cpu/oneccl_bind_pt-2.7.0%2Bcpu-cp311-cp311-linux_x86_64.whl


ENV LD_PRELOAD=/opt/conda/lib/libtcmalloc.so
ENV CCL_ROOT=/opt/conda/lib/python3.11/site-packages/oneccl_bindings_for_pytorch
ENV I_MPI_ROOT=/opt/conda/lib/python3.11/site-packages/oneccl_bindings_for_pytorch
ENV FI_PROVIDER_PATH=/opt/conda/lib/python3.11/site-packages/oneccl_bindings_for_pytorch/opt/mpi/libfabric/lib/prov:/usr/lib64/libfabric
ENV LD_LIBRARY_PATH=/opt/conda/lib/python3.11/site-packages/oneccl_bindings_for_pytorch/opt/mpi/libfabric/lib:/opt/conda/lib/python3.11/site-packages/oneccl_bindings_for_pytorch/lib
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/opt/conda/lib/"

# Install server
COPY proto proto
COPY server server
COPY server/Makefile server/Makefile
ENV UV_SYSTEM_PYTHON=1
RUN cd server && \
    make gen-server && \
    pip install -U pip uv && \
    uv pip install -e ".[accelerate, compressed-tensors, peft, outlines]" --no-cache-dir

# Install benchmarker
COPY --from=builder /usr/src/target/release-opt/text-generation-benchmark /usr/local/bin/text-generation-benchmark
# Install router
COPY --from=builder /usr/src/target/release-opt/text-generation-router /usr/local/bin/text-generation-router
# Install launcher
COPY --from=builder /usr/src/target/release-opt/text-generation-launcher /usr/local/bin/text-generation-launcher

FROM ${PLATFORM} AS final
ENV ATTENTION=flashdecoding-ipex
ENV PREFIX_CACHING=1
ENV PREFILL_CHUNKING=1
ENV CUDA_GRAPHS=0
ENTRYPOINT ["text-generation-launcher"]
CMD ["--json-output"]


================================================
FILE: Dockerfile_llamacpp
================================================
FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04 AS deps

ARG llamacpp_version=b4827
ARG llamacpp_cuda=OFF
ARG llamacpp_native=ON
ARG llamacpp_cpu_arm_arch=native
ARG cuda_arch=75-real;80-real;86-real;89-real;90-real

WORKDIR /opt/src

ENV DEBIAN_FRONTEND=noninteractive
RUN apt update && apt upgrade -y && apt install -y \
    clang \
    cmake \
    curl \
    git \
    python3-dev \
    libssl-dev \
    pkg-config \
    tar

ADD https://github.com/ggml-org/llama.cpp/archive/refs/tags/${llamacpp_version}.tar.gz /opt/src/
RUN mkdir -p llama.cpp \
 && tar -xzf ${llamacpp_version}.tar.gz -C llama.cpp --strip-components=1 \
 && cd llama.cpp \
 && cmake -B build \
    -DCMAKE_INSTALL_PREFIX=/usr \
    -DCMAKE_INSTALL_LIBDIR=/usr/lib \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DCMAKE_CUDA_ARCHITECTURES=${cuda_arch} \
    -DGGML_CUDA=${llamacpp_cuda} \
    -DGGML_NATIVE=${llamacpp_native} \
    -DGGML_CPU_ARM_ARCH=${llamacpp_cpu_arm_arch} \
    -DLLAMA_BUILD_COMMON=OFF \
    -DLLAMA_BUILD_TESTS=OFF \
    -DLLAMA_BUILD_EXAMPLES=OFF \
    -DLLAMA_BUILD_SERVER=OFF \
 && cmake --build build --parallel --config Release \
 && cmake --install build

WORKDIR /app
COPY rust-toolchain.toml rust-toolchain.toml
RUN curl -sSf https://sh.rustup.rs | sh -s -- --no-modify-path --default-toolchain 1.85.1 --profile minimal -y
ENV PATH="/root/.cargo/bin:$PATH"
RUN cargo install cargo-chef --locked

FROM deps AS planner
COPY . .
RUN cargo chef prepare --recipe-path recipe.json

FROM deps AS builder
COPY --from=planner /app/recipe.json recipe.json
RUN cargo chef cook \
    --recipe-path recipe.json \
    --profile release \
    --package text-generation-router-llamacpp
COPY . .
RUN cargo build \
    --profile release \
    --package text-generation-router-llamacpp --frozen

FROM nvidia/cuda:12.8.0-cudnn-runtime-ubuntu24.04
WORKDIR /app

ENV DEBIAN_FRONTEND=noninteractive
RUN apt update && apt upgrade -y && apt install -y \
    python3-venv \
    python3-pip

RUN python3 -m venv /venv
ENV PATH="/venv/bin:$PATH"

COPY backends/llamacpp/requirements.txt requirements.txt
COPY --from=builder /opt/src/llama.cpp/gguf-py gguf-py
COPY --from=builder /opt/src/llama.cpp/convert_hf_to_gguf.py /bin/

RUN pip3 install --no-cache-dir \
    -r requirements.txt \
    -e gguf-py

COPY --from=builder /usr/lib/libllama.so /usr/lib/
COPY --from=builder /usr/lib/libggml*.so /usr/lib/
COPY --from=builder /app/target/release/text-generation-router-llamacpp /usr/bin/

ENV HF_HUB_ENABLE_HF_TRANSFER=1

ENTRYPOINT ["text-generation-router-llamacpp"]


================================================
FILE: Dockerfile_trtllm
================================================
ARG cuda_arch_list="75-real;80-real;86-real;89-real;90-real;100-real;120-real"
ARG cuda_base=12.8.0
ARG build_type=release
ARG ompi_version=4.1.7
ARG sccache_gha_enabled=off
ARG actions_results_url=""
ARG actions_runtime_token=""

# CUDA dependent dependencies resolver stage
FROM nvidia/cuda:${cuda_base}-cudnn-devel-ubuntu24.04 AS cuda-builder

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
    build-essential \
    cmake \
    curl \
    gcc-14  \
    g++-14 \
    git \
    git-lfs \
    lld \
    libssl-dev \
    libucx-dev \
    libasan8 \
    libubsan1 \
    ninja-build \
    pkg-config \
    pipx \
    python3 \
    python3-dev \
    python3-setuptools \
    tar \
    wget --no-install-recommends && \
    pipx ensurepath

ENV TGI_INSTALL_PREFIX=/usr/local/tgi
ENV TENSORRT_INSTALL_PREFIX=/usr/local/tensorrt

# Install OpenMPI
FROM cuda-builder AS mpi-builder
WORKDIR /opt/src/mpi

ARG ompi_version
ENV OMPI_VERSION=${ompi_version}
ENV OMPI_TARBALL_FILENAME=openmpi-${OMPI_VERSION}.tar.bz2
ADD --checksum=sha256:54a33cb7ad81ff0976f15a6cc8003c3922f0f3d8ceed14e1813ef3603f22cd34 \
    https://download.open-mpi.org/release/open-mpi/v4.1/${OMPI_TARBALL_FILENAME} .

RUN tar --strip-components=1 -xf ${OMPI_TARBALL_FILENAME} &&\
    ./configure --prefix=/usr/local/mpi --with-cuda=/usr/local/cuda --with-slurm && \
    make -j all && \
    make install && \
    rm -rf ${OMPI_TARBALL_FILENAME}/..

# Install TensorRT
FROM cuda-builder AS trt-builder
COPY backends/trtllm/scripts/install_tensorrt.sh /opt/install_tensorrt.sh
RUN chmod +x /opt/install_tensorrt.sh && \
    /opt/install_tensorrt.sh

# Build Backend
FROM cuda-builder AS tgi-builder
WORKDIR /usr/src/text-generation-inference

# Scoped global args reuse
ARG cuda_arch_list
ARG build_type
ARG sccache_gha_enabled

# Install Rust
ENV PATH="/root/.cargo/bin:$PATH"
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain 1.85.1 --profile minimal -y && \
    chmod -R a+w /root/.rustup && \
    chmod -R a+w /root/.cargo && \
    cargo install sccache --version ">=0.10.0" --locked

ENV LD_LIBRARY_PATH="/usr/local/mpi/lib:$LD_LIBRARY_PATH"
ENV PKG_CONFIG_PATH="/usr/local/mpi/lib/pkgconfig"
ENV CMAKE_PREFIX_PATH="/usr/local/mpi:/usr/local/tensorrt"

ENV USE_LLD_LINKER=ON
ENV CUDA_ARCH_LIST=${cuda_arch_list}

# SCCACHE Specifics args - before finding a better, more generic, way...
ENV SCCACHE_GHA_ENABLED=${sccache_gha_enabled}

COPY Cargo.lock Cargo.lock
COPY Cargo.toml Cargo.toml
COPY rust-toolchain.toml rust-toolchain.toml
COPY router router
COPY backends backends
COPY benchmark benchmark
COPY launcher launcher
COPY --from=trt-builder /usr/local/tensorrt /usr/local/tensorrt
COPY --from=mpi-builder /usr/local/mpi /usr/local/mpi

ENV RUSTC_WRAPPER=sccache
ENV CMAKE_INSTALL_PREFIX=$TGI_INSTALL_PREFIX
RUN --mount=type=secret,id=actions_results_url,env=ACTIONS_RESULTS_URL \
    --mount=type=secret,id=actions_runtime_token,env=ACTIONS_RUNTIME_TOKEN \
    export CMAKE_C_COMPILER_LAUNCHER=sccache && \
    export CMAKE_CXX_COMPILER_LAUNCHER=sccache && \
    export CMAKE_CUDA_COMPILER_LAUNCHER=sccache && \
    mkdir $TGI_INSTALL_PREFIX && mkdir "$TGI_INSTALL_PREFIX/include" && mkdir "$TGI_INSTALL_PREFIX/lib" && \
    cargo build --profile ${build_type} --package text-generation-backends-trtllm --bin text-generation-backends-trtllm && \
    sccache --show-stats

FROM nvidia/cuda:${cuda_base}-cudnn-runtime-ubuntu24.04 AS runtime
RUN apt update && apt install -y libucx0 pipx python3-minimal python3-dev python3-pip python3-venv && \
    rm -rf /var/lib/{apt,dpkg,cache,log}/ && \
    pipx ensurepath && \
    pipx install --include-deps transformers tokenizers

WORKDIR /usr/local/tgi/bin

ENV PATH=/root/.local/share/pipx/venvs/transformers/bin/:$PATH
ENV LD_LIBRARY_PATH="/usr/local/tgi/lib:/usr/local/mpi/lib:/usr/local/tensorrt/lib:/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH"
ENV TOKENIZERS_PARALLELISM=false
ENV OMPI_MCA_plm_rsh_agent=""

COPY --from=mpi-builder /usr/local/mpi /usr/local/mpi
COPY --from=trt-builder /usr/local/tensorrt /usr/local/tensorrt
COPY --from=tgi-builder /usr/local/tgi /usr/local/tgi
COPY --from=tgi-builder /usr/src/text-generation-inference/target/release/text-generation-backends-trtllm /usr/local/tgi/bin/text-generation-launcher

# This is used only for the CI/CD
FROM nvidia/cuda:${cuda_base}-cudnn-runtime-ubuntu24.04 AS ci-runtime
RUN apt update && apt install -y libasan8 libubsan1 libucx0 pipx python3-minimal python3-dev python3-pip python3-venv && \
    rm -rf /var/lib/{apt,dpkg,cache,log}/ && \
    pipx ensurepath && \
    pipx install --include-deps transformers tokenizers

WORKDIR /usr/local/tgi/bin

ENV PATH=/root/.local/share/pipx/venvs/transformers/bin/:$PATH
ENV LD_LIBRARY_PATH="/usr/local/tgi/lib:/usr/local/mpi/lib:/usr/local/tensorrt/lib:/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH"
ENV TOKENIZERS_PARALLELISM=false
ENV OMPI_MCA_plm_rsh_agent=""

COPY --from=mpi-builder /usr/local/mpi /usr/local/mpi
COPY --from=trt-builder /usr/local/tensorrt /usr/local/tensorrt
COPY --from=tgi-builder /usr/local/tgi /usr/local/tgi

# Basically we copy from target/debug instead of target/release
COPY --from=tgi-builder /usr/src/text-generation-inference/target/debug/text-generation-backends-trtllm /usr/local/tgi/bin/text-generation-launcher

# This is the final image
FROM runtime

LABEL co.huggingface.vendor="Hugging Face Inc."
LABEL org.opencontainers.image.authors="hardware@hf.co"
LABEL org.opencontainers.image.title="Text-Generation-Inference TensorRT-LLM Backend"

ENTRYPOINT ["./text-generation-launcher"]
CMD ["--executor-worker", "/usr/local/tgi/bin/executorWorker"]


================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright 2022 Hugging Face

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: Makefile
================================================
install-server:
	cd server && make install

install-server-cpu:
	cd server && make install-server

install-router:
	cargo install --path backends/v3/

install-launcher:
	cargo install --path launcher/

install-benchmark:
	cargo install --path benchmark/

install: install-server install-router install-launcher


install-cpu: install-server-cpu install-router install-launcher

server-dev:
	cd server && make run-dev

router-dev:
	cd router && cargo run -- --port 8080

rust-tests: install-router install-launcher
	cargo test

install-integration-tests:
	cd integration-tests && pip install -r requirements.txt
	cd clients/python && pip install .

integration-tests: install-integration-tests
	pytest -s -vv -m "not private" integration-tests

update-integration-tests: install-integration-tests
	pytest -s -vv --snapshot-update integration-tests

python-server-tests:
	HF_HUB_ENABLE_HF_TRANSFER=1 pytest -s -vv -m "not private" server/tests

python-client-tests:
	pytest clients/python/tests

python-tests: python-server-tests python-client-tests

run-falcon-7b-instruct:
	text-generation-launcher --model-id tiiuae/falcon-7b-instruct --port 8080

run-falcon-7b-instruct-quantize:
	text-generation-launcher --model-id tiiuae/falcon-7b-instruct --quantize bitsandbytes --port 8080

clean:
	rm -rf target aml

preview_doc:
	doc-builder preview text-generation-inference docs/source --not_python_module


================================================
FILE: README.md
================================================
> [!CAUTION]
> text-generation-inference is now in maintenance mode. Going forward, we will accept pull requests for minor bug fixes, documentation improvements and lightweight maintenance tasks.
>
> TGI has initiated the movement for optimized inference engines to rely on `transformers` model architectures. This approach is now adopted by downstream inference engines, which we contribute to and recommend using going forward: [vllm](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), as well as local engines with inter-compatibility such as llama.cpp or MLX.

<div align="center">

<a href="https://www.youtube.com/watch?v=jlMAX2Oaht0">
  <img width=560 alt="Making TGI deployment optimal" src="https://huggingface.co/datasets/Narsil/tgi_assets/resolve/main/thumbnail.png">
</a>

# Text Generation Inference

<a href="https://github.com/huggingface/text-generation-inference">
  <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/huggingface/text-generation-inference?style=social">
</a>
<a href="https://huggingface.github.io/text-generation-inference">
  <img alt="Swagger API documentation" src="https://img.shields.io/badge/API-Swagger-informational">
</a>

A Rust, Python and gRPC server for text generation inference. Used in production at [Hugging Face](https://huggingface.co)
to power Hugging Chat, the Inference API and Inference Endpoints.

</div>

## Table of contents

  - [Get Started](#get-started)
    - [Docker](#docker)
    - [API documentation](#api-documentation)
    - [Using a private or gated model](#using-a-private-or-gated-model)
    - [A note on Shared Memory (shm)](#a-note-on-shared-memory-shm)
    - [Distributed Tracing](#distributed-tracing)
    - [Architecture](#architecture)
    - [Local install](#local-install)
    - [Local install (Nix)](#local-install-nix)
  - [Optimized architectures](#optimized-architectures)
  - [Run locally](#run-locally)
    - [Run](#run)
    - [Quantization](#quantization)
  - [Develop](#develop)
  - [Testing](#testing)

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and [more](https://huggingface.co/docs/text-generation-inference/supported_models). TGI implements many features, such as:

- Simple launcher to serve most popular LLMs
- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- [Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) compatible with the OpenAI Chat Completion API
- Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
- Quantization with:
  - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
  - [GPT-Q](https://arxiv.org/abs/2210.17323)
  - [EETQ](https://github.com/NetEase-FuXi/EETQ)
  - [AWQ](https://github.com/casper-hansen/AutoAWQ)
  - [Marlin](https://github.com/IST-DASLab/marlin)
  - [fp8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details, see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
- Stop sequences
- Log probabilities
- [Speculation](https://huggingface.co/docs/text-generation-inference/conceptual/speculation) for roughly 2x lower latency
- [Guidance/JSON](https://huggingface.co/docs/text-generation-inference/conceptual/guidance). Specify an output format to speed up inference and make sure the output is valid according to a given specification.
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance

### Hardware support

- [Nvidia](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference)
- [AMD](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference) (-rocm)
- [Inferentia](https://github.com/huggingface/optimum-neuron/tree/main/text-generation-inference)
- [Intel GPU](https://github.com/huggingface/text-generation-inference/pull/1475)
- [Gaudi](https://github.com/huggingface/tgi-gaudi)
- [Google TPU](https://huggingface.co/docs/optimum-tpu/howto/serving)


## Get Started

### Docker

For a detailed starting guide, please see the [Quick Tour](https://huggingface.co/docs/text-generation-inference/quicktour). The easiest way of getting started is using the official Docker container:

```shell
model=HuggingFaceH4/zephyr-7b-beta
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.5 --model-id $model
```

You can then make requests like this:

```bash
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
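
On the client side, the `/generate_stream` response arrives as Server-Sent Events. The sketch below uses only the Python standard library to pull the token text out of such a stream. It assumes each event is a single `data:` line carrying a JSON object with a `token.text` field; check the `/docs` route for the exact schema of your TGI version. The helper names (`parse_sse_line`, `token_texts`, `stream_generate`) are illustrative and not part of TGI.

```python
import json
import urllib.request


def parse_sse_line(line):
    """Return the decoded JSON payload of one SSE `data:` line, or None."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, other SSE fields
    return json.loads(line[len("data:"):].strip())


def token_texts(lines):
    """Yield token text from an iterable of SSE lines (assumed event shape)."""
    for line in lines:
        event = parse_sse_line(line)
        if event is not None and "token" in event:
            yield event["token"]["text"]


def stream_generate(prompt, url="http://127.0.0.1:8080/generate_stream", max_new_tokens=20):
    """POST the same body as the curl example and yield tokens as they arrive."""
    body = json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}})
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # The HTTP response object can be iterated line by line.
        yield from token_texts(raw.decode("utf-8") for raw in response)
```

With a server running, `for text in stream_generate("What is Deep Learning?"): print(text, end="")` would print tokens as they are generated.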

You can also use [TGI's Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) to obtain responses compatible with the OpenAI Chat Completions API.

```bash
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```
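
The same chat request can be built from Python. The sketch below constructs the request with the standard library and extracts text from streamed chunks. It assumes the stream follows the OpenAI-style format, with `data:` lines whose JSON carries `choices[0].delta.content` and a final `data: [DONE]` sentinel; the function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://localhost:8080", max_tokens=20, stream=True):
    """Build a POST request mirroring the curl example above."""
    payload = {"model": "tgi", "messages": messages, "stream": stream, "max_tokens": max_tokens}
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def delta_text(sse_line):
    """Return the text carried by one streamed chunk line, or '' if there is none."""
    line = sse_line.strip()
    if not line.startswith("data:"):
        return ""
    data = line[len("data:"):].strip()
    if data == "[DONE]":  # assumed end-of-stream sentinel
        return ""
    choices = json.loads(data).get("choices") or []
    if not choices:
        return ""
    return choices[0].get("delta", {}).get("content") or ""


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning?"},
]
request = build_chat_request(messages)
```

Passing `request` to `urllib.request.urlopen` and applying `delta_text` to each decoded line reassembles the streamed answer.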

**Note:** To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, remove the `--gpus all` flag and add `--disable-custom-kernels`. Note that CPU is not the intended platform for this project, so performance might be subpar.

**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/installation_amd#using-tgi-with-amd-gpus). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.3.5-rocm --model-id $model` instead of the command above.

To see all options to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the CLI):
```
text-generation-launcher --help
```

### API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).

### Using a private or gated model

Set the `HF_TOKEN` environment variable to configure the token that `text-generation-inference` uses
to access protected resources.

For example, if you want to serve the gated Llama model variants:

1. Go to https://huggingface.co/settings/tokens
2. Copy your CLI READ token
3. Export `HF_TOKEN=<your CLI READ token>`

or with Docker:

```shell
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.5 --model-id $model
```
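
The Docker commands in this section differ only in a few flags (GPU devices, image tag, optional token). As a rough sketch, a small Python helper can assemble the argument list from those choices. Every flag comes from the commands above; the function name `docker_run_args` and its parameters are hypothetical.

```python
import shlex


def docker_run_args(model, volume, token=None, rocm=False, version="3.3.5"):
    """Assemble the `docker run` arguments used in this README."""
    image = "ghcr.io/huggingface/text-generation-inference:" + version
    if rocm:
        # AMD GPUs: expose the ROCm devices and use the -rocm image.
        devices = ["--device", "/dev/kfd", "--device", "/dev/dri"]
        image += "-rocm"
    else:
        devices = ["--gpus", "all"]
    args = ["docker", "run", *devices, "--shm-size", "1g"]
    if token:
        # Only needed for private or gated models.
        args += ["-e", "HF_TOKEN=" + token]
    args += ["-p", "8080:80", "-v", volume + ":/data", image, "--model-id", model]
    return args


# Print a copy-pasteable command line for the default NVIDIA case.
print(shlex.join(docker_run_args("HuggingFaceH4/zephyr-7b-beta", "/tmp/data")))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting issues with the token.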

### A note on Shared Memory (shm)

[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
`PyTorch` to do distributed training/inference. `text-generation-inference` makes
use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.

To share data between the different devices of an `NCCL` group, `NCCL` might fall back to using host memory if
peer-to-peer communication over NVLink or PCI is not possible.

To allow the container to use 1G of shared memory and support SHM sharing, we add `--shm-size 1g` to the command above.

If you are running `text-generation-inference` inside `Kubernetes`, you can also add shared memory to the container by
creating a volume with:

```yaml
- name: shm
  emptyDir:
   medium: Memory
   sizeLimit: 1Gi
```

and mounting it to `/dev/shm`.
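
To make the two halves of that change concrete (the volume and its mount), here is a minimal Python sketch that patches a pod spec held as a dictionary. The volume definition matches the YAML fragment above and `volumeMounts` is the standard Kubernetes container field; the container name `tgi` and the helper `add_shm` are assumptions made for illustration.

```python
import copy
import json


def add_shm(pod_spec, container_name, size_limit="1Gi"):
    """Return a copy of pod_spec with a memory-backed volume mounted at /dev/shm."""
    spec = copy.deepcopy(pod_spec)
    # Same volume definition as the YAML fragment above.
    spec.setdefault("volumes", []).append(
        {"name": "shm", "emptyDir": {"medium": "Memory", "sizeLimit": size_limit}}
    )
    for container in spec.get("containers", []):
        if container["name"] == container_name:
            container.setdefault("volumeMounts", []).append(
                {"name": "shm", "mountPath": "/dev/shm"}
            )
    return spec


# Hypothetical pod spec; the container name "tgi" is only an example.
pod_spec = {
    "containers": [
        {"name": "tgi", "image": "ghcr.io/huggingface/text-generation-inference:3.3.5"}
    ]
}
print(json.dumps(add_shm(pod_spec, "tgi"), indent=2))
```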

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
this will impact performance.

### Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address to an OTLP collector with the `--otlp-endpoint` argument. The default service name can be
overridden with the `--otlp-service-name` argument.

### Architecture

![TGI architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/TGI.png)

Detailed blog post by Adyen on TGI's inner workings: [LLM inference at scale with TGI (Martin Iglesias Goyanes - Adyen, 2024)](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)

### Local install

You can also opt to install `text-generation-inference` locally.

First clone the repository and change directory into it:

```shell
git clone https://github.com/huggingface/text-generation-inference
cd text-generation-inference
```

Then [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using `conda` or `python venv`:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# using conda
conda create -n text-generation-inference python=3.11
conda activate text-generation-inference

# using python venv
python3 -m venv .venv
source .venv/bin/activate
```

You may also need to install `protoc`, the Protocol Buffers compiler.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On macOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

**Note:** On some machines, you may also need the OpenSSL libraries and gcc. On Linux distributions that use `apt-get` (such as Debian or Ubuntu), run:

```shell
sudo apt-get install libssl-dev gcc -y
```

### Local install (Nix)

Another option is to install `text-generation-inference` locally using [Nix](https://nixos.org). Currently,
we only support Nix on x86_64 Linux with CUDA GPUs. When using Nix, all dependencies can
be pulled from a binary cache, removing the need to build them locally.

First follow the instructions to [install Cachix and enable the Hugging Face cache](https://app.cachix.org/cache/huggingface).
Setting up the cache is important: without it, Nix will build many of the dependencies
locally, which can take hours.

After that you can run TGI with `nix run`:

```shell
cd text-generation-inference
nix run --extra-experimental-features nix-command --extra-experimental-features flakes . -- --model-id meta-llama/Llama-3.1-8B-Instruct
```

**Note:** when you are using Nix on a non-NixOS system, you have to [make some symlinks](https://danieldk.eu/Nix-CUDA-on-non-NixOS-systems#make-runopengl-driverlib-and-symlink-the-driver-library)
to make the CUDA driver libraries visible to Nix packages.

For TGI development, you can use the `impure` dev shell:

```shell
nix develop .#impure

# Only needed the first time the devshell is started or after updating the protobuf.
(
cd server
mkdir text_generation_server/pb || true
python -m grpc_tools.protoc -I../proto/v3 --python_out=text_generation_server/pb \
       --grpc_python_out=text_generation_server/pb --mypy_out=text_generation_server/pb ../proto/v3/generate.proto
find text_generation_server/pb/ -type f -name "*.py" -print0 -exec sed -i -e 's/^\(import.*pb2\)/from . \1/g' {} \;
touch text_generation_server/pb/__init__.py
)
```

All development dependencies (cargo, Python, Torch, etc.) are available in this
dev shell.

## Optimized architectures

TGI works out of the box with optimized implementations of many modern model architectures. They can be found in [this list](https://huggingface.co/docs/text-generation-inference/supported_models).

Other architectures are supported on a best-effort basis using:

`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`

or

`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`



## Run locally

### Run

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

### Quantization

To reduce the VRAM requirement, you can also run pre-quantized weights (AWQ, GPTQ, Marlin) or quantize weights on the fly with bitsandbytes, EETQ, or fp8:

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes-nf4
```

4-bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.

Read more about quantization in the [Quantization documentation](https://huggingface.co/docs/text-generation-inference/en/conceptual/quantization).
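
To get a feel for why quantization lowers the VRAM requirement, the back-of-the-envelope arithmetic is: weight memory is roughly the parameter count times the bits per weight, divided by 8. The sketch below applies that to a round, illustrative figure of 7 billion parameters. It covers the weights only; the KV cache, activations, and quantization metadata such as scales add to the total. The function name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# 7e9 is a round, illustrative parameter count, not an exact model size.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(7e9, bits):.1f} GB")
# 16-bit: 14.0 GB
# 8-bit: 7.0 GB
# 4-bit: 3.5 GB
```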

## Develop

```shell
make server-dev
make router-dev
```

## Testing

```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```


================================================
FILE: assets/tgi_grafana.json
================================================
{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS_EKS API INFERENCE PROD",
      "label": "Prometheus EKS API Inference Prod",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ],
  "__elements": {},
  "__requires": [
    {
      "type": "panel",
      "id": "gauge",
      "name": "Gauge",
      "version": ""
    },
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "10.0.2"
    },
    {
      "type": "panel",
      "id": "heatmap",
      "name": "Heatmap",
      "version": ""
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    },
    {
      "type": "panel",
      "id": "timeseries",
      "name": "Time series",
      "version": ""
    }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 2,
  "id": 551,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "fieldMinMax": false,
          "mappings": [],
          "min": 0,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 1000
              }
            ]
          },
          "unit": "ms"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 7,
        "w": 8,
        "x": 0,
        "y": 0
      },
      "id": 49,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "showPercentChange": false,
        "textMode": "auto",
        "wideLayout": true
      },
      "pluginVersion": "10.4.2",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "(histogram_quantile(0.5, sum by (le) (rate(tgi_request_queue_duration_bucket{container=\"$service\"}[10m]))) * 1000) > 0",
          "hide": true,
          "instant": false,
          "legendFormat": "__auto",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "(histogram_quantile(0.5, sum by (le) (rate(tgi_batch_inference_duration_bucket{method=\"prefill\", container=\"$service\"}[10m]))) * 1000) > 0",
          "hide": true,
          "instant": false,
          "legendFormat": "__auto",
          "range": true,
          "refId": "C"
        },
        {
          "datasource": {
            "name": "Expression",
            "type": "__expr__",
            "uid": "__expr__"
          },
          "expression": "$B + $C",
          "hide": false,
          "refId": "D",
          "type": "math"
        }
      ],
      "title": "Time to first token",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "min": 0,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "ms"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 7,
        "w": 8,
        "x": 9,
        "y": 0
      },
      "id": 44,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "showPercentChange": false,
        "textMode": "auto",
        "wideLayout": true
      },
      "pluginVersion": "10.4.2",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "(histogram_quantile(0.5, sum by (le) (rate(tgi_batch_forward_duration_bucket{method=\"decode\", container=\"$service\"}[10m]))) * 1000)>0",
          "instant": false,
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Decode per-token latency",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "min": 0,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 7,
        "w": 7,
        "x": 17,
        "y": 0
      },
      "id": 45,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "showPercentChange": false,
        "textMode": "auto",
        "wideLayout": true
      },
      "pluginVersion": "10.4.2",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "sum((rate(tgi_request_generated_tokens_sum{container=\"$service\"}[10m]) / rate(tgi_request_generated_tokens_count{container=\"$service\"}[10m]))>0)",
          "instant": false,
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Throughput (generated tok/s)",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "none"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "p50"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "green",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p90"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "orange",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p99"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "red",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 7
      },
      "id": 48,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max"
          ],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.5, sum by (le) (rate(tgi_request_input_length_bucket{container=\"$service\"}[10m])))",
          "legendFormat": "p50",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.9, sum by (le) (rate(tgi_request_input_length_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p90",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.99, sum by (le) (rate(tgi_request_input_length_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p99",
          "range": true,
          "refId": "C"
        }
      ],
      "title": "Number of tokens per prompt",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "none"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "p50"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "green",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p90"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "orange",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p99"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "red",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 7
      },
      "id": 30,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max"
          ],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.5, sum by (le) (rate(tgi_request_generated_tokens_bucket{container=\"$service\"}[10m])))",
          "legendFormat": "p50",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.9, sum by (le) (rate(tgi_request_generated_tokens_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p90",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.99, sum by (le) (rate(tgi_request_generated_tokens_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p99",
          "range": true,
          "refId": "C"
        }
      ],
      "title": "Number of generated tokens per request",
      "type": "timeseries"
    },
    {
      "collapsed": false,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 15
      },
      "id": 20,
      "panels": [],
      "title": "General",
      "type": "row"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 30,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 6,
        "x": 0,
        "y": 16
      },
      "id": 4,
      "maxDataPoints": 100,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max"
          ],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "sum(increase(tgi_request_success{container=\"$service\"}[1m]))",
          "legendFormat": "Success",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "sum(increase(tgi_request_failure{container=\"$service\"}[1m])) by (err)",
          "hide": false,
          "legendFormat": "Error: {{err}}",
          "range": true,
          "refId": "B"
        }
      ],
      "title": "Requests",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "s"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "p50"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "green",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p90"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "orange",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p99"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "red",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 13,
        "w": 9,
        "x": 6,
        "y": 16
      },
      "id": 6,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max"
          ],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.5, sum by (le) (rate(tgi_request_mean_time_per_token_duration_bucket{container=\"$service\"}[10m])))",
          "legendFormat": "p50",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.9, sum by (le) (rate(tgi_request_mean_time_per_token_duration_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p90",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.99, sum by (le) (rate(tgi_request_mean_time_per_token_duration_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p99",
          "range": true,
          "refId": "C"
        }
      ],
      "title": "Mean Time Per Token quantiles",
      "type": "timeseries"
    },
    {
      "cards": {},
      "color": {
        "cardColor": "#5794F2",
        "colorScale": "linear",
        "colorScheme": "interpolateSpectral",
        "exponent": 0.5,
        "min": 0,
        "mode": "opacity"
      },
      "dataFormat": "tsbuckets",
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "custom": {
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "scaleDistribution": {
              "type": "linear"
            }
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 13,
        "w": 9,
        "x": 15,
        "y": 16
      },
      "heatmap": {},
      "hideZeroBuckets": false,
      "highlightCards": true,
      "id": 13,
      "legend": {
        "show": false
      },
      "maxDataPoints": 25,
      "options": {
        "calculate": false,
        "calculation": {},
        "cellGap": 2,
        "cellValues": {},
        "color": {
          "exponent": 0.5,
          "fill": "#5794F2",
          "min": 0,
          "mode": "scheme",
          "reverse": false,
          "scale": "exponential",
          "scheme": "Spectral",
          "steps": 128
        },
        "exemplars": {
          "color": "rgba(255,0,255,0.7)"
        },
        "filterValues": {
          "le": 1e-9
        },
        "legend": {
          "show": false
        },
        "rowsFrame": {
          "layout": "auto"
        },
        "showValue": "never",
        "tooltip": {
          "mode": "single",
          "showColorScale": false,
          "yHistogram": false
        },
        "yAxis": {
          "axisPlacement": "left",
          "decimals": 1,
          "reverse": false,
          "unit": "s"
        }
      },
      "pluginVersion": "10.4.2",
      "reverseYBuckets": false,
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "exemplar": true,
          "expr": "sum(increase(tgi_request_mean_time_per_token_duration_bucket{container=\"$service\"}[5m])) by (le)",
          "format": "heatmap",
          "interval": "",
          "legendFormat": "{{ le }}",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Mean Time Per Token",
      "tooltip": {
        "show": true,
        "showHistogram": false
      },
      "type": "heatmap",
      "xAxis": {
        "show": true
      },
      "yAxis": {
        "decimals": 1,
        "format": "s",
        "logBase": 1,
        "show": true
      },
      "yBucketBound": "auto"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "percentage",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "orange",
                "value": 70
              },
              {
                "color": "red",
                "value": 85
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 5,
        "w": 3,
        "x": 0,
        "y": 24
      },
      "id": 18,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": false
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "pluginVersion": "9.1.0",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "count(tgi_request_count{container=\"$service\"})",
          "legendFormat": "Replicas",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Number of replicas",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "mappings": [],
          "thresholds": {
            "mode": "percentage",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "orange",
                "value": 70
              },
              {
                "color": "red",
                "value": 85
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 5,
        "w": 3,
        "x": 3,
        "y": 24
      },
      "id": 32,
      "options": {
        "minVizHeight": 75,
        "minVizWidth": 75,
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true,
        "sizing": "auto"
      },
      "pluginVersion": "10.4.2",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "sum(tgi_queue_size{container=\"$service\"})",
          "legendFormat": "__auto",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Queue Size",
      "type": "gauge"
    },
    {
      "collapsed": false,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 29
      },
      "id": 26,
      "panels": [],
      "title": "Batching",
      "type": "row"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "bars",
            "fillOpacity": 50,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "normal"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 5,
        "w": 6,
        "x": 0,
        "y": 30
      },
      "id": 29,
      "maxDataPoints": 40,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": false
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "pluginVersion": "9.1.0",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "avg(tgi_batch_current_max_tokens{container=\"$service\"})",
          "legendFormat": "Average",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Max tokens per batch",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "none"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "p50"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "green",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p90"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "orange",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p99"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "red",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 9,
        "w": 4,
        "x": 6,
        "y": 30
      },
      "id": 33,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max"
          ],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.5, sum by (le) (rate(tgi_request_skipped_tokens_bucket{container=\"$service\"}[10m])))",
          "legendFormat": "p50",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.9, sum by (le) (rate(tgi_request_skipped_tokens_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p90",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.99, sum by (le) (rate(tgi_request_skipped_tokens_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p99",
          "range": true,
          "refId": "C"
        }
      ],
      "title": "Speculated Tokens",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "none"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "p50"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "green",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p90"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "orange",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p99"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "red",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 9,
        "w": 5,
        "x": 10,
        "y": 30
      },
      "id": 46,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max"
          ],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.5, sum by (le) (rate(tgi_request_input_length_bucket{container=\"$service\"}[10m])))",
          "legendFormat": "p50",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.9, sum by (le) (rate(tgi_request_input_length_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p90",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.99, sum by (le) (rate(tgi_request_input_length_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p99",
          "range": true,
          "refId": "C"
        }
      ],
      "title": "Prompt Tokens",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "s"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "p50"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "green",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p90"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "orange",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p99"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "red",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 9,
        "w": 9,
        "x": 15,
        "y": 30
      },
      "id": 8,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max"
          ],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.5, sum by (le) (rate(tgi_request_duration_bucket{container=\"$service\"}[10m])))",
          "legendFormat": "p50",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.9, sum by (le) (rate(tgi_request_duration_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p90",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.99, sum by (le) (rate(tgi_request_duration_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p99",
          "range": true,
          "refId": "C"
        }
      ],
      "title": "Latency quantiles",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "bars",
            "fillOpacity": 50,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "normal"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 6,
        "x": 0,
        "y": 35
      },
      "id": 27,
      "maxDataPoints": 40,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": false
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "pluginVersion": "9.1.0",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "avg(tgi_batch_current_size{container=\"$service\"})",
          "legendFormat": "Average",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Batch Size",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 30,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 6,
        "x": 0,
        "y": 39
      },
      "id": 28,
      "maxDataPoints": 100,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max"
          ],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "sum(increase(tgi_batch_concat{container=\"$service\"}[1m])) by (reason)",
          "hide": false,
          "legendFormat": "Reason: {{ reason }}",
          "range": true,
          "refId": "B"
        }
      ],
      "title": "Concatenates",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "s"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "p50"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "green",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p90"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "orange",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p99"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "red",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 9,
        "w": 9,
        "x": 6,
        "y": 39
      },
      "id": 31,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max"
          ],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.5, sum by (le) (rate(tgi_request_queue_duration_bucket{container=\"$service\"}[10m])))",
          "legendFormat": "p50",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.9, sum by (le) (rate(tgi_request_queue_duration_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p90",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.99, sum by (le) (rate(tgi_request_queue_duration_bucket{container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p99",
          "range": true,
          "refId": "C"
        }
      ],
      "title": "Queue quantiles",
      "type": "timeseries"
    },
    {
      "collapsed": false,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 48
      },
      "id": 22,
      "panels": [],
      "title": "Prefill",
      "type": "row"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "s"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "p50"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "green",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p90"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "orange",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p99"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "red",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 11,
        "w": 12,
        "x": 0,
        "y": 49
      },
      "id": 7,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max"
          ],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.5, sum by (le) (rate(tgi_batch_inference_duration_bucket{method=\"prefill\", container=\"$service\"}[10m])))",
          "legendFormat": "p50",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.9, sum by (le) (rate(tgi_batch_inference_duration_bucket{method=\"prefill\", container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p90",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.99, sum by (le) (rate(tgi_batch_inference_duration_bucket{method=\"prefill\", container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p99",
          "range": true,
          "refId": "C"
        }
      ],
      "title": "Prefill Quantiles",
      "type": "timeseries"
    },
    {
      "cards": {},
      "color": {
        "cardColor": "#5794F2",
        "colorScale": "linear",
        "colorScheme": "interpolateSpectral",
        "exponent": 0.5,
        "min": 0,
        "mode": "opacity"
      },
      "dataFormat": "tsbuckets",
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "custom": {
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "scaleDistribution": {
              "type": "linear"
            }
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 11,
        "w": 12,
        "x": 12,
        "y": 49
      },
      "heatmap": {},
      "hideZeroBuckets": false,
      "highlightCards": true,
      "id": 14,
      "legend": {
        "show": false
      },
      "maxDataPoints": 25,
      "options": {
        "calculate": false,
        "calculation": {},
        "cellGap": 2,
        "cellValues": {},
        "color": {
          "exponent": 0.5,
          "fill": "#5794F2",
          "min": 0,
          "mode": "scheme",
          "reverse": false,
          "scale": "exponential",
          "scheme": "Spectral",
          "steps": 128
        },
        "exemplars": {
          "color": "rgba(255,0,255,0.7)"
        },
        "filterValues": {
          "le": 1e-9
        },
        "legend": {
          "show": false
        },
        "rowsFrame": {
          "layout": "auto"
        },
        "showValue": "never",
        "tooltip": {
          "mode": "single",
          "showColorScale": false,
          "yHistogram": false
        },
        "yAxis": {
          "axisPlacement": "left",
          "decimals": 1,
          "reverse": false,
          "unit": "s"
        }
      },
      "pluginVersion": "10.4.2",
      "reverseYBuckets": false,
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "exemplar": true,
          "expr": "sum(increase(tgi_batch_inference_duration_bucket{method=\"prefill\", container=\"$service\"}[5m])) by (le)",
          "format": "heatmap",
          "interval": "",
          "legendFormat": "{{ le }}",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Prefill Latency",
      "tooltip": {
        "show": true,
        "showHistogram": false
      },
      "type": "heatmap",
      "xAxis": {
        "show": true
      },
      "yAxis": {
        "decimals": 1,
        "format": "s",
        "logBase": 1,
        "show": true
      },
      "yBucketBound": "auto"
    },
    {
      "collapsed": false,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 60
      },
      "id": 24,
      "panels": [],
      "title": "Decode",
      "type": "row"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "s"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "p50"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "green",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p90"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "orange",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "p99"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "red",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 11,
        "w": 12,
        "x": 0,
        "y": 61
      },
      "id": 11,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max"
          ],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.5, sum by (le) (rate(tgi_batch_inference_duration_bucket{method=\"decode\", container=\"$service\"}[10m])))",
          "legendFormat": "p50",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS_EKS API INFERENCE PROD}"
          },
          "editorMode": "code",
          "expr": "histogram_quantile(0.9, sum by (le) (rate(tgi_batch_inference_duration_bucket{method=\"decode\", container=\"$service\"}[10m])))",
          "hide": false,
          "legendFormat": "p90",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
   
gitextract_uvlvpncm/

├── .dockerignore
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug-report.yml
│   │   ├── config.yml
│   │   ├── feature-request.yml
│   │   └── new-model-addition.yml
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── workflows/
│       ├── autodocs.yaml
│       ├── build.yaml
│       ├── build_documentation.yaml
│       ├── build_pr_documentation.yaml
│       ├── ci_build.yaml
│       ├── client-tests.yaml
│       ├── codeql.yml
│       ├── integration_tests.yaml
│       ├── load_test.yaml
│       ├── nix_build.yaml
│       ├── nix_cache.yaml
│       ├── nix_tests.yaml
│       ├── stale.yaml
│       ├── tests.yaml
│       ├── trufflehog.yaml
│       └── upload_pr_documentation.yaml
├── .gitignore
├── .pre-commit-config.yaml
├── .redocly.lint-ignore.yaml
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Cargo.toml
├── Dockerfile
├── Dockerfile.neuron
├── Dockerfile.nix
├── Dockerfile_amd
├── Dockerfile_gaudi
├── Dockerfile_intel
├── Dockerfile_llamacpp
├── Dockerfile_trtllm
├── LICENSE
├── Makefile
├── README.md
├── assets/
│   └── tgi_grafana.json
├── backends/
│   ├── client/
│   │   ├── Cargo.toml
│   │   ├── build.rs
│   │   └── src/
│   │       ├── lib.rs
│   │       ├── v2/
│   │       │   ├── client.rs
│   │       │   ├── mod.rs
│   │       │   └── sharded_client.rs
│   │       └── v3/
│   │           ├── client.rs
│   │           ├── mod.rs
│   │           └── sharded_client.rs
│   ├── gaudi/
│   │   ├── Makefile
│   │   ├── README.md
│   │   ├── examples/
│   │   │   └── docker_commands/
│   │   │       └── docker_commands.md
│   │   ├── server/
│   │   │   ├── .gitignore
│   │   │   ├── Makefile
│   │   │   ├── Makefile-awq
│   │   │   ├── Makefile-eetq
│   │   │   ├── Makefile-fbgemm
│   │   │   ├── Makefile-flash-att
│   │   │   ├── Makefile-flash-att-v2
│   │   │   ├── Makefile-selective-scan
│   │   │   ├── Makefile-vllm
│   │   │   ├── README.md
│   │   │   ├── dill-0.3.7-patch.sh
│   │   │   ├── dill-0.3.8-patch.sh
│   │   │   ├── pyproject.toml
│   │   │   ├── requirements.txt
│   │   │   └── text_generation_server/
│   │   │       ├── __init__.py
│   │   │       ├── adapters/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── config.py
│   │   │       │   ├── lora.py
│   │   │       │   └── weights.py
│   │   │       ├── cache.py
│   │   │       ├── cli.py
│   │   │       ├── interceptor.py
│   │   │       ├── layers/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── attention/
│   │   │       │   │   ├── __init__.py
│   │   │       │   │   ├── common.py
│   │   │       │   │   ├── hpu.py
│   │   │       │   │   └── kv_cache.py
│   │   │       │   ├── awq/
│   │   │       │   │   ├── conversion_utils.py
│   │   │       │   │   └── quantize/
│   │   │       │   │       ├── __init__.py
│   │   │       │   │       └── hpu.py
│   │   │       │   ├── bnb.py
│   │   │       │   ├── compressed_tensors/
│   │   │       │   │   ├── __init__.py
│   │   │       │   │   ├── loader.py
│   │   │       │   │   └── w8an_fp.py
│   │   │       │   ├── conv.py
│   │   │       │   ├── exl2.py
│   │   │       │   ├── fp8.py
│   │   │       │   ├── gptq/
│   │   │       │   │   ├── __init__.py
│   │   │       │   │   ├── hpu.py
│   │   │       │   │   ├── quantize.py
│   │   │       │   │   └── utils.py
│   │   │       │   ├── layernorm.py
│   │   │       │   ├── linear.py
│   │   │       │   ├── lora.py
│   │   │       │   ├── medusa.py
│   │   │       │   ├── mlp.py
│   │   │       │   ├── moe/
│   │   │       │   │   ├── __init__.py
│   │   │       │   │   ├── fp8.py
│   │   │       │   │   ├── fused_moe.py
│   │   │       │   │   └── unquantized.py
│   │   │       │   ├── rotary.py
│   │   │       │   ├── speculative.py
│   │   │       │   └── tensor_parallel.py
│   │   │       ├── models/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── custom_modeling/
│   │   │       │   │   ├── __init__.py
│   │   │       │   │   ├── bloom_modeling.py
│   │   │       │   │   ├── clip.py
│   │   │       │   │   ├── flash_cohere_modeling.py
│   │   │       │   │   ├── flash_dbrx_modeling.py
│   │   │       │   │   ├── flash_deepseek_v2_modeling.py
│   │   │       │   │   ├── flash_deepseek_v3_modeling.py
│   │   │       │   │   ├── flash_gemma2_modeling.py
│   │   │       │   │   ├── flash_gemma3_modeling.py
│   │   │       │   │   ├── flash_gemma_modeling.py
│   │   │       │   │   ├── flash_gpt2_modeling.py
│   │   │       │   │   ├── flash_gptj_modeling.py
│   │   │       │   │   ├── flash_llama4_modeling.py
│   │   │       │   │   ├── flash_llama_modeling.py
│   │   │       │   │   ├── flash_llava_next.py
│   │   │       │   │   ├── flash_mistral_modeling.py
│   │   │       │   │   ├── flash_mixtral_modeling.py
│   │   │       │   │   ├── flash_mllama.py
│   │   │       │   │   ├── flash_neox_modeling.py
│   │   │       │   │   ├── flash_pali_gemma_modeling.py
│   │   │       │   │   ├── flash_phi_modeling.py
│   │   │       │   │   ├── flash_phi_moe_modeling.py
│   │   │       │   │   ├── flash_qwen2_modeling.py
│   │   │       │   │   ├── flash_qwen3_modeling.py
│   │   │       │   │   ├── flash_qwen3_moe_modeling.py
│   │   │       │   │   ├── flash_rw_modeling.py
│   │   │       │   │   ├── flash_santacoder_modeling.py
│   │   │       │   │   ├── flash_starcoder2_modeling.py
│   │   │       │   │   ├── idefics2.py
│   │   │       │   │   ├── idefics3.py
│   │   │       │   │   ├── mamba_modeling.py
│   │   │       │   │   ├── qwen2_5_vl.py
│   │   │       │   │   ├── qwen2_vl.py
│   │   │       │   │   ├── siglip.py
│   │   │       │   │   └── vlm.py
│   │   │       │   ├── flash_causal_lm.py
│   │   │       │   ├── flash_vlm_causal_lm.py
│   │   │       │   ├── globals.py
│   │   │       │   ├── mllama_causal_lm.py
│   │   │       │   ├── model.py
│   │   │       │   ├── seq2seq_lm.py
│   │   │       │   └── types.py
│   │   │       ├── pb/
│   │   │       │   └── .gitignore
│   │   │       ├── server.py
│   │   │       ├── tracing.py
│   │   │       └── utils/
│   │   │           ├── __init__.py
│   │   │           ├── adapter.py
│   │   │           ├── chunks.py
│   │   │           ├── convert.py
│   │   │           ├── debug.py
│   │   │           ├── dist.py
│   │   │           ├── hub.py
│   │   │           ├── import_utils.py
│   │   │           ├── kernels.py
│   │   │           ├── log.py
│   │   │           ├── logits_process.py
│   │   │           ├── merges/
│   │   │           │   ├── strategies.py
│   │   │           │   └── utils.py
│   │   │           ├── peft.py
│   │   │           ├── prefill_chunking.py
│   │   │           ├── quantization.py
│   │   │           ├── segments.py
│   │   │           ├── sgmv.py
│   │   │           ├── speculate.py
│   │   │           ├── tokens.py
│   │   │           ├── version.py
│   │   │           ├── watermark.py
│   │   │           └── weights.py
│   │   └── tgi-entrypoint.sh
│   ├── grpc-metadata/
│   │   ├── Cargo.toml
│   │   └── src/
│   │       └── lib.rs
│   ├── llamacpp/
│   │   ├── Cargo.toml
│   │   ├── README.md
│   │   ├── build.rs
│   │   ├── requirements.txt
│   │   └── src/
│   │       ├── backend.rs
│   │       ├── llamacpp.rs
│   │       ├── main.rs
│   │       └── quantize.rs
│   ├── neuron/
│   │   ├── Cargo.toml
│   │   ├── Makefile
│   │   ├── README.md
│   │   ├── server/
│   │   │   ├── .gitignore
│   │   │   ├── Makefile
│   │   │   ├── build-requirements.txt
│   │   │   ├── pyproject.toml
│   │   │   └── text_generation_server/
│   │   │       ├── cli.py
│   │   │       ├── generator.py
│   │   │       ├── interceptor.py
│   │   │       ├── model.py
│   │   │       ├── server.py
│   │   │       └── tgi_env.py
│   │   ├── tests/
│   │   │   ├── conftest.py
│   │   │   ├── fixtures/
│   │   │   │   └── model.py
│   │   │   ├── prune_test_models.py
│   │   │   ├── pytest.ini
│   │   │   ├── requirements.txt
│   │   │   ├── server/
│   │   │   │   ├── helpers.py
│   │   │   │   ├── test_cached_model.py
│   │   │   │   ├── test_continuous_batching.py
│   │   │   │   ├── test_decode.py
│   │   │   │   ├── test_generator_slot.py
│   │   │   │   ├── test_info.py
│   │   │   │   └── test_prefill.py
│   │   │   └── test_entry_point.py
│   │   ├── tgi-entrypoint.sh
│   │   └── tgi_entry_point.py
│   ├── trtllm/
│   │   ├── CMakeLists.txt
│   │   ├── Cargo.toml
│   │   ├── README.md
│   │   ├── build.rs
│   │   ├── cmake/
│   │   │   ├── json.cmake
│   │   │   ├── spdlog.cmake
│   │   │   ├── trtllm.cmake
│   │   │   └── utils/
│   │   │       └── detect_cuda_arch.cu
│   │   ├── csrc/
│   │   │   ├── backend.cpp
│   │   │   ├── backend.hpp
│   │   │   ├── ffi.hpp
│   │   │   └── hardware.hpp
│   │   ├── scripts/
│   │   │   ├── install_tensorrt.sh
│   │   │   └── setup_sccache.py
│   │   ├── src/
│   │   │   ├── errors.rs
│   │   │   ├── lib.rs
│   │   │   ├── looper.rs
│   │   │   ├── main.rs
│   │   │   └── utils.rs
│   │   └── tests/
│   │       ├── test_backend.cpp
│   │       └── test_hardware.cpp
│   ├── v2/
│   │   ├── Cargo.toml
│   │   ├── build.rs
│   │   └── src/
│   │       ├── backend.rs
│   │       ├── client/
│   │       │   ├── grpc_client.rs
│   │       │   ├── mod.rs
│   │       │   └── sharded_client.rs
│   │       ├── lib.rs
│   │       ├── main.rs
│   │       └── queue.rs
│   └── v3/
│       ├── Cargo.toml
│       ├── benches/
│       │   └── prefix_cache.rs
│       ├── build.rs
│       └── src/
│           ├── backend.rs
│           ├── block_allocator.rs
│           ├── client/
│           │   ├── grpc_client.rs
│           │   ├── mod.rs
│           │   └── sharded_client.rs
│           ├── lib.rs
│           ├── main.rs
│           ├── queue.rs
│           └── radix.rs
├── benchmark/
│   ├── Cargo.toml
│   ├── README.md
│   └── src/
│       ├── app.rs
│       ├── event.rs
│       ├── generation.rs
│       ├── lib.rs
│       ├── main.rs
│       ├── table.rs
│       └── utils.rs
├── clients/
│   └── python/
│       ├── .gitignore
│       ├── Makefile
│       ├── README.md
│       ├── pyproject.toml
│       ├── tests/
│       │   ├── conftest.py
│       │   ├── test_client.py
│       │   ├── test_errors.py
│       │   ├── test_inference_api.py
│       │   └── test_types.py
│       └── text_generation/
│           ├── __init__.py
│           ├── client.py
│           ├── errors.py
│           ├── inference_api.py
│           └── types.py
├── crate-hashes.json
├── docs/
│   ├── README.md
│   ├── index.html
│   ├── openapi.json
│   └── source/
│       ├── _toctree.yml
│       ├── architecture.md
│       ├── backends/
│       │   ├── gaudi.mdx
│       │   ├── llamacpp.md
│       │   ├── neuron.md
│       │   └── trtllm.md
│       ├── basic_tutorials/
│       │   ├── consuming_tgi.md
│       │   ├── gated_model_access.md
│       │   ├── monitoring.md
│       │   ├── non_core_models.md
│       │   ├── preparing_model.md
│       │   ├── safety.md
│       │   ├── train_medusa.md
│       │   ├── using_cli.md
│       │   ├── using_guidance.md
│       │   └── visual_language_models.md
│       ├── conceptual/
│       │   ├── chunking.md
│       │   ├── external.md
│       │   ├── flash_attention.md
│       │   ├── guidance.md
│       │   ├── lora.md
│       │   ├── paged_attention.md
│       │   ├── quantization.md
│       │   ├── safetensors.md
│       │   ├── speculation.md
│       │   ├── streaming.md
│       │   └── tensor_parallelism.md
│       ├── index.md
│       ├── installation.md
│       ├── installation_amd.md
│       ├── installation_gaudi.md
│       ├── installation_inferentia.md
│       ├── installation_intel.md
│       ├── installation_nvidia.md
│       ├── installation_tpu.md
│       ├── multi_backend_support.md
│       ├── quicktour.md
│       ├── reference/
│       │   ├── api_reference.md
│       │   ├── launcher.md
│       │   └── metrics.md
│       ├── supported_models.md
│       └── usage_statistics.md
├── flake.nix
├── integration-tests/
│   ├── conftest.py
│   ├── fixtures/
│   │   ├── gaudi/
│   │   │   └── service.py
│   │   └── neuron/
│   │       ├── export_models.py
│   │       └── service.py
│   ├── gaudi/
│   │   ├── capture_expected_outputs.py
│   │   └── test_gaudi_generate.py
│   ├── models/
│   │   ├── __snapshots__/
│   │   │   ├── test.py
│   │   │   ├── test_bloom_560m/
│   │   │   │   ├── test_bloom_560m.json
│   │   │   │   ├── test_bloom_560m_all_params.json
│   │   │   │   └── test_bloom_560m_load.json
│   │   │   ├── test_bloom_560m_sharded/
│   │   │   │   ├── test_bloom_560m_sharded.json
│   │   │   │   └── test_bloom_560m_sharded_load.json
│   │   │   ├── test_chat_llama/
│   │   │   │   └── test_flash_llama_simple.json
│   │   │   ├── test_completion_prompts/
│   │   │   │   ├── test_chat_hfhub_nousage.json
│   │   │   │   ├── test_chat_hfhub_usage.json
│   │   │   │   ├── test_chat_openai_nousage.json
│   │   │   │   ├── test_chat_openai_usage.json
│   │   │   │   ├── test_flash_llama_completion_many_prompts.json
│   │   │   │   ├── test_flash_llama_completion_many_prompts_stream.json
│   │   │   │   ├── test_flash_llama_completion_single_prompt.json
│   │   │   │   └── test_flash_llama_completion_stream_usage.json
│   │   │   ├── test_compressed_tensors_w8a8_int/
│   │   │   │   ├── test_compressed_tensors_w8a8_int.json
│   │   │   │   ├── test_compressed_tensors_w8a8_int_all_params.json
│   │   │   │   └── test_compressed_tensors_w8a8_int_load.json
│   │   │   ├── test_compressed_tensors_w8a8_int_dynamic_weight/
│   │   │   │   ├── test_compressed_tensors_w8a8_int_dynamic_weight.json
│   │   │   │   ├── test_compressed_tensors_w8a8_int_dynamic_weight_all_params.json
│   │   │   │   └── test_compressed_tensors_w8a8_int_dynamic_weight_load.json
│   │   │   ├── test_compressed_tensors_w8an_fp/
│   │   │   │   ├── test_compressed_tensors_w8an.json
│   │   │   │   ├── test_compressed_tensors_w8an_all_params.json
│   │   │   │   └── test_compressed_tensors_w8an_load.json
│   │   │   ├── test_compressed_tensors_wna16_int/
│   │   │   │   ├── test_compressed_tensors_wna16.json
│   │   │   │   ├── test_compressed_tensors_wna16_all_params.json
│   │   │   │   └── test_compressed_tensors_wna16_load.json
│   │   │   ├── test_compressed_tensors_wna16_int_24/
│   │   │   │   ├── test_compressed_tensors_wna16_int_24.json
│   │   │   │   ├── test_compressed_tensors_wna16_int_24_all_params.json
│   │   │   │   └── test_compressed_tensors_wna16_int_24_load.json
│   │   │   ├── test_continue_final_message/
│   │   │   │   ├── test_llama_completion_single_prompt.json
│   │   │   │   └── test_llama_completion_single_prompt_continue.json
│   │   │   ├── test_flash_awq/
│   │   │   │   ├── test_flash_llama_awq.json
│   │   │   │   ├── test_flash_llama_awq_all_params.json
│   │   │   │   └── test_flash_llama_awq_load.json
│   │   │   ├── test_flash_awq_sharded/
│   │   │   │   ├── test_flash_llama_awq_load_sharded.json
│   │   │   │   └── test_flash_llama_awq_sharded.json
│   │   │   ├── test_flash_deepseek_v2/
│   │   │   │   ├── test_flash_deepseek_v2.json
│   │   │   │   ├── test_flash_deepseek_v2_all_params.json
│   │   │   │   └── test_flash_deepseek_v2_load.json
│   │   │   ├── test_flash_falcon/
│   │   │   │   ├── test_flash_falcon.json
│   │   │   │   ├── test_flash_falcon_all_params.json
│   │   │   │   └── test_flash_falcon_load.json
│   │   │   ├── test_flash_gemma/
│   │   │   │   ├── test_flash_gemma_all_params.json
│   │   │   │   ├── test_flash_gemma_load.json
│   │   │   │   └── test_flash_gemma_simple.json
│   │   │   ├── test_flash_gemma2/
│   │   │   │   ├── test_flash_gemma2.json
│   │   │   │   └── test_flash_gemma2_load.json
│   │   │   ├── test_flash_gemma3/
│   │   │   │   ├── test_exceed_window.json
│   │   │   │   ├── test_flash_gemma3.json
│   │   │   │   ├── test_flash_gemma3_image_base64_rgb_jpg.json
│   │   │   │   ├── test_flash_gemma3_image_base64_rgb_png.json
│   │   │   │   ├── test_flash_gemma3_image_base64_rgba.json
│   │   │   │   ├── test_flash_gemma3_image_cow.json
│   │   │   │   └── test_flash_gemma3_image_cow_dog.json
│   │   │   ├── test_flash_gemma_gptq/
│   │   │   │   ├── test_flash_gemma_gptq.json
│   │   │   │   ├── test_flash_gemma_gptq_all_params.json
│   │   │   │   └── test_flash_gemma_gptq_load.json
│   │   │   ├── test_flash_gpt2/
│   │   │   │   ├── test_flash_gpt2.json
│   │   │   │   └── test_flash_gpt2_load.json
│   │   │   ├── test_flash_grammar_llama/
│   │   │   │   ├── test_flash_llama_grammar.json
│   │   │   │   ├── test_flash_llama_grammar_json.json
│   │   │   │   ├── test_flash_llama_grammar_load.json
│   │   │   │   ├── test_flash_llama_grammar_regex.json
│   │   │   │   └── test_flash_llama_grammar_single_load_instance.json
│   │   │   ├── test_flash_llama/
│   │   │   │   ├── test_flash_llama_all_params.json
│   │   │   │   ├── test_flash_llama_load.json
│   │   │   │   └── test_flash_llama_simple.json
│   │   │   ├── test_flash_llama_exl2/
│   │   │   │   ├── test_flash_llama_exl2.json
│   │   │   │   ├── test_flash_llama_exl2_all_params.json
│   │   │   │   └── test_flash_llama_exl2_load.json
│   │   │   ├── test_flash_llama_fp8/
│   │   │   │   ├── test_flash_llama_fp8.json
│   │   │   │   ├── test_flash_llama_fp8_all_params.json
│   │   │   │   └── test_flash_llama_fp8_load.json
│   │   │   ├── test_flash_llama_fp8_kv_cache/
│   │   │   │   ├── test_flash_llama_fp8_kv_cache.json
│   │   │   │   ├── test_flash_llama_fp8_kv_cache_all_params.json
│   │   │   │   └── test_flash_llama_fp8_kv_cache_load.json
│   │   │   ├── test_flash_llama_gptq/
│   │   │   │   ├── test_flash_llama_gptq.json
│   │   │   │   ├── test_flash_llama_gptq_all_params.json
│   │   │   │   └── test_flash_llama_gptq_load.json
│   │   │   ├── test_flash_llama_marlin/
│   │   │   │   ├── test_flash_llama_marlin.json
│   │   │   │   ├── test_flash_llama_marlin_all_params.json
│   │   │   │   └── test_flash_llama_marlin_load.json
│   │   │   ├── test_flash_llama_marlin_24/
│   │   │   │   ├── test_flash_llama_marlin.json
│   │   │   │   ├── test_flash_llama_marlin24_all_params.json
│   │   │   │   └── test_flash_llama_marlin24_load.json
│   │   │   ├── test_flash_llama_prefix/
│   │   │   │   └── test_flash_llama_load.json
│   │   │   ├── test_flash_llama_prefix_flashdecoding/
│   │   │   │   └── test_flash_llama_flashdecoding.json
│   │   │   ├── test_flash_medusa/
│   │   │   │   ├── test_flash_medusa_all_params.json
│   │   │   │   ├── test_flash_medusa_load.json
│   │   │   │   └── test_flash_medusa_simple.json
│   │   │   ├── test_flash_mistral/
│   │   │   │   ├── test_flash_mistral.json
│   │   │   │   ├── test_flash_mistral_all_params.json
│   │   │   │   └── test_flash_mistral_load.json
│   │   │   ├── test_flash_mixtral/
│   │   │   │   ├── test_flash_mixtral.json
│   │   │   │   ├── test_flash_mixtral_all_params.json
│   │   │   │   └── test_flash_mixtral_load.json
│   │   │   ├── test_flash_mixtral_awq/
│   │   │   │   ├── test_flash_mixtral_awq.json
│   │   │   │   ├── test_flash_mixtral_awq_all_params.json
│   │   │   │   └── test_flash_mixtral_awq_load.json
│   │   │   ├── test_flash_mixtral_gptq/
│   │   │   │   ├── test_flash_mixtral_gptq.json
│   │   │   │   ├── test_flash_mixtral_gptq_all_params.json
│   │   │   │   └── test_flash_mixtral_gptq_load.json
│   │   │   ├── test_flash_neox/
│   │   │   │   ├── test_flash_neox.json
│   │   │   │   └── test_flash_neox_load.json
│   │   │   ├── test_flash_neox_sharded/
│   │   │   │   ├── test_flash_neox.json
│   │   │   │   └── test_flash_neox_load.json
│   │   │   ├── test_flash_pali_gemma/
│   │   │   │   ├── test_flash_pali_gemma.json
│   │   │   │   └── test_flash_pali_gemma_two_images.json
│   │   │   ├── test_flash_pali_gemma2/
│   │   │   │   └── test_flash_pali_gemma_image.json
│   │   │   ├── test_flash_phi/
│   │   │   │   ├── test_flash_phi.json
│   │   │   │   ├── test_flash_phi_all_params.json
│   │   │   │   └── test_flash_phi_load.json
│   │   │   ├── test_flash_phi35_moe/
│   │   │   │   ├── test_flash_phi35_moe.json
│   │   │   │   ├── test_flash_phi35_moe_all_params.json
│   │   │   │   └── test_flash_phi35_moe_load.json
│   │   │   ├── test_flash_qwen2/
│   │   │   │   ├── test_flash_qwen2.json
│   │   │   │   ├── test_flash_qwen2_all_params.json
│   │   │   │   └── test_flash_qwen2_load.json
│   │   │   ├── test_flash_qwen2_5_vl/
│   │   │   │   ├── test_flash_qwen2_5_vl_bay.json
│   │   │   │   ├── test_flash_qwen2_5_vl_inpaint.json
│   │   │   │   ├── test_flash_qwen2_5_vl_simple.json
│   │   │   │   └── test_flash_qwen2_5_vl_simple_streaming.json
│   │   │   ├── test_flash_qwen2_vl/
│   │   │   │   ├── test_flash_qwen2_vl_bay.json
│   │   │   │   ├── test_flash_qwen2_vl_inpaint.json
│   │   │   │   ├── test_flash_qwen2_vl_simple.json
│   │   │   │   └── test_flash_qwen2_vl_simple_streaming.json
│   │   │   ├── test_flash_santacoder/
│   │   │   │   ├── test_flash_santacoder.json
│   │   │   │   └── test_flash_santacoder_load.json
│   │   │   ├── test_flash_starcoder/
│   │   │   │   ├── test_flash_starcoder.json
│   │   │   │   ├── test_flash_starcoder_default_params.json
│   │   │   │   └── test_flash_starcoder_load.json
│   │   │   ├── test_flash_starcoder2/
│   │   │   │   ├── test_flash_starcoder2.json
│   │   │   │   ├── test_flash_starcoder2_default_params.json
│   │   │   │   └── test_flash_starcoder2_load.json
│   │   │   ├── test_flash_starcoder2_lora/
│   │   │   │   ├── test_flash_starcoder2.json
│   │   │   │   ├── test_flash_starcoder2_default_params.json
│   │   │   │   ├── test_flash_starcoder2_load.json
│   │   │   │   └── test_flash_starcoder2_with_hugcode_adapter.json
│   │   │   ├── test_flash_starcoder_gptq/
│   │   │   │   ├── test_flash_starcoder_gptq.json
│   │   │   │   ├── test_flash_starcoder_gptq_default_params.json
│   │   │   │   └── test_flash_starcoder_gptq_load.json
│   │   │   ├── test_grammar_llama/
│   │   │   │   └── test_non_flash_llama_grammar_json.json
│   │   │   ├── test_grammar_response_format_llama/
│   │   │   │   ├── test_grammar_response_format_llama_json.1.json
│   │   │   │   ├── test_grammar_response_format_llama_json.2.json
│   │   │   │   └── test_grammar_response_format_llama_json.json
│   │   │   ├── test_idefics/
│   │   │   │   ├── test_idefics.json
│   │   │   │   ├── test_idefics_load.json
│   │   │   │   └── test_idefics_two_images.json
│   │   │   ├── test_idefics2/
│   │   │   │   ├── test_flash_idefics2_next_all_params.json
│   │   │   │   ├── test_flash_idefics2_next_load.json
│   │   │   │   ├── test_flash_idefics2_next_simple.json
│   │   │   │   └── test_flash_idefics2_two_images.json
│   │   │   ├── test_idefics3/
│   │   │   │   └── test_flash_idefics3_next_simple_url.json
│   │   │   ├── test_json_schema_constrain/
│   │   │   │   ├── test_json_schema_basic.json
│   │   │   │   ├── test_json_schema_complex.json
│   │   │   │   └── test_json_schema_stream.json
│   │   │   ├── test_llava_next/
│   │   │   │   ├── test_flash_llava_next_all_params.json
│   │   │   │   ├── test_flash_llava_next_load.json
│   │   │   │   └── test_flash_llava_next_simple.json
│   │   │   ├── test_lora_mistral/
│   │   │   │   ├── test_lora_mistral_with_customer_support_adapter.json
│   │   │   │   ├── test_lora_mistral_with_dbpedia_adapter.json
│   │   │   │   ├── test_lora_mistral_without_adapter.json
│   │   │   │   └── test_lora_mistral_without_customer_support_adapter.json
│   │   │   ├── test_mamba/
│   │   │   │   ├── test_mamba.json
│   │   │   │   ├── test_mamba_all_params.json
│   │   │   │   └── test_mamba_load.json
│   │   │   ├── test_mllama/
│   │   │   │   ├── test_mllama_load.json
│   │   │   │   └── test_mllama_simpl.json
│   │   │   ├── test_mpt/
│   │   │   │   ├── test_mpt.json
│   │   │   │   └── test_mpt_load.json
│   │   │   ├── test_mt0_base/
│   │   │   │   ├── test_mt0_base.json
│   │   │   │   ├── test_mt0_base_all_params.json
│   │   │   │   └── test_mt0_base_load.json
│   │   │   ├── test_neox/
│   │   │   │   ├── test_neox.json
│   │   │   │   └── test_neox_load.json
│   │   │   ├── test_neox_sharded/
│   │   │   │   ├── test_neox.json
│   │   │   │   └── test_neox_load.json
│   │   │   ├── test_server_gptq_quantized/
│   │   │   │   ├── test_server_gptq_quantized.json
│   │   │   │   ├── test_server_gptq_quantized_all_params.json
│   │   │   │   └── test_server_gptq_quantized_load.json
│   │   │   ├── test_smolvlm/
│   │   │   │   └── test_flash_smolvlm_next_simple_url.json
│   │   │   ├── test_t5_sharded/
│   │   │   │   ├── test_t5_sharded.json
│   │   │   │   └── test_t5_sharded_load.json
│   │   │   ├── test_tools_llama/
│   │   │   │   ├── test_flash_llama_grammar_tools_auto_nostream.json
│   │   │   │   ├── test_flash_llama_grammar_tools_choice_nostream.json
│   │   │   │   ├── test_flash_llama_grammar_tools_choice_stream.json
│   │   │   │   ├── test_flash_llama_grammar_tools_insufficient_information_nostream.json
│   │   │   │   ├── test_flash_llama_grammar_tools_insufficient_information_stream.json
│   │   │   │   ├── test_flash_llama_grammar_tools_nostream.json
│   │   │   │   ├── test_flash_llama_grammar_tools_openai.json
│   │   │   │   ├── test_flash_llama_grammar_tools_sea_creatures_stream_auto.json
│   │   │   │   ├── test_flash_llama_grammar_tools_sea_creatures_stream_function_object.json
│   │   │   │   ├── test_flash_llama_grammar_tools_sea_creatures_stream_none.json
│   │   │   │   ├── test_flash_llama_grammar_tools_sea_creatures_stream_required.json
│   │   │   │   └── test_flash_llama_tool_reply_response.json
│   │   │   ├── test_transformers_llama4/
│   │   │   │   ├── test_flash_llama4.json
│   │   │   │   ├── test_flash_llama4_image_base64_rgb_jpg.json
│   │   │   │   ├── test_flash_llama4_image_base64_rgb_png.json
│   │   │   │   ├── test_flash_llama4_image_base64_rgba.json
│   │   │   │   ├── test_flash_llama4_image_cow.json
│   │   │   │   └── test_flash_llama4_image_cow_dog.json
│   │   │   └── test_transformers_olmo/
│   │   │       ├── test_flash_llama_load.json
│   │   │       └── test_flash_llama_simple.json
│   │   ├── test_bloom_560m.py
│   │   ├── test_bloom_560m_sharded.py
│   │   ├── test_chat_llama.py
│   │   ├── test_chat_stream_options.py
│   │   ├── test_completion_prompts.py
│   │   ├── test_compressed_tensors_w8a8_int.py
│   │   ├── test_compressed_tensors_w8a8_int_dynamic_weight.py
│   │   ├── test_compressed_tensors_w8an_fp.py
│   │   ├── test_compressed_tensors_wna16_int.py
│   │   ├── test_compressed_tensors_wna16_int_24.py
│   │   ├── test_continue_final_message.py
│   │   ├── test_flash_awq.py
│   │   ├── test_flash_awq_sharded.py
│   │   ├── test_flash_deepseek_v2.py
│   │   ├── test_flash_falcon.py
│   │   ├── test_flash_gemma.py
│   │   ├── test_flash_gemma2.py
│   │   ├── test_flash_gemma3.py
│   │   ├── test_flash_gemma_gptq.py
│   │   ├── test_flash_gpt2.py
│   │   ├── test_flash_grammar_llama.py
│   │   ├── test_flash_llama.py
│   │   ├── test_flash_llama_exl2.py
│   │   ├── test_flash_llama_fp8.py
│   │   ├── test_flash_llama_fp8_kv_cache.py
│   │   ├── test_flash_llama_gptq.py
│   │   ├── test_flash_llama_marlin.py
│   │   ├── test_flash_llama_marlin_24.py
│   │   ├── test_flash_llama_prefix.py
│   │   ├── test_flash_llama_prefix_flashdecoding.py
│   │   ├── test_flash_medusa.py
│   │   ├── test_flash_mistral.py
│   │   ├── test_flash_mixtral.py
│   │   ├── test_flash_mixtral_awq.py
│   │   ├── test_flash_mixtral_gptq.py
│   │   ├── test_flash_neox.py
│   │   ├── test_flash_neox_sharded.py
│   │   ├── test_flash_pali_gemma.py
│   │   ├── test_flash_pali_gemma2.py
│   │   ├── test_flash_phi.py
│   │   ├── test_flash_phi35_moe.py
│   │   ├── test_flash_qwen2.py
│   │   ├── test_flash_qwen2_5_vl.py
│   │   ├── test_flash_qwen2_vl.py
│   │   ├── test_flash_santacoder.py
│   │   ├── test_flash_starcoder.py
│   │   ├── test_flash_starcoder2.py
│   │   ├── test_flash_starcoder2_lora.py
│   │   ├── test_flash_starcoder_gptq.py
│   │   ├── test_grammar_llama.py
│   │   ├── test_grammar_response_format_llama.py
│   │   ├── test_idefics.py
│   │   ├── test_idefics2.py
│   │   ├── test_idefics3.py
│   │   ├── test_json_schema_constrain.py
│   │   ├── test_llava_next.py
│   │   ├── test_lora_mistral.py
│   │   ├── test_mamba.py
│   │   ├── test_mllama.py
│   │   ├── test_mpt.py
│   │   ├── test_mt0_base.py
│   │   ├── test_neox.py
│   │   ├── test_neox_sharded.py
│   │   ├── test_opt.py
│   │   ├── test_smolvlm.py
│   │   ├── test_t5_sharded.py
│   │   ├── test_tools_llama.py
│   │   ├── test_transformers_llama4.py
│   │   └── test_transformers_olmo.py
│   ├── neuron/
│   │   ├── test_generate.py
│   │   └── test_implicit_env.py
│   ├── pyproject.toml
│   ├── pytest.ini
│   └── requirements.txt
├── launcher/
│   ├── Cargo.toml
│   ├── build.rs
│   └── src/
│       ├── env_runtime.rs
│       ├── gpu.rs
│       └── main.rs
├── load_tests/
│   ├── Makefile
│   ├── benchmarks.py
│   ├── common.js
│   ├── filter.py
│   ├── long.js
│   ├── long.py
│   ├── long_prompt2.py
│   ├── orca.py
│   └── pyproject.toml
├── nix/
│   ├── client.nix
│   ├── crate-overrides.nix
│   ├── docker.nix
│   ├── impure-shell.nix
│   ├── overlay.nix
│   └── server.nix
├── proto/
│   ├── generate.proto
│   └── v3/
│       └── generate.proto
├── router/
│   ├── Cargo.toml
│   ├── README.md
│   ├── build.rs
│   └── src/
│       ├── chat.rs
│       ├── config.rs
│       ├── infer/
│       │   ├── chat_template.rs
│       │   ├── mod.rs
│       │   └── tool_grammar.rs
│       ├── kserve.rs
│       ├── lib.rs
│       ├── logging.rs
│       ├── sagemaker.rs
│       ├── server.rs
│       ├── usage_stats.rs
│       ├── validation.rs
│       └── vertex.rs
├── rust-toolchain.toml
├── sagemaker-entrypoint.sh
├── server/
│   ├── .gitignore
│   ├── Makefile
│   ├── Makefile-awq
│   ├── Makefile-eetq
│   ├── Makefile-exllamav2
│   ├── Makefile-flash-att
│   ├── Makefile-flash-att-v2
│   ├── Makefile-flashinfer
│   ├── Makefile-selective-scan
│   ├── Makefile-vllm
│   ├── README.md
│   ├── bounds-from-nix.py
│   ├── custom_kernels/
│   │   ├── custom_kernels/
│   │   │   ├── fused_attention_cuda.cu
│   │   │   └── fused_bloom_attention_cuda.cu
│   │   └── setup.py
│   ├── exllama_kernels/
│   │   ├── exllama_kernels/
│   │   │   ├── cu_compat.cuh
│   │   │   ├── cuda_buffers.cu
│   │   │   ├── cuda_buffers.cuh
│   │   │   ├── cuda_func/
│   │   │   │   ├── column_remap.cu
│   │   │   │   ├── column_remap.cuh
│   │   │   │   ├── q4_matmul.cu
│   │   │   │   ├── q4_matmul.cuh
│   │   │   │   ├── q4_matrix.cu
│   │   │   │   └── q4_matrix.cuh
│   │   │   ├── exllama_ext.cpp
│   │   │   ├── hip_compat.cuh
│   │   │   ├── matrix.cuh
│   │   │   ├── tuning.h
│   │   │   └── util.cuh
│   │   └── setup.py
│   ├── exllamav2_kernels/
│   │   ├── exllamav2_kernels/
│   │   │   ├── config.h
│   │   │   ├── cpp/
│   │   │   │   └── util.h
│   │   │   ├── cuda/
│   │   │   │   ├── compat.cuh
│   │   │   │   ├── matrix_view.cuh
│   │   │   │   ├── q_gemm.cu
│   │   │   │   ├── q_gemm.cuh
│   │   │   │   ├── q_gemm_kernel.cuh
│   │   │   │   ├── q_gemm_kernel_gptq.cuh
│   │   │   │   ├── q_matrix.cu
│   │   │   │   ├── q_matrix.cuh
│   │   │   │   ├── quant/
│   │   │   │   │   ├── qdq_2.cuh
│   │   │   │   │   ├── qdq_3.cuh
│   │   │   │   │   ├── qdq_4.cuh
│   │   │   │   │   ├── qdq_5.cuh
│   │   │   │   │   ├── qdq_6.cuh
│   │   │   │   │   ├── qdq_8.cuh
│   │   │   │   │   └── qdq_util.cuh
│   │   │   │   └── util.cuh
│   │   │   └── ext.cpp
│   │   └── setup.py
│   ├── pyproject.toml
│   ├── req.txt
│   ├── requirements_cuda.txt
│   ├── requirements_gen.txt
│   ├── requirements_intel.txt
│   ├── requirements_rocm.txt
│   ├── tests/
│   │   ├── conftest.py
│   │   ├── models/
│   │   │   ├── test_bloom.py
│   │   │   ├── test_causal_lm.py
│   │   │   ├── test_model.py
│   │   │   ├── test_santacoder.py
│   │   │   └── test_seq2seq_lm.py
│   │   └── utils/
│   │       ├── test_adapter.py
│   │       ├── test_convert.py
│   │       ├── test_hub.py
│   │       ├── test_layers.py
│   │       ├── test_tokens.py
│   │       ├── test_watermark.py
│   │       └── test_weights.py
│   └── text_generation_server/
│       ├── __init__.py
│       ├── adapters/
│       │   ├── __init__.py
│       │   ├── config.py
│       │   ├── lora.py
│       │   └── weights.py
│       ├── cache.py
│       ├── cli.py
│       ├── interceptor.py
│       ├── layers/
│       │   ├── __init__.py
│       │   ├── attention/
│       │   │   ├── __init__.py
│       │   │   ├── common.py
│       │   │   ├── cuda.py
│       │   │   ├── flash_attn_triton.py
│       │   │   ├── flashinfer.py
│       │   │   ├── ipex.py
│       │   │   ├── kv_cache.py
│       │   │   └── rocm.py
│       │   ├── awq/
│       │   │   ├── conversion_utils.py
│       │   │   └── quantize/
│       │   │       ├── __init__.py
│       │   │       ├── cuda.py
│       │   │       └── ipex.py
│       │   ├── bnb.py
│       │   ├── compressed_tensors/
│       │   │   ├── __init__.py
│       │   │   ├── loader.py
│       │   │   ├── w8a8_int.py
│       │   │   ├── w8an_fp.py
│       │   │   ├── wna16_int.py
│       │   │   └── wna16_int_24.py
│       │   ├── conv.py
│       │   ├── eetq.py
│       │   ├── exl2.py
│       │   ├── fp8.py
│       │   ├── gptq/
│       │   │   ├── __init__.py
│       │   │   ├── custom_autotune.py
│       │   │   ├── exllama.py
│       │   │   ├── exllamav2.py
│       │   │   ├── ipex.py
│       │   │   ├── quantize.py
│       │   │   ├── triton.py
│       │   │   └── utils.py
│       │   ├── layernorm.py
│       │   ├── linear.py
│       │   ├── lora.py
│       │   ├── marlin/
│       │   │   ├── __init__.py
│       │   │   ├── fp8.py
│       │   │   ├── gptq.py
│       │   │   ├── marlin.py
│       │   │   └── util.py
│       │   ├── medusa.py
│       │   ├── mlp.py
│       │   ├── moe/
│       │   │   ├── __init__.py
│       │   │   ├── fp8.py
│       │   │   ├── fused_moe_ipex.py
│       │   │   ├── gptq_marlin.py
│       │   │   └── unquantized.py
│       │   ├── rotary.py
│       │   ├── speculative.py
│       │   └── tensor_parallel.py
│       ├── models/
│       │   ├── __init__.py
│       │   ├── bloom.py
│       │   ├── causal_lm.py
│       │   ├── custom_modeling/
│       │   │   ├── __init__.py
│       │   │   ├── bloom_modeling.py
│       │   │   ├── clip.py
│       │   │   ├── flash_cohere_modeling.py
│       │   │   ├── flash_dbrx_modeling.py
│       │   │   ├── flash_deepseek_v2_modeling.py
│       │   │   ├── flash_deepseek_v3_modeling.py
│       │   │   ├── flash_gemma2_modeling.py
│       │   │   ├── flash_gemma3_modeling.py
│       │   │   ├── flash_gemma_modeling.py
│       │   │   ├── flash_gpt2_modeling.py
│       │   │   ├── flash_gptj_modeling.py
│       │   │   ├── flash_llama_modeling.py
│       │   │   ├── flash_mistral_modeling.py
│       │   │   ├── flash_mixtral_modeling.py
│       │   │   ├── flash_neox_modeling.py
│       │   │   ├── flash_pali_gemma_modeling.py
│       │   │   ├── flash_phi_modeling.py
│       │   │   ├── flash_phi_moe_modeling.py
│       │   │   ├── flash_qwen2_modeling.py
│       │   │   ├── flash_rw_modeling.py
│       │   │   ├── flash_santacoder_modeling.py
│       │   │   ├── flash_starcoder2_modeling.py
│       │   │   ├── gemma3/
│       │   │   │   ├── configuration_gemma3.py
│       │   │   │   ├── image_processing_gemma3.py
│       │   │   │   ├── processing_gemma3.py
│       │   │   │   └── utils.py
│       │   │   ├── idefics2.py
│       │   │   ├── idefics3.py
│       │   │   ├── idefics_config.py
│       │   │   ├── idefics_image_processing.py
│       │   │   ├── idefics_modeling.py
│       │   │   ├── idefics_perceiver.py
│       │   │   ├── idefics_processing.py
│       │   │   ├── idefics_vision.py
│       │   │   ├── llava_next.py
│       │   │   ├── mamba_modeling.py
│       │   │   ├── mllama.py
│       │   │   ├── mpt_modeling.py
│       │   │   ├── neox_modeling.py
│       │   │   ├── opt_modeling.py
│       │   │   ├── phi_modeling.py
│       │   │   ├── qwen2_5_vl.py
│       │   │   ├── qwen2_vl.py
│       │   │   ├── siglip.py
│       │   │   ├── t5_modeling.py
│       │   │   └── vlm.py
│       │   ├── flash_causal_lm.py
│       │   ├── galactica.py
│       │   ├── globals.py
│       │   ├── idefics_causal_lm.py
│       │   ├── mamba.py
│       │   ├── metadata_kernels.py
│       │   ├── mllama_causal_lm.py
│       │   ├── model.py
│       │   ├── seq2seq_lm.py
│       │   ├── transformers_flash_causal_lm.py
│       │   ├── transformers_flash_vlm.py
│       │   ├── types.py
│       │   └── vlm_causal_lm.py
│       ├── pb/
│       │   └── .gitignore
│       ├── server.py
│       ├── tracing.py
│       └── utils/
│           ├── __init__.py
│           ├── adapter.py
│           ├── chunks.py
│           ├── convert.py
│           ├── dist.py
│           ├── hub.py
│           ├── import_utils.py
│           ├── kernels.py
│           ├── log.py
│           ├── logits_process.py
│           ├── merges/
│           │   ├── strategies.py
│           │   └── utils.py
│           ├── peft.py
│           ├── prefill_chunking.py
│           ├── quantization.py
│           ├── segments.py
│           ├── speculate.py
│           ├── tokens.py
│           ├── watermark.py
│           └── weights.py
├── tgi-entrypoint.sh
└── update_doc.py
SYMBOL INDEX (5123 symbols across 423 files)

FILE: backends/client/build.rs
  function main (line 3) | fn main() -> Result<(), Box<dyn std::error::Error>> {

FILE: backends/client/src/lib.rs
  type Health (line 15) | pub trait Health {
    method device_health (line 17) | async fn device_health(&self) -> Result<()>;
    method model_health (line 21) | async fn model_health(&self) -> Result<()>;
  type ShardInfo (line 25) | pub struct ShardInfo {
  type ClientError (line 34) | pub enum ClientError {
    method from (line 44) | fn from(err: Status) -> Self {
    method from (line 52) | fn from(err: transport::Error) -> Self {
  method from (line 61) | fn from(chunk: Chunk) -> Self {
  type ChunksToString (line 68) | pub trait ChunksToString {
    method chunks_to_string (line 70) | fn chunks_to_string(&self) -> String;
    method chunks_to_string (line 74) | fn chunks_to_string(&self) -> String {
  type Result (line 91) | pub type Result<T> = std::result::Result<T, ClientError>;

FILE: backends/client/src/v2/client.rs
  type Client (line 16) | pub struct Client {
    method connect (line 22) | pub async fn connect(uri: Uri) -> Result<Self> {
    method connect_uds (line 31) | pub async fn connect_uds(path: String) -> Result<Self> {
    method service_discovery (line 46) | pub async fn service_discovery(&mut self) -> Result<Vec<String>> {
    method info (line 66) | pub async fn info(&mut self) -> Result<InfoResponse> {
    method health (line 74) | pub async fn health(&mut self) -> Result<HealthResponse> {
    method clear_cache (line 82) | pub async fn clear_cache(&mut self, batch_id: Option<u64>) -> Result<(...
    method filter_batch (line 90) | pub async fn filter_batch(
    method warmup (line 108) | pub async fn warmup(
    method prefill (line 189) | pub async fn prefill(
    method decode (line 207) | pub async fn decode(
  type PrefillTimings (line 226) | pub struct PrefillTimings {
    method new (line 233) | fn new(forward_ns: u64, decode_ns: u64, total_ns: u64) -> Self {
  type DecodeTimings (line 242) | pub struct DecodeTimings {
    method new (line 250) | fn new(concat_ns: Option<u64>, forward_ns: u64, decode_ns: u64, total_...

FILE: backends/client/src/v2/sharded_client.rs
  type ShardedClient (line 18) | pub struct ShardedClient {
    method new (line 23) | fn new(clients: Vec<Client>) -> Self {
    method from_master_client (line 29) | async fn from_master_client(mut master_client: Client) -> Result<Self> {
    method connect (line 38) | pub async fn connect(uri: Uri) -> Result<Self> {
    method connect_uds (line 44) | pub async fn connect_uds(path: String) -> Result<Self> {
    method info (line 51) | pub async fn info(&mut self) -> Result<ShardInfo> {
    method health (line 62) | pub async fn health(&mut self) -> Result<HealthResponse> {
    method clear_cache (line 73) | pub async fn clear_cache(&mut self, batch_id: Option<u64>) -> Result<(...
    method filter_batch (line 84) | pub async fn filter_batch(
    method warmup (line 102) | pub async fn warmup(
    method prefill (line 134) | pub async fn prefill(
    method decode (line 167) | pub async fn decode(
  method from (line 197) | fn from(value: InfoResponse) -> Self {
  method device_health (line 210) | async fn device_health(&self) -> Result<()> {
  method model_health (line 215) | async fn model_health(&self) -> Result<()> {

FILE: backends/client/src/v3/client.rs
  type Client (line 16) | pub struct Client {
    method connect (line 22) | pub async fn connect(uri: Uri) -> Result<Self> {
    method connect_uds (line 31) | pub async fn connect_uds(path: String) -> Result<Self> {
    method service_discovery (line 46) | pub async fn service_discovery(&mut self) -> Result<Vec<String>> {
    method info (line 66) | pub async fn info(&mut self) -> Result<InfoResponse> {
    method health (line 74) | pub async fn health(&mut self) -> Result<HealthResponse> {
    method clear_cache (line 82) | pub async fn clear_cache(&mut self, batch_id: Option<u64>) -> Result<(...
    method filter_batch (line 90) | pub async fn filter_batch(
    method warmup (line 108) | pub async fn warmup(
    method prefill (line 230) | pub async fn prefill(
    method decode (line 253) | pub async fn decode(
  type PrefillTimings (line 272) | pub struct PrefillTimings {
    method new (line 279) | fn new(forward_ns: u64, decode_ns: u64, total_ns: u64) -> Self {
  type DecodeTimings (line 288) | pub struct DecodeTimings {
    method new (line 296) | fn new(concat_ns: Option<u64>, forward_ns: u64, decode_ns: u64, total_...

FILE: backends/client/src/v3/sharded_client.rs
  type ShardedClient (line 18) | pub struct ShardedClient {
    method new (line 23) | fn new(clients: Vec<Client>) -> Self {
    method from_master_client (line 29) | async fn from_master_client(mut master_client: Client) -> Result<Self> {
    method connect (line 38) | pub async fn connect(uri: Uri) -> Result<Self> {
    method connect_uds (line 44) | pub async fn connect_uds(path: String) -> Result<Self> {
    method info (line 51) | pub async fn info(&mut self) -> Result<ShardInfo> {
    method health (line 62) | pub async fn health(&mut self) -> Result<HealthResponse> {
    method clear_cache (line 73) | pub async fn clear_cache(&mut self, batch_id: Option<u64>) -> Result<(...
    method filter_batch (line 84) | pub async fn filter_batch(
    method warmup (line 102) | pub async fn warmup(
    method prefill (line 142) | pub async fn prefill(
    method decode (line 176) | pub async fn decode(
  method from (line 206) | fn from(value: InfoResponse) -> Self {
  method device_health (line 219) | async fn device_health(&self) -> Result<()> {
  method model_health (line 224) | async fn model_health(&self) -> Result<()> {

FILE: backends/gaudi/server/text_generation_server/adapters/config.py
  class ModuleMap (line 15) | class ModuleMap:
  class AdapterConfig (line 21) | class AdapterConfig(ABC):
    method map_weights_for_model (line 25) | def map_weights_for_model(

FILE: backends/gaudi/server/text_generation_server/adapters/lora.py
  function get_start_stop_idxs_for_rank (line 30) | def get_start_stop_idxs_for_rank(offset, size, rank, world_size):
  function shard_on_dim (line 37) | def shard_on_dim(
  function shard_lora_weights (line 56) | def shard_lora_weights(
  class LoraConfig (line 74) | class LoraConfig(AdapterConfig):
    method map_weights_for_model (line 81) | def map_weights_for_model(
    method load (line 103) | def load(cls, adapter_id: str, api_token: str) -> "LoraConfig":
  class LoraWeights (line 117) | class LoraWeights(AdapterWeights):
    method __init__ (line 120) | def __init__(
    method weights_a (line 142) | def weights_a(self) -> torch.Tensor:
    method weights_b (line 148) | def weights_b(self) -> torch.Tensor:
    method weights_a_t (line 154) | def weights_a_t(self) -> torch.Tensor:
    method weights_b_t (line 160) | def weights_b_t(self) -> torch.Tensor:
    method _transpose_weights (line 165) | def _transpose_weights(self):
    method get_batch_types (line 173) | def get_batch_types(cls) -> List[Type[BatchAdapterWeights]]:
    method prepare_weights (line 190) | def prepare_weights(
  class RankSegments (line 256) | class RankSegments:
  class BatchLoraWeights (line 273) | class BatchLoraWeights(BatchAdapterWeights):
    method has_adapter (line 280) | def has_adapter(self, adapter_index: int) -> bool:
    method can_vectorize (line 283) | def can_vectorize(self, pg: ProcessGroup) -> bool:
    method load (line 290) | def load(
  function get_scaling_factor (line 457) | def get_scaling_factor(
  function _convert_lora (line 468) | def _convert_lora(v: AdapterWeights) -> AdapterWeights:

FILE: backends/gaudi/server/text_generation_server/adapters/weights.py
  class AdapterBatchMetadata (line 14) | class AdapterBatchMetadata:
  class AdapterWeights (line 30) | class AdapterWeights(ABC):
    method get_batch_types (line 32) | def get_batch_types(cls) -> List[Type["BatchAdapterWeights"]]:
    method speculative_tokens (line 36) | def speculative_tokens(self) -> int:
  class BatchAdapterWeights (line 40) | class BatchAdapterWeights(ABC):
    method has_adapter (line 42) | def has_adapter(self, adapter_index: int) -> bool:
    method load (line 46) | def load(
  class LayerAdapterWeights (line 56) | class LayerAdapterWeights:
    method __init__ (line 59) | def __init__(self):
    method add_adapter (line 62) | def add_adapter(self, adapter_idx: int, weights: AdapterWeights):
    method remove_adapter (line 65) | def remove_adapter(self, adapter_idx: int):
    method is_empty (line 70) | def is_empty(self) -> bool:
    method get_data (line 73) | def get_data(
  class AdapterBatchData (line 98) | class AdapterBatchData:
    method from_meta (line 107) | def from_meta(
    method ranks (line 122) | def ranks(self) -> Set[int]:
    method layer_names (line 134) | def layer_names(self) -> Set[str]:
    method adapter_keys (line 137) | def adapter_keys(self) -> Set[str]:
    method max_rank (line 144) | def max_rank(self) -> int:

FILE: backends/gaudi/server/text_generation_server/cache.py
  class Cache (line 10) | class Cache:
    method __init__ (line 11) | def __init__(self):
    method pop (line 14) | def pop(self, batch_id: int) -> Optional[B]:
    method set (line 17) | def set(self, entry: B):
    method delete (line 21) | def delete(self, batch_id: int):
    method clear (line 28) | def clear(self):
    method __len__ (line 33) | def __len__(self):

FILE: backends/gaudi/server/text_generation_server/cli.py
  class Quantization (line 16) | class Quantization(str, Enum):
  class Dtype (line 23) | class Dtype(str, Enum):
  class KVCacheDtype (line 28) | class KVCacheDtype(str, Enum):
  function serve (line 34) | def serve(
  function download_weights (line 132) | def download_weights(
  function quantize (line 336) | def quantize(

FILE: backends/gaudi/server/text_generation_server/interceptor.py
  class ExceptionInterceptor (line 15) | class ExceptionInterceptor(AsyncServerInterceptor):
    method intercept (line 16) | async def intercept(

FILE: backends/gaudi/server/text_generation_server/layers/attention/common.py
  class HPUPagedAttentionMetadata (line 11) | class HPUPagedAttentionMetadata:
  function subtuple (line 27) | def subtuple(
  function trim_attn_metadata (line 47) | def trim_attn_metadata(metadata: HPUPagedAttentionMetadata) -> object:
  class Seqlen (line 89) | class Seqlen:
    method __init__ (line 93) | def __init__(
    method clamp (line 99) | def clamp(self, max):
    method make_sliding_window_bias (line 103) | def make_sliding_window_bias(
  function _async_h2d_tensor_copy (line 146) | def _async_h2d_tensor_copy(source, device="hpu"):
  function trim_seqlen_metadata (line 157) | def trim_seqlen_metadata(metadata: Seqlen) -> object:

FILE: backends/gaudi/server/text_generation_server/layers/attention/hpu.py
  class FP8Matmul (line 16) | class FP8Matmul(torch.nn.Module):
    method __init__ (line 18) | def __init__(self, scale_other):
    method quant_input (line 23) | def quant_input(self, x, scale):
    method matmul_fp8 (line 28) | def matmul_fp8(
    method forward (line 44) | def forward(self, input, other):
  class FetchFromCache (line 57) | class FetchFromCache(torch.nn.Module):
    method __init__ (line 59) | def __init__(self, scale_inv):
    method forward (line 63) | def forward(self, cache, blocks):
  function attention (line 73) | def attention(
  function set_block_mapping (line 110) | def set_block_mapping(hpu_attention_meta: HPUPagedAttentionMetadata, bat...
  function paged_attention (line 134) | def paged_attention(
  function paged_attention_mla (line 185) | def paged_attention_mla(

FILE: backends/gaudi/server/text_generation_server/layers/attention/kv_cache.py
  class KVScales (line 11) | class KVScales:
    method __post_init__ (line 26) | def __post_init__(self):
  class KVCache (line 34) | class KVCache:
    method __init__ (line 41) | def __init__(
    method dtype (line 69) | def dtype(self):
    method key (line 74) | def key(self):
    method value (line 80) | def value(self):
    method store (line 85) | def store(
  class KVCompressCache (line 110) | class KVCompressCache(KVCache):
    method __init__ (line 117) | def __init__(
    method dtype (line 137) | def dtype(self):
    method key (line 142) | def key(self):
    method value (line 148) | def value(self):
    method store (line 153) | def store(
  function paged_reshape_and_cache (line 170) | def paged_reshape_and_cache(
  function get_kv_scales (line 190) | def get_kv_scales(weights: Weights, prefix: str) -> KVScales:

FILE: backends/gaudi/server/text_generation_server/layers/awq/conversion_utils.py
  function pack (line 9) | def pack(imatrix: torch.Tensor, direction: str = "column"):
  function unpack (line 35) | def unpack(qmatrix: torch.Tensor, direction: str = "column"):
  function apply_order (line 61) | def apply_order(
  function fast_awq_to_gptq (line 83) | def fast_awq_to_gptq(qweight, qzeros):

FILE: backends/gaudi/server/text_generation_server/layers/awq/quantize/hpu.py
  function error_raiser_hpu (line 12) | def error_raiser_hpu(*args, **kwargs):
  function unpack_awq (line 22) | def unpack_awq(qweight: torch.Tensor, qzeros: torch.Tensor, bits: int):
  function reverse_awq_order (line 45) | def reverse_awq_order(iweights: torch.Tensor, izeros: torch.Tensor, bits...
  function unpack_weight_and_zeros (line 62) | def unpack_weight_and_zeros(qweight, qzeros, bits):
  function pack_tensor (line 75) | def pack_tensor(input, bits=4):
  class WQLinear (line 93) | class WQLinear(nn.Module):
    method __init__ (line 94) | def __init__(
    method _preprocessing (line 117) | def _preprocessing(self):
    method forward (line 126) | def forward(self, x):

FILE: backends/gaudi/server/text_generation_server/layers/bnb.py
  class BNBWeight (line 10) | class BNBWeight(UnquantizedWeight):
    method get_linear (line 13) | def get_linear(self, bias: torch.Tensor):
  class Linear8bitLt (line 17) | class Linear8bitLt(torch.nn.Module):
    method __init__ (line 18) | def __init__(
    method init_8bit_state (line 49) | def init_8bit_state(self):
    method forward (line 55) | def forward(self, x: torch.Tensor):
  class BNBFP4Weight (line 76) | class BNBFP4Weight(UnquantizedWeight):
    method get_linear (line 79) | def get_linear(self, bias: torch.Tensor):
  class BNBNF4Weight (line 84) | class BNBNF4Weight(UnquantizedWeight):
    method get_linear (line 87) | def get_linear(self, bias: torch.Tensor):
  class Linear4bit (line 91) | class Linear4bit(torch.nn.Module):
    method __init__ (line 92) | def __init__(self, weight, bias, quant_type):
    method forward (line 104) | def forward(self, x: torch.Tensor):

FILE: backends/gaudi/server/text_generation_server/layers/compressed_tensors/loader.py
  class CompressedTensorsLoader (line 29) | class CompressedTensorsLoader(WeightsLoader):
    method __init__ (line 32) | def __init__(self, config: Dict[str, Any]):
    method get_weights (line 69) | def get_weights(self, weights: Weights, prefix: str):
    method get_weights_col_packed (line 73) | def get_weights_col_packed(
    method get_multi_weights_col (line 82) | def get_multi_weights_col(self, weights: Weights, prefixes: List[str],...
    method get_multi_weights (line 86) | def get_multi_weights(self, weights: Weights, prefixes: List[str], dim...
    method get_weights_row (line 90) | def get_weights_row(self, weights: Weights, prefix: str):
    method _get_target_loaders (line 94) | def _get_target_loaders(
    method _create_loader_for_group (line 125) | def _create_loader_for_group(
    method _lookup_loader (line 154) | def _lookup_loader(self, prefix: str) -> WeightsLoader:

FILE: backends/gaudi/server/text_generation_server/layers/compressed_tensors/w8an_fp.py
  class W8ANFpLoader (line 14) | class W8ANFpLoader(WeightsLoader):
    method __init__ (line 19) | def __init__(
    method __str__ (line 41) | def __str__(self) -> str:
    method get_weights (line 49) | def get_weights(self, weights: "Weights", prefix: str):
    method get_weights_col_packed (line 81) | def get_weights_col_packed(
    method get_multi_weights_col (line 130) | def get_multi_weights_col(self, weights: "Weights", prefixes: List[str...
    method get_multi_weights (line 177) | def get_multi_weights(self, weights: "Weights", prefixes: List[str], d...
    method get_weights_row (line 227) | def get_weights_row(self, weights: "Weights", prefix: str):

FILE: backends/gaudi/server/text_generation_server/layers/conv.py
  function load_conv2d (line 6) | def load_conv2d(cls, prefix, weights, in_channels, out_channels, kernel_...
  function load_conv2d_no_bias (line 23) | def load_conv2d_no_bias(

FILE: backends/gaudi/server/text_generation_server/layers/exl2.py
  class Exl2Weight (line 9) | class Exl2Weight(Weight):
    method __post_init__ (line 20) | def __post_init__(self):
    method device (line 25) | def device(self) -> torch.device:
    method get_linear (line 28) | def get_linear(self, bias: torch.Tensor):
  class Exl2WeightsLoader (line 34) | class Exl2WeightsLoader(WeightsLoader):
    method get_weights (line 37) | def get_weights(self, weights: "Weights", prefix: str):
    method get_weights_col_packed (line 61) | def get_weights_col_packed(
    method get_weights_col (line 69) | def get_weights_col(self, weights: Weights, prefix: str):
    method get_multi_weights_col (line 73) | def get_multi_weights_col(self, weights: Weights, prefixes: List[str],...
    method get_weights_row (line 76) | def get_weights_row(self, weights: Weights, prefix: str):

FILE: backends/gaudi/server/text_generation_server/layers/fp8.py
  function pad_weight (line 22) | def pad_weight(weight, block_size):
  function unpad_weight (line 37) | def unpad_weight(weight, original_M, original_N, keep_first_dim=False):
  function pad_block_fp8_weight_naive (line 47) | def pad_block_fp8_weight_naive(weight, weight_scale, block_size):
  function dynamic_quant (line 63) | def dynamic_quant(data, single_scale=False):
  function dequant_block_fp8_weight_naive (line 75) | def dequant_block_fp8_weight_naive(
  function apply_block_fp8_linear_hpu_dynamic (line 132) | def apply_block_fp8_linear_hpu_dynamic(
  function get_fp8_linear (line 162) | def get_fp8_linear(force_w8a16: bool = False) -> Type[torch.nn.Module]:
  function normalize_e4m3fn_to_native_float8 (line 170) | def normalize_e4m3fn_to_native_float8(
  function per_tensor_dequantize (line 178) | def per_tensor_dequantize(
  function requantize_with_max_scale (line 194) | def requantize_with_max_scale(
  function fp8_quantize (line 220) | def fp8_quantize(
  class HybridFP8UnquantLoader (line 245) | class HybridFP8UnquantLoader(WeightsLoader):
    method __init__ (line 248) | def __init__(
    method get_weights (line 258) | def get_weights(self, weights: "Weights", prefix: str):
    method get_weights_col_packed (line 299) | def get_weights_col_packed(
    method get_multi_weights_col (line 352) | def get_multi_weights_col(self, weights: "Weights", prefixes: List[str...
    method get_multi_weights (line 414) | def get_multi_weights(self, weights: "Weights", prefixes: List[str], d...
    method get_weights_row (line 476) | def get_weights_row(self, weights: "Weights", prefix: str):
  class Fp8Weight (line 524) | class Fp8Weight(Weight):
    method get_linear (line 533) | def get_linear(self, bias: torch.Tensor):
  class Fp8Linear (line 552) | class Fp8Linear(torch.nn.Module):
    method __init__ (line 555) | def __init__(
    method from_unquant (line 577) | def from_unquant(cls, weight, bias, dtype):
    method from_fp8 (line 589) | def from_fp8(
    method forward (line 627) | def forward(self, input: torch.Tensor) -> torch.Tensor:
  function _load_scalar_or_matrix_scale (line 650) | def _load_scalar_or_matrix_scale(weights: Weights, prefix: str, shape: t...

FILE: backends/gaudi/server/text_generation_server/layers/gptq/__init__.py
  class GPTQWeight (line 19) | class GPTQWeight(Weight):
    method __post_init__ (line 29) | def __post_init__(self):
    method device (line 34) | def device(self) -> torch.device:
    method get_linear (line 37) | def get_linear(self, bias: torch.Tensor):
  class GPTQWeightsLoader (line 66) | class GPTQWeightsLoader(WeightsLoader):
    method __init__ (line 71) | def __init__(
    method is_layer_skipped_quantization (line 90) | def is_layer_skipped_quantization(
    method get_weights (line 95) | def get_weights(self, weights: Weights, prefix: str):
    method get_weights_col_packed (line 157) | def get_weights_col_packed(
    method get_multi_weights_col (line 217) | def get_multi_weights_col(self, weights: Weights, prefixes: List[str],...
    method get_multi_weights (line 279) | def get_multi_weights(self, weights: Weights, prefixes: List[str], dim...
    method get_weights_row (line 336) | def get_weights_row(self, weights: Weights, prefix: str):
    method _get_gptq_params (line 426) | def _get_gptq_params(self, weights: Weights):

FILE: backends/gaudi/server/text_generation_server/layers/gptq/hpu.py
  function error_raiser_hpu (line 12) | def error_raiser_hpu(*args, **kwargs):
  function pack_tensor (line 20) | def pack_tensor(input, bits=4):
  class QuantLinear (line 34) | class QuantLinear(nn.Module):
    method __init__ (line 35) | def __init__(self, qweight, qzeros, scales, g_idx, bias, bits, groupsi...
    method unpack_zeros_from_cuda_old_format (line 58) | def unpack_zeros_from_cuda_old_format(self):
    method unpack_weight_from_cuda_old_format (line 71) | def unpack_weight_from_cuda_old_format(self):
    method _preprocessing (line 80) | def _preprocessing(self):
    method new (line 119) | def new(cls, bits, groupsize, infeatures, outfeatures, bias):
    method pack (line 140) | def pack(self, linear, scales, zeros, g_idx=None):
    method forward (line 197) | def forward(self, x):

FILE: backends/gaudi/server/text_generation_server/layers/gptq/quantize.py
  class Quantizer (line 25) | class Quantizer(nn.Module):
    method __init__ (line 26) | def __init__(self, shape=1):
    method configure (line 32) | def configure(
    method _quantize (line 54) | def _quantize(self, x, scale, zero, maxq):
    method find_params (line 60) | def find_params(self, x, weight=False):
    method quantize (line 145) | def quantize(self, x):
    method enabled (line 151) | def enabled(self):
    method ready (line 154) | def ready(self):
  class GPTQ (line 158) | class GPTQ:
    method __init__ (line 159) | def __init__(self, layer, observe=False):
    method add_batch (line 174) | def add_batch(self, inp, out):
    method print_loss (line 209) | def print_loss(self, name, q_weight, weight_error, timecost):
    method fasterquant (line 243) | def fasterquant(
    method free (line 357) | def free(self):
  function get_wikitext2 (line 366) | def get_wikitext2(nsamples, seed, seqlen, model_id, trust_remote_code):
  function get_ptb (line 398) | def get_ptb(nsamples, seed, seqlen, model_id, trust_remote_code):
  function get_c4 (line 430) | def get_c4(nsamples, seed, seqlen, model_id, trust_remote_code):
  function get_ptb_new (line 498) | def get_ptb_new(nsamples, seed, seqlen, model_id, trust_remote_code):
  function get_c4_new (line 530) | def get_c4_new(nsamples, seed, seqlen, model_id, trust_remote_code):
  function get_loaders (line 584) | def get_loaders(
  function find_layers (line 599) | def find_layers(module, layers=(nn.Conv2d, nn.Linear), name=""):
  function sequential (line 615) | def sequential(
  function make_quant_linear (line 754) | def make_quant_linear(module, names, bits, groupsize, name=""):
  function pack (line 780) | def pack(model, quantizers, bits, groupsize):
  function setdeepattr (line 794) | def setdeepattr(module, full_name, tensor):
  function getdeepattr (line 802) | def getdeepattr(module, full_name):
  function load_weights_pre_hook (line 810) | def load_weights_pre_hook(module_name, weights, recursive=False):
  function load_weights_post_hook (line 842) | def load_weights_post_hook(module_name, weights, recursive=False):
  function quantize (line 867) | def quantize(

FILE: backends/gaudi/server/text_generation_server/layers/gptq/utils.py
  function torch_snr_error (line 5) | def torch_snr_error(

FILE: backends/gaudi/server/text_generation_server/layers/layernorm.py
  function load_layer_norm (line 8) | def load_layer_norm(cls, prefix, weights, eps):
  function load_layer_norm_no_bias (line 20) | def load_layer_norm_no_bias(cls, prefix, weights, eps):
  class FastLayerNorm (line 34) | class FastLayerNorm(nn.LayerNorm):
    method forward (line 35) | def forward(self, hidden_states, residual=None):
  class FastRMSNorm (line 43) | class FastRMSNorm(nn.Module):
    method __init__ (line 44) | def __init__(self, weight: torch.Tensor, eps: float):
    method load (line 51) | def load(cls, prefix, weights, eps=1e-6):
    method forward (line 55) | def forward(self, hidden_states, residual=None):

FILE: backends/gaudi/server/text_generation_server/layers/linear.py
  class FastLinear (line 5) | class FastLinear(torch.nn.Module):
    method __init__ (line 6) | def __init__(
    method load (line 19) | def load(cls, config, prefix: str, weights, bias: bool):
    method forward (line 27) | def forward(self, input: torch.Tensor) -> torch.Tensor:
  function get_linear (line 31) | def get_linear(weight, bias):

FILE: backends/gaudi/server/text_generation_server/layers/lora.py
  class LoraLinear (line 22) | class LoraLinear(nn.Module):
    method __init__ (line 23) | def __init__(
    method forward_layer_type (line 31) | def forward_layer_type(
    method forward_lora (line 135) | def forward_lora(
    method collect_lora_a (line 154) | def collect_lora_a(self, a_out: torch.Tensor) -> torch.Tensor:
  class TensorParallelMultiAdapterLinear (line 158) | class TensorParallelMultiAdapterLinear(LoraLinear):
    method __init__ (line 159) | def __init__(
    method load (line 172) | def load(
    method forward (line 184) | def forward(
    method collect_lora_a (line 227) | def collect_lora_a(self, a_out: torch.Tensor) -> torch.Tensor:
  class TensorParallelAdapterRowLinear (line 242) | class TensorParallelAdapterRowLinear(LoraLinear):
    method __init__ (line 243) | def __init__(self, base_layer, layer_id, layer_name, process_group):
    method load (line 248) | def load(cls, base_layer, layer_id, layer_name, process_group):
    method forward (line 251) | def forward(
    method collect_lora_a (line 270) | def collect_lora_a(self, a_out: torch.Tensor) -> torch.Tensor:

FILE: backends/gaudi/server/text_generation_server/layers/medusa.py
  class ResBlock (line 12) | class ResBlock(torch.nn.Module):
    method __init__ (line 13) | def __init__(self, config, prefix, weights):
    method forward (line 20) | def forward(self, x):
  class MedusaModel (line 24) | class MedusaModel(torch.nn.Module):
    method __init__ (line 25) | def __init__(self, config, medusa_config, weights):
    method forward (line 34) | def forward(self, x):
  class MedusaHead (line 41) | class MedusaHead(torch.nn.Module):
    method __init__ (line 42) | def __init__(self, config, medusa_config, prefix, weights):
    method forward (line 55) | def forward(self, x):
  class MedusaHeadV1 (line 62) | class MedusaHeadV1(nn.Module):
    method __init__ (line 63) | def __init__(self, lm_head, medusa):
    method load (line 69) | def load(config, prefix: str, weights):
    method forward (line 97) | def forward(
  class MedusaHeadV2 (line 109) | class MedusaHeadV2(nn.Module):
    method __init__ (line 110) | def __init__(self, config, prefix, weights):
    method forward (line 150) | def forward(self, x):

FILE: backends/gaudi/server/text_generation_server/layers/mlp.py
  class MLPSpeculatorLayerNorm (line 11) | class MLPSpeculatorLayerNorm(nn.Module):
    method __init__ (line 27) | def __init__(
    method forward (line 39) | def forward(self, x):
  function simple_norm (line 51) | def simple_norm(x: torch.Tensor, eps=1e-06):
  class MLPSpeculatorModelTied (line 58) | class MLPSpeculatorModelTied(torch.nn.Module):
    method __init__ (line 59) | def __init__(self, config, prefix, weights):
    method forward (line 96) | def forward(
  class MLPSpeculatorModel (line 142) | class MLPSpeculatorModel(torch.nn.Module):
    method __init__ (line 143) | def __init__(self, config, prefix, weights):
    method forward (line 192) | def forward(
  class MLPSpeculatorHead (line 235) | class MLPSpeculatorHead(nn.Module):
    method __init__ (line 236) | def __init__(self, lm_head, mlp_speculator, scale_input: bool):
    method forward (line 242) | def forward(
    method load (line 257) | def load(config, prefix: str, weights):

FILE: backends/gaudi/server/text_generation_server/layers/moe/__init__.py
  class MoELayer (line 30) | class MoELayer(Protocol):
    method __init__ (line 31) | def __init__(
    method forward (line 49) | def forward(
  class DenseMoELayer (line 54) | class DenseMoELayer(nn.Module):
    method __init__ (line 62) | def __init__(
    method forward (line 143) | def forward(self, x: torch.Tensor, *, gating_output: torch.Tensor) -> ...
  class SparseMoELayer (line 182) | class SparseMoELayer(nn.Module):
    method __init__ (line 189) | def __init__(
    method forward (line 242) | def forward(self, x: torch.Tensor, *, gating_output: torch.Tensor) -> ...
    method is_supported (line 246) | def is_supported(weights: Weights) -> bool:

FILE: backends/gaudi/server/text_generation_server/layers/moe/fp8.py
  class FP8SparseMoELayer (line 20) | class FP8SparseMoELayer(nn.Module):
    method __init__ (line 21) | def __init__(
    method forward (line 105) | def forward(self, x: torch.Tensor, *, gating_output: torch.Tensor) -> ...
  function _load_expert_weights (line 168) | def _load_expert_weights(
  function _load_expert_multi_weights_col (line 218) | def _load_expert_multi_weights_col(
  function _load_expert_weights_row (line 248) | def _load_expert_weights_row(

FILE: backends/gaudi/server/text_generation_server/layers/moe/fused_moe.py
  function grouped_topk (line 21) | def grouped_topk(
  function fused_topk (line 83) | def fused_topk(
  function select_experts (line 98) | def select_experts(

FILE: backends/gaudi/server/text_generation_server/layers/moe/unquantized.py
  class UnquantizedSparseMoELayer (line 13) | class UnquantizedSparseMoELayer(nn.Module):
    method __init__ (line 14) | def __init__(
    method forward (line 83) | def forward(self, x: torch.Tensor, *, gating_output: torch.Tensor) -> ...
  function _load_expert_multi_weights_col (line 103) | def _load_expert_multi_weights_col(
  function _load_expert_weights_row (line 144) | def _load_expert_weights_row(

FILE: backends/gaudi/server/text_generation_server/layers/rotary.py
  function _create_inv_freq (line 11) | def _create_inv_freq(dim, base, device):
  function _get_rope_config (line 18) | def _get_rope_config(config):
  class PositionRotaryEmbedding (line 28) | class PositionRotaryEmbedding(nn.Module):
    method __init__ (line 29) | def __init__(self, inv_freq, scaling_factor, max_position_embeddings):
    method forward (line 43) | def forward(
    method static (line 76) | def static(cls, config, dim, base, device):
    method load (line 208) | def load(cls, config, prefix, weights):
    method _update_cos_sin_cache (line 253) | def _update_cos_sin_cache(self, dtype, device, seqlen):
    method get_cos_sin (line 272) | def get_cos_sin(self, position_ids: torch.Tensor):
  class SuRotaryEmbedding (line 281) | class SuRotaryEmbedding(PositionRotaryEmbedding):
    method __init__ (line 282) | def __init__(
    method _update_cos_sin_cache (line 305) | def _update_cos_sin_cache(self, dtype, device, seqlen):
  class Phi3LongRoPEScaledRotaryEmbedding (line 332) | class Phi3LongRoPEScaledRotaryEmbedding(PositionRotaryEmbedding):
    method __init__ (line 333) | def __init__(
    method _update_cos_sin_cache (line 361) | def _update_cos_sin_cache(self, dtype, device, seqlen):
  class DynamicPositionRotaryEmbedding (line 392) | class DynamicPositionRotaryEmbedding(PositionRotaryEmbedding):
    method __init__ (line 393) | def __init__(self, dim, max_position_embeddings, base, device, scaling...
    method _update_cos_sin_cache (line 400) | def _update_cos_sin_cache(self, dtype, device, seqlen):
  function find_correction_dim (line 426) | def find_correction_dim(num_rotations, dim, base=10000, max_position_emb...
  function find_correction_range (line 433) | def find_correction_range(
  function linear_ramp_mask (line 441) | def linear_ramp_mask(min, max, dim):
  function get_mscale (line 450) | def get_mscale(scale: float = 1.0, mscale: float = 1.0):
  class YarnPositionRotaryEmbedding (line 456) | class YarnPositionRotaryEmbedding(PositionRotaryEmbedding):
    method __init__ (line 457) | def __init__(
    method _update_cos_sin_cache (line 489) | def _update_cos_sin_cache(self, dtype, device, seqlen):
  function apply_llama3_scaling (line 531) | def apply_llama3_scaling(
  class RotaryPositionEmbeddingMultimodalSections (line 560) | class RotaryPositionEmbeddingMultimodalSections(PositionRotaryEmbedding):
    method __init__ (line 561) | def __init__(
    method _update_cos_sin_cache (line 579) | def _update_cos_sin_cache(
    method get_cos_sin (line 596) | def get_cos_sin(

FILE: backends/gaudi/server/text_generation_server/layers/speculative.py
  class SpeculativeHead (line 9) | class SpeculativeHead(torch.nn.Module):
    method __init__ (line 10) | def __init__(self, lm_head, speculator):
    method load (line 16) | def load(config, prefix: str, weights):
    method forward (line 44) | def forward(

FILE: backends/gaudi/server/text_generation_server/layers/tensor_parallel.py
  class LayerConcat (line 9) | class LayerConcat(torch.nn.Module):
    method __init__ (line 15) | def __init__(self, layers: Iterable[torch.nn.Module], dim: int = -1):
    method forward (line 23) | def forward(self, x: torch.Tensor):
  class SuperLayer (line 28) | class SuperLayer(torch.nn.Module):
    method __init__ (line 29) | def __init__(self, linear):
    method forward (line 33) | def forward(self, x):
  class TensorParallelHead (line 37) | class TensorParallelHead(SuperLayer):
    method __init__ (line 38) | def __init__(self, linear, process_group, should_gather: bool):
    method load (line 44) | def load(config, prefix: str, weights):
    method forward (line 73) | def forward(self, input: torch.Tensor) -> torch.Tensor:
  class TensorParallelColumnLinear (line 111) | class TensorParallelColumnLinear(SuperLayer):
    method load_gate_up (line 113) | def load_gate_up(cls, config, prefix: str, weights, bias: bool):
    method load_qkv (line 124) | def load_qkv(
    method load (line 147) | def load(cls, config, prefix: str, weights, bias: bool):
    method load_multi (line 157) | def load_multi(cls, config, prefixes: List[str], weights, bias: bool, ...
  class TensorParallelRowLinear (line 176) | class TensorParallelRowLinear(SuperLayer):
    method __init__ (line 177) | def __init__(self, linear, process_group):
    method load (line 182) | def load(cls, config, prefix: str, weights, bias: bool):
    method forward (line 195) | def forward(self, input: torch.Tensor, reduce: bool = True) -> torch.T...
  class TensorParallelEmbedding (line 206) | class TensorParallelEmbedding(torch.nn.Module):
    method __init__ (line 207) | def __init__(self, prefix: str, weights, reduce=True):
    method forward (line 229) | def forward(self, input: torch.Tensor) -> torch.Tensor:

FILE: backends/gaudi/server/text_generation_server/models/__init__.py
  class ModelType (line 163) | class ModelType(enum.Enum):
  function get_model (line 359) | def get_model(
  function get_model_with_lora_adapters (line 949) | def get_model_with_lora_adapters(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/bloom_modeling.py
  function _make_causal_mask (line 68) | def _make_causal_mask(
  function _expand_mask (line 88) | def _expand_mask(mask: torch.Tensor, tgt_length: int) -> torch.BoolTensor:
  function build_alibi_tensor (line 99) | def build_alibi_tensor(attention_mask: torch.Tensor, num_heads: int) -> ...
  function dropout_add (line 156) | def dropout_add(
  function _split_heads (line 178) | def _split_heads(
  function _merge_heads (line 210) | def _merge_heads(x: torch.Tensor, num_heads: int, head_dim: int) -> torc...
  class BloomAttention (line 236) | class BloomAttention(nn.Module):
    method __init__ (line 237) | def __init__(self, prefix, config: BloomConfig, weights):
    method compute_attention (line 280) | def compute_attention(
    method forward (line 357) | def forward(
  class BloomMLP (line 435) | class BloomMLP(nn.Module):
    method __init__ (line 436) | def __init__(self, prefix, config: BloomConfig, weights):
    method forward (line 450) | def forward(
  class BloomBlock (line 474) | class BloomBlock(nn.Module):
    method __init__ (line 475) | def __init__(self, layer_id: int, config: BloomConfig, weights):
    method forward (line 500) | def forward(
  class BloomPreTrainedModel (line 556) | class BloomPreTrainedModel(PreTrainedModel):
    method _convert_to_standard_cache (line 562) | def _convert_to_standard_cache(
    method _convert_to_bloom_cache (line 582) | def _convert_to_bloom_cache(
  class BloomModel (line 601) | class BloomModel(BloomPreTrainedModel):
    method __init__ (line 602) | def __init__(self, config: BloomConfig, weights):
    method _prepare_attn_mask (line 635) | def _prepare_attn_mask(
    method set_input_embeddings (line 664) | def set_input_embeddings(self, new_embeddings: torch.Tensor):
    method forward (line 667) | def forward(
  class BloomForCausalLM (line 818) | class BloomForCausalLM(BloomPreTrainedModel):
    method __init__ (line 819) | def __init__(self, prefix: str, config, weights):
    method prepare_inputs_for_generation (line 829) | def prepare_inputs_for_generation(
    method forward (line 860) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/clip.py
  class CLIPVisionEmbeddings (line 23) | class CLIPVisionEmbeddings(nn.Module):
    method __init__ (line 24) | def __init__(self, prefix, config: CLIPVisionConfig, weights):
    method forward (line 56) | def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
  class CLIPTextEmbeddings (line 70) | class CLIPTextEmbeddings(nn.Module):
    method __init__ (line 71) | def __init__(self, config: CLIPTextConfig):
    method forward (line 87) | def forward(
  class CLIPAttention (line 109) | class CLIPAttention(nn.Module):
    method __init__ (line 112) | def __init__(self, prefix, config, weights):
    method _shape (line 142) | def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
    method forward (line 149) | def forward(
  class CLIPMLP (line 234) | class CLIPMLP(nn.Module):
    method __init__ (line 235) | def __init__(self, prefix, config, weights):
    method forward (line 246) | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
  class CLIPEncoderLayer (line 253) | class CLIPEncoderLayer(nn.Module):
    method __init__ (line 254) | def __init__(self, prefix, config: CLIPConfig, weights):
    method forward (line 268) | def forward(
  class CLIPPreTrainedModel (line 299) | class CLIPPreTrainedModel(nn.Module):
  class CLIPEncoder (line 386) | class CLIPEncoder(nn.Module):
    method __init__ (line 395) | def __init__(self, prefix, config: CLIPConfig, weights):
    method forward (line 407) | def forward(
  class CLIPTextTransformer (line 446) | class CLIPTextTransformer(nn.Module):
    method __init__ (line 447) | def __init__(self, prefix: str, config: CLIPTextConfig, weights=None):
    method forward (line 461) | def forward(
  class CLIPTextModel (line 533) | class CLIPTextModel(CLIPPreTrainedModel):
    method __init__ (line 538) | def __init__(self, prefix, config: CLIPTextConfig):
    method forward (line 544) | def forward(
  class CLIPVisionTransformer (line 575) | class CLIPVisionTransformer(nn.Module):
    method __init__ (line 576) | def __init__(self, prefix, config: CLIPVisionConfig, weights):
    method forward (line 591) | def forward(
  class CLIPVisionModel (line 619) | class CLIPVisionModel(CLIPPreTrainedModel):
    method __init__ (line 624) | def __init__(self, config: CLIPVisionConfig):
    method get_input_embeddings (line 630) | def get_input_embeddings(self) -> nn.Module:
    method forward (line 633) | def forward(
  class CLIPModel (line 665) | class CLIPModel(nn.Module):
    method __init__ (line 666) | def __init__(self, prefix, config: CLIPConfig, weights):
    method get_text_features (line 691) | def get_text_features(
    method get_image_features (line 724) | def get_image_features(
    method forward (line 760) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_cohere_modeling.py
  class CohereRotary (line 58) | class CohereRotary(PositionRotaryEmbedding):
    method forward (line 59) | def forward(
  class CohereLayerNorm (line 88) | class CohereLayerNorm(nn.Module):
    method __init__ (line 89) | def __init__(self, prefix, weights, eps):
    method forward (line 97) | def forward(self, hidden_states):
  function load_attention (line 112) | def load_attention(config, prefix, weights):
  function _load_gqa (line 125) | def _load_gqa(config, prefix: str, weights):
  class FlashCohereAttention (line 157) | class FlashCohereAttention(torch.nn.Module):
    method __init__ (line 158) | def __init__(
    method forward (line 214) | def forward(
  class CohereMLP (line 283) | class CohereMLP(nn.Module):
    method __init__ (line 284) | def __init__(self, prefix, config, weights):
    method forward (line 315) | def forward(self, hidden_states):
  class FlashCohereLayer (line 323) | class FlashCohereLayer(nn.Module):
    method __init__ (line 324) | def __init__(self, prefix: str, layer_id, config, weights, rotary_emb):
    method forward (line 342) | def forward(
  class FlashCohereModel (line 377) | class FlashCohereModel(torch.nn.Module):
    method __init__ (line 378) | def __init__(self, prefix: str, config, weights):
    method forward (line 415) | def forward(
  class FlashCohereForCausalLM (line 459) | class FlashCohereForCausalLM(torch.nn.Module):
    method __init__ (line 460) | def __init__(self, prefix: str, config, weights):
    method forward (line 483) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_dbrx_modeling.py
  class DbrxAttentionConfig (line 51) | class DbrxAttentionConfig(PretrainedConfig):
    method __init__ (line 52) | def __init__(
  class DbrxFFNConfig (line 73) | class DbrxFFNConfig(PretrainedConfig):
    method __init__ (line 74) | def __init__(
  class DbrxConfig (line 108) | class DbrxConfig(PretrainedConfig):
    method __init__ (line 115) | def __init__(
    method num_key_value_heads (line 168) | def num_key_value_heads(self):
  function promote_scalar (line 174) | def promote_scalar(x: torch.Tensor) -> torch.Tensor:
  function load_attention (line 178) | def load_attention(config, prefix, weights):
  function _load_experts (line 189) | def _load_experts(config, prefix, weights):
  function _load_experts_quantized (line 220) | def _load_experts_quantized(config, prefix, weights, cls):
  class DbrxAttention (line 260) | class DbrxAttention(torch.nn.Module):
    method __init__ (line 261) | def __init__(
    method forward (line 302) | def forward(
  class DbrxNormAttentionNorm (line 363) | class DbrxNormAttentionNorm(nn.Module):
    method __init__ (line 364) | def __init__(
    method forward (line 387) | def forward(
  function select_experts (line 420) | def select_experts(
  function round_up (line 438) | def round_up(x: torch.Tensor, value: int):
  class BlockSparseMoE (line 442) | class BlockSparseMoE(nn.Module):
    method __init__ (line 443) | def __init__(self, prefix, config: DbrxConfig, weights):
    method forward (line 493) | def forward(self, x: torch.Tensor) -> torch.Tensor:
  class DenseMoE (line 505) | class DenseMoE(nn.Module):
    method __init__ (line 506) | def __init__(self, prefix, config: DbrxConfig, weights):
    method forward (line 556) | def forward(self, x: torch.Tensor) -> torch.Tensor:
  class DbrxLayer (line 603) | class DbrxLayer(nn.Module):
    method __init__ (line 604) | def __init__(self, prefix: str, layer_id, config, weights, rotary_emb):
    method forward (line 618) | def forward(
  class DbrxModel (line 648) | class DbrxModel(torch.nn.Module):
    method __init__ (line 649) | def __init__(self, prefix: str, config, weights):
    method forward (line 682) | def forward(
  class FlashDbrxForCausalLM (line 725) | class FlashDbrxForCausalLM(torch.nn.Module):
    method __init__ (line 726) | def __init__(self, prefix: str, config, weights):
    method forward (line 741) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py
  function get_and_maybe_dequant_weights (line 48) | def get_and_maybe_dequant_weights(layer: torch.nn.Module) -> torch.Tensor:
  class DeepseekV2Config (line 60) | class DeepseekV2Config(PretrainedConfig):
    method __init__ (line 61) | def __init__(
  class DeepseekV2Attention (line 166) | class DeepseekV2Attention(torch.nn.Module):
    method __init__ (line 167) | def __init__(
    method _q_proj_and_k_up_proj (line 277) | def _q_proj_and_k_up_proj(self, x):
    method _v_up_proj_and_o_proj (line 292) | def _v_up_proj_and_o_proj(self, x):
    method forward (line 301) | def forward(
  class DeepseekV2MLP (line 422) | class DeepseekV2MLP(nn.Module):
    method __init__ (line 423) | def __init__(self, prefix: str, config, weights, intermediate_size: int):
    method forward (line 453) | def forward(self, hidden_states: torch.Tensor, reduce: bool = True):
  class DeepseekV2MoE (line 461) | class DeepseekV2MoE(nn.Module):
    method __init__ (line 462) | def __init__(
    method forward (line 504) | def forward(self, x: torch.Tensor) -> torch.Tensor:
  class DeepseekV2Layer (line 524) | class DeepseekV2Layer(nn.Module):
    method __init__ (line 525) | def __init__(self, prefix, layer_id, config, weights, rotary_emb):
    method forward (line 564) | def forward(
  class DeepseekV2Model (line 600) | class DeepseekV2Model(torch.nn.Module):
    method __init__ (line 601) | def __init__(self, prefix: str, config, weights: Weights):
    method forward (line 634) | def forward(
  class FlashDeepseekV2ForCausalLM (line 678) | class FlashDeepseekV2ForCausalLM(torch.nn.Module):
    method __init__ (line 679) | def __init__(self, prefix: str, config, weights: Weights):
    method forward (line 691) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_deepseek_v3_modeling.py
  function get_and_maybe_dequant_weights (line 48) | def get_and_maybe_dequant_weights(layer: torch.nn.Module) -> torch.Tensor:
  class DeepseekV3Config (line 60) | class DeepseekV3Config(PretrainedConfig):
    method __init__ (line 61) | def __init__(
  class DeepseekV3Attention (line 166) | class DeepseekV3Attention(torch.nn.Module):
    method __init__ (line 167) | def __init__(
    method _q_proj_and_k_up_proj (line 276) | def _q_proj_and_k_up_proj(self, x):
    method _v_up_proj_and_o_proj (line 291) | def _v_up_proj_and_o_proj(self, x):
    method forward (line 300) | def forward(
  class DeepseekV3MLP (line 421) | class DeepseekV3MLP(nn.Module):
    method __init__ (line 422) | def __init__(self, prefix: str, config, weights, intermediate_size: int):
    method forward (line 452) | def forward(self, hidden_states: torch.Tensor, reduce: bool = True):
  class DeepseekV3MoE (line 460) | class DeepseekV3MoE(nn.Module):
    method __init__ (line 461) | def __init__(
    method forward (line 512) | def forward(self, x: torch.Tensor) -> torch.Tensor:
  class DeepseekV3Layer (line 532) | class DeepseekV3Layer(nn.Module):
    method __init__ (line 533) | def __init__(self, prefix, layer_id, config, weights, rotary_emb):
    method forward (line 572) | def forward(
  class DeepseekV3Model (line 608) | class DeepseekV3Model(torch.nn.Module):
    method __init__ (line 609) | def __init__(self, prefix: str, config, weights: Weights):
    method forward (line 642) | def forward(
  class FlashDeepseekV3ForCausalLM (line 686) | class FlashDeepseekV3ForCausalLM(torch.nn.Module):
    method __init__ (line 687) | def __init__(self, prefix: str, config, weights: Weights):
    method forward (line 699) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py
  class Gemma2Config (line 53) | class Gemma2Config(PretrainedConfig):
    method __init__ (line 54) | def __init__(
  class Gemma2FastRMSNorm (line 109) | class Gemma2FastRMSNorm(FastRMSNorm):
    method load (line 111) | def load(cls, prefix: str, weights, eps=1e-6):
    method forward (line 121) | def forward(self, hidden_states, residual=None):
  function load_attention (line 132) | def load_attention(config, prefix: str, weights):
  function _load_gqa (line 145) | def _load_gqa(config, prefix: str, weights):
  class FlashGemma2Attention (line 167) | class FlashGemma2Attention(torch.nn.Module):
    method __init__ (line 168) | def __init__(
    method forward (line 234) | def forward(
  class Gemma2MLP (line 299) | class Gemma2MLP(nn.Module):
    method __init__ (line 300) | def __init__(self, prefix, config, weights, layer_id):
    method forward (line 349) | def forward(self, hidden_states, adapter_data):
  class FlashGemma2Layer (line 357) | class FlashGemma2Layer(nn.Module):
    method __init__ (line 358) | def __init__(
    method forward (line 401) | def forward(
  class FlashGemma2Model (line 441) | class FlashGemma2Model(torch.nn.Module):
    method __init__ (line 442) | def __init__(self, prefix: str, config, weights, causal: bool):
    method forward (line 477) | def forward(
  class FlashGemma2ForCausalLM (line 524) | class FlashGemma2ForCausalLM(torch.nn.Module):
    method __init__ (line 525) | def __init__(self, prefix: str, config, weights, *, causal: bool = True):
    method forward (line 554) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py
  class Gemma3FastRMSNorm (line 62) | class Gemma3FastRMSNorm(FastRMSNorm):
    method load (line 64) | def load(cls, prefix: str, weights, eps=1e-6):
    method forward (line 74) | def forward(self, hidden_states, residual=None):
  function load_attention (line 85) | def load_attention(config, prefix: str, weights):
  function _load_gqa (line 98) | def _load_gqa(config, prefix: str, weights):
  class FlashGemma3Attention (line 120) | class FlashGemma3Attention(torch.nn.Module):
    method __init__ (line 121) | def __init__(
    method forward (line 198) | def forward(
  class Gemma3MLP (line 275) | class Gemma3MLP(nn.Module):
    method __init__ (line 276) | def __init__(self, prefix, config, weights, layer_id):
    method forward (line 325) | def forward(self, hidden_states, adapter_data):
  class FlashGemma3Layer (line 333) | class FlashGemma3Layer(nn.Module):
    method __init__ (line 334) | def __init__(
    method forward (line 379) | def forward(
  class FlashGemma3Model (line 419) | class FlashGemma3Model(torch.nn.Module):
    method __init__ (line 420) | def __init__(self, prefix: str, config, weights, causal: bool):
    method forward (line 464) | def forward(
  class FlashGemma3ForCausalLM (line 514) | class FlashGemma3ForCausalLM(torch.nn.Module):
    method __init__ (line 515) | def __init__(self, prefix: str, config, weights, *, causal: bool = True):
    method forward (line 545) | def forward(
  class Gemma3MultimodalInputProjection (line 576) | class Gemma3MultimodalInputProjection(torch.nn.Module):
    method __init__ (line 577) | def __init__(self, prefix, config, weights):
    method forward (line 599) | def forward(self, vision_outputs: torch.Tensor):
  class Gemma3ForConditionalGeneration (line 620) | class Gemma3ForConditionalGeneration(nn.Module):
    method __init__ (line 621) | def __init__(self, prefix, config, weights):
    method get_vision_embeds (line 671) | def get_vision_embeds(
    method get_inputs_embeds (line 687) | def get_inputs_embeds(
    method forward (line 704) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_gemma_modeling.py
  class GemmaConfig (line 51) | class GemmaConfig(PretrainedConfig):
    method __init__ (line 52) | def __init__(
  class GemmaFastRMSNorm (line 107) | class GemmaFastRMSNorm(FastRMSNorm):
    method load (line 109) | def load(cls, prefix: str, weights, eps=1e-6):
    method forward (line 119) | def forward(self, hidden_states, residual=None):
  function load_attention (line 130) | def load_attention(config, prefix: str, weights):
  function _load_gqa (line 143) | def _load_gqa(config, prefix: str, weights):
  class FlashGemmaAttention (line 165) | class FlashGemmaAttention(torch.nn.Module):
    method __init__ (line 166) | def __init__(self, prefix: str, config, weights, causal: bool, rotary_...
    method forward (line 198) | def forward(
  class GemmaMLP (line 257) | class GemmaMLP(nn.Module):
    method __init__ (line 258) | def __init__(self, prefix: str, config, weights):
    method forward (line 289) | def forward(self, hidden_states):
  class FlashGemmaLayer (line 295) | class FlashGemmaLayer(nn.Module):
    method __init__ (line 296) | def __init__(self, prefix: str, config, weights, causal: bool, rotary_...
    method forward (line 316) | def forward(
  class FlashGemmaModel (line 352) | class FlashGemmaModel(torch.nn.Module):
    method __init__ (line 353) | def __init__(self, prefix: str, config, weights, causal: bool):
    method forward (line 386) | def forward(
  class FlashGemmaForCausalLM (line 431) | class FlashGemmaForCausalLM(torch.nn.Module):
    method __init__ (line 432) | def __init__(self, prefix: str, config, weights, *, causal: bool = True):
    method forward (line 459) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_gpt2_modeling.py
  function load_qkv (line 45) | def load_qkv(config, prefix: str, weights, head_size, num_heads):
  function _load_qkv_gptq (line 56) | def _load_qkv_gptq(config, prefix: str, weights):
  function _load_qkv (line 87) | def _load_qkv(config, prefix: str, weights, head_size, num_heads):
  function load_row (line 134) | def load_row(config, prefix: str, weights, bias: bool):
  function load_col (line 153) | def load_col(config, prefix: str, weights, bias: bool):
  class FlashGPT2Attention (line 168) | class FlashGPT2Attention(torch.nn.Module):
    method __init__ (line 169) | def __init__(
    method forward (line 209) | def forward(
  class GPT2MLP (line 259) | class GPT2MLP(nn.Module):
    method __init__ (line 260) | def __init__(self, prefix: str, config, weights):
    method forward (line 290) | def forward(self, hidden_states):
  class FlashGPT2Layer (line 296) | class FlashGPT2Layer(nn.Module):
    method __init__ (line 297) | def __init__(self, prefix: str, config, weights):
    method forward (line 313) | def forward(
  class FlashGPT2Model (line 346) | class FlashGPT2Model(torch.nn.Module):
    method __init__ (line 347) | def __init__(self, prefix: str, config, weights):
    method forward (line 377) | def forward(
  class FlashGPT2ForCausalLM (line 416) | class FlashGPT2ForCausalLM(torch.nn.Module):
    method __init__ (line 417) | def __init__(self, prefix: str, config, weights):
    method forward (line 436) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_gptj_modeling.py
  function load_attention (line 55) | def load_attention(config, prefix: str, weights):
  function load_row (line 65) | def load_row(config, prefix: str, weights, bias: bool):
  class GPTJRotary (line 78) | class GPTJRotary(PositionRotaryEmbedding):
    method forward (line 79) | def forward(
  class FlashGPTJAttention (line 107) | class FlashGPTJAttention(torch.nn.Module):
    method __init__ (line 108) | def __init__(
    method forward (line 149) | def forward(
  class GPTJMLP (line 209) | class GPTJMLP(nn.Module):
    method __init__ (line 210) | def __init__(self, prefix: str, config, weights):
    method forward (line 235) | def forward(self, hidden_states):
  class FlashGPTJLayer (line 241) | class FlashGPTJLayer(nn.Module):
    method __init__ (line 242) | def __init__(self, prefix: str, config, weights, rotary_emb):
    method forward (line 256) | def forward(
  class FlashGPTJModel (line 286) | class FlashGPTJModel(torch.nn.Module):
    method __init__ (line 287) | def __init__(self, prefix: str, config, weights):
    method forward (line 323) | def forward(
  class FlashGPTJForCausalLM (line 367) | class FlashGPTJForCausalLM(torch.nn.Module):
    method __init__ (line 368) | def __init__(self, prefix: str, config, weights):
    method forward (line 381) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_llama4_modeling.py
  function reshape_for_broadcast (line 55) | def reshape_for_broadcast(freqs: torch.Tensor, target):
  function apply_rotary_emb (line 61) | def apply_rotary_emb(
  function repeat_kv (line 94) | def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
  class Llama4TextExperts (line 108) | class Llama4TextExperts(nn.Module):
    method __init__ (line 109) | def __init__(self, prefix, config, weights):
    method forward (line 127) | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
  class Llama4TextMLP (line 156) | class Llama4TextMLP(nn.Module):
    method __init__ (line 157) | def __init__(self, prefix, config, weights):
    method forward (line 180) | def forward(self, x):
  class Llama4TextL2Norm (line 186) | class Llama4TextL2Norm(torch.nn.Module):
    method __init__ (line 187) | def __init__(self, eps: float = 1e-6):
    method _norm (line 191) | def _norm(self, x):
    method forward (line 194) | def forward(self, x):
    method extra_repr (line 197) | def extra_repr(self):
  class Llama4TextMoe (line 201) | class Llama4TextMoe(nn.Module):
    method __init__ (line 202) | def __init__(
    method forward (line 223) | def forward(self, hidden_states, adapter_data):
  class Llama4TextRotaryEmbedding (line 265) | class Llama4TextRotaryEmbedding(nn.Module):
    method __init__ (line 266) | def __init__(self, config, device=None):
    method forward (line 281) | def forward(self, x, position_ids):
  class Llama4TextAttention (line 302) | class Llama4TextAttention(FlashLlamaAttention):
    method __init__ (line 305) | def __init__(self, prefix, config, weights, layer_idx):
    method forward (line 325) | def forward(
  class Llama4TextDecoderLayer (line 435) | class Llama4TextDecoderLayer(nn.Module):
    method __init__ (line 436) | def __init__(self, prefix, config, weights, layer_idx):
    method forward (line 460) | def forward(
  class Llama4TextModel (line 507) | class Llama4TextModel(nn.Module):
    method __init__ (line 509) | def __init__(self, prefix, config, weights):
    method forward (line 540) | def forward(
    method _update_causal_mask (line 600) | def _update_causal_mask(
    method create_chunked_attention_mask (line 735) | def create_chunked_attention_mask(
    method _prepare_4d_causal_attention_mask_with_cache_position (line 761) | def _prepare_4d_causal_attention_mask_with_cache_position(
  class Llama4ForCausalLM (line 826) | class Llama4ForCausalLM(nn.Module):
    method __init__ (line 827) | def __init__(self, prefix, config, weights):
    method forward (line 839) | def forward(
  class Llama4VisionMLP2 (line 873) | class Llama4VisionMLP2(torch.nn.Module):
    method __init__ (line 874) | def __init__(self, prefix, config, weights):
    method forward (line 887) | def forward(self, hidden_states):
  class Llama4MultiModalProjector (line 897) | class Llama4MultiModalProjector(nn.Module):
    method __init__ (line 898) | def __init__(self, prefix, config, weights):
    method forward (line 904) | def forward(self, image_features):
  function pixel_shuffle (line 909) | def pixel_shuffle(input_tensor, shuffle_ratio):
  class Llama4VisionPixelShuffleMLP (line 932) | class Llama4VisionPixelShuffleMLP(nn.Module):
    method __init__ (line 933) | def __init__(self, prefix, config, weights):
    method forward (line 944) | def forward(self, encoded_patches: torch.Tensor) -> torch.Tensor:
  function vision_reshape_for_broadcast (line 950) | def vision_reshape_for_broadcast(freqs_ci: torch.Tensor, query: torch.Te...
  class Llama4VisionAttention (line 956) | class Llama4VisionAttention(nn.Module):
    method __init__ (line 957) | def __init__(self, prefix, config, weights):
    method forward (line 981) | def forward(
  class Llama4VisionMLP (line 1027) | class Llama4VisionMLP(nn.Module):
    method __init__ (line 1028) | def __init__(self, prefix, config, weights):
    method forward (line 1039) | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
  class Llama4VisionEncoderLayer (line 1046) | class Llama4VisionEncoderLayer(nn.Module):
    method __init__ (line 1047) | def __init__(self, prefix, config, weights):
    method forward (line 1065) | def forward(
  class Llama4VisionEncoder (line 1093) | class Llama4VisionEncoder(nn.Module):
    method __init__ (line 1102) | def __init__(self, prefix, config, weights):
    method forward (line 1116) | def forward(
  class Llama4UnfoldConvolution (line 1135) | class Llama4UnfoldConvolution(nn.Module):
    method __init__ (line 1136) | def __init__(self, prefix, config, weights):
    method forward (line 1146) | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
  class Llama4VisionRotaryEmbedding (line 1153) | class Llama4VisionRotaryEmbedding(nn.Module):
    method __init__ (line 1154) | def __init__(self, config, weights):
    method forward (line 1192) | def forward(self, hidden_states):
  class Llama4VisionModel (line 1199) | class Llama4VisionModel(nn.Module):
    method __init__ (line 1201) | def __init__(self, prefix, config, weights):
    method forward (line 1243) | def forward(
  class Llama4ForConditionalGeneration (line 1298) | class Llama4ForConditionalGeneration(nn.Module):
    method __init__ (line 1300) | def __init__(self, prefix: str, config, weights):
    method get_image_features (line 1328) | def get_image_features(
    method get_vision_embeds (line 1359) | def get_vision_embeds(
    method get_inputs_embeds (line 1376) | def get_inputs_embeds(
    method forward (line 1411) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py
  function load_attention (line 64) | def load_attention(config, prefix: str, weights, layer_id):
  function no_fp8 (line 117) | def no_fp8(weights: Weights):
  class FlashLlamaAttention (line 129) | class FlashLlamaAttention(torch.nn.Module):
    method __init__ (line 130) | def __init__(
    method forward (line 189) | def forward(
  class Phi3MoE (line 250) | class Phi3MoE(nn.Module):
    method __init__ (line 251) | def __init__(
    method forward (line 274) | def forward(self, x, adapter_data) -> torch.Tensor:
  class LlamaMLP (line 286) | class LlamaMLP(nn.Module):
    method __init__ (line 287) | def __init__(self, prefix, config, weights, index):
    method forward (line 359) | def forward(self, hidden_states, adapter_data):
  class FlashLlamaLayer (line 367) | class FlashLlamaLayer(nn.Module):
    method __init__ (line 368) | def __init__(self, index, prefix, config, weights, rotary_emb):
    method forward (line 420) | def forward(
  class FlashLlamaModel (line 462) | class FlashLlamaModel(torch.nn.Module):
    method __init__ (line 463) | def __init__(self, prefix, config, weights):
    method forward (line 545) | def forward(
  class FlashLlamaForCausalLM (line 594) | class FlashLlamaForCausalLM(torch.nn.Module):
    method __init__ (line 595) | def __init__(self, prefix: str, config, weights, name=None):
    method forward (line 640) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_llava_next.py
  function get_anyres_image_grid_shape (line 37) | def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
  function unpad_image (line 60) | def unpad_image(tensor, original_size):
  class LlavaNextMultiModalProjector (line 94) | class LlavaNextMultiModalProjector(nn.Module):
    method __init__ (line 95) | def __init__(self, prefix, config, weights):
    method forward (line 106) | def forward(self, image_features):
  class FlashLlavaNextForConditionalGeneration (line 113) | class FlashLlavaNextForConditionalGeneration(nn.Module):
    method __init__ (line 114) | def __init__(self, prefix, config, weights):
    method _merge_input_ids_with_image_features (line 149) | def _merge_input_ids_with_image_features(
    method get_vision_embeds (line 166) | def get_vision_embeds(
    method get_inputs_embeds (line 254) | def get_inputs_embeds(
    method forward (line 271) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py
  class MistralConfig (line 52) | class MistralConfig(PretrainedConfig):
    method __init__ (line 55) | def __init__(
  class MistralAttention (line 106) | class MistralAttention(torch.nn.Module):
    method __init__ (line 107) | def __init__(self, prefix: str, config, weights, layer_id, rotary_emb):
    method forward (line 172) | def forward(
  class MistralMLP (line 235) | class MistralMLP(nn.Module):
    method __init__ (line 236) | def __init__(self, prefix: str, config, weights, layer_id):
    method forward (line 290) | def forward(self, hidden_states, adapter_data):
  class MistralLayer (line 298) | class MistralLayer(nn.Module):
    method __init__ (line 299) | def __init__(self, prefix: str, config, weights, layer_id, rotary_emb):
    method forward (line 321) | def forward(
  class MistralModel (line 359) | class MistralModel(torch.nn.Module):
    method __init__ (line 360) | def __init__(self, prefix: str, config, weights):
    method forward (line 401) | def forward(
  class FlashMistralForCausalLM (line 445) | class FlashMistralForCausalLM(torch.nn.Module):
    method __init__ (line 446) | def __init__(self, prefix: str, config, weights, name=None):
    method forward (line 478) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py
  class MixtralConfig (line 51) | class MixtralConfig(PretrainedConfig):
    method __init__ (line 54) | def __init__(
  function promote_scalar (line 109) | def promote_scalar(x: torch.Tensor) -> torch.Tensor:
  function load_attention (line 113) | def load_attention(config, prefix: str, weights):
  function _load_gqa (line 126) | def _load_gqa(config, prefix: str, weights):
  function _load_experts (line 149) | def _load_experts(config, prefix: str, mat, weights):
  class MixtralAttention (line 185) | class MixtralAttention(torch.nn.Module):
    method __init__ (line 186) | def __init__(
    method forward (line 228) | def forward(
  function select_experts (line 288) | def select_experts(gate_logits: torch.Tensor, top_k: int):
  function round_up (line 301) | def round_up(x: torch.Tensor, value: int):
  class MixtralMoE (line 305) | class MixtralMoE(nn.Module):
    method __init__ (line 306) | def __init__(
    method forward (line 330) | def forward(self, x: torch.Tensor) -> torch.Tensor:
  class MixtralLayer (line 342) | class MixtralLayer(nn.Module):
    method __init__ (line 343) | def __init__(self, prefix: str, layer_id, config, weights, rotary_emb):
    method forward (line 370) | def forward(
  class MixtralModel (line 406) | class MixtralModel(torch.nn.Module):
    method __init__ (line 407) | def __init__(self, prefix: str, config, weights):
    method forward (line 445) | def forward(
  class FlashMixtralForCausalLM (line 489) | class FlashMixtralForCausalLM(torch.nn.Module):
    method __init__ (line 490) | def __init__(self, prefix: str, config, weights):
    method forward (line 506) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_mllama.py
  function _prepare_aspect_ratio_attention_mask (line 44) | def _prepare_aspect_ratio_attention_mask(
  function _prepare_4d_causal_attention_mask_with_cache_position (line 76) | def _prepare_4d_causal_attention_mask_with_cache_position(
  function _prepare_cross_attention_mask (line 140) | def _prepare_cross_attention_mask(
  class MllamaVisionMLP (line 173) | class MllamaVisionMLP(nn.Module):
    method __init__ (line 174) | def __init__(self, *, prefix, config, weights):
    method forward (line 185) | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
  class MllamaVisionSdpaAttention (line 192) | class MllamaVisionSdpaAttention(nn.Module):
    method __init__ (line 193) | def __init__(self, *, prefix, config, weights):
    method forward (line 214) | def forward(
  class MllamaVisionEncoderLayer (line 260) | class MllamaVisionEncoderLayer(nn.Module):
    method __init__ (line 261) | def __init__(self, *, prefix, config, weights, is_gated: bool):
    method forward (line 292) | def forward(
  class MllamaVisionEncoder (line 313) | class MllamaVisionEncoder(nn.Module):
    method __init__ (line 314) | def __init__(self, *, prefix, config, weights, is_gated: bool, num_lay...
    method forward (line 327) | def forward(
  class MllamaPrecomputedAspectRatioEmbedding (line 350) | class MllamaPrecomputedAspectRatioEmbedding(nn.Module):
    method __init__ (line 351) | def __init__(self, *, prefix, config, weights):
    method forward (line 364) | def forward(
  class MllamaPrecomputedPositionEmbedding (line 377) | class MllamaPrecomputedPositionEmbedding(nn.Module):
    method __init__ (line 378) | def __init__(self, *, prefix, config, weights):
    method forward (line 399) | def forward(
  class MllamaVisionModel (line 419) | class MllamaVisionModel(nn.Module):
    method __init__ (line 420) | def __init__(self, *, prefix, config, weights):
    method apply_class_embedding (line 496) | def apply_class_embedding(self, hidden_state: torch.Tensor) -> torch.T...
    method forward (line 502) | def forward(
  class MllamaTextCrossAttention (line 634) | class MllamaTextCrossAttention(nn.Module):
    method __init__ (line 637) | def __init__(self, *, prefix, config, weights, layer_idx):
    method forward (line 686) | def forward(
  class MllamaTextMLP (line 744) | class MllamaTextMLP(nn.Module):
    method __init__ (line 745) | def __init__(self, *, prefix, config, weights):
    method forward (line 767) | def forward(self, x):
  class FlashLlamaCrossLayer (line 777) | class FlashLlamaCrossLayer(torch.nn.Module):
    method __init__ (line 780) | def __init__(self, *, prefix, config, weights, index) -> None:
    method forward (line 808) | def forward(
  class MllamaTextRMSNorm (line 852) | class MllamaTextRMSNorm(nn.Module):
    method __init__ (line 853) | def __init__(self, weight, eps):
    method load (line 859) | def load(cls, *, prefix, weights, eps):
    method forward (line 865) | def forward(self, hidden_states):
    method extra_repr (line 872) | def extra_repr(self):
  class FlashMllamaForConditionalGeneration (line 876) | class FlashMllamaForConditionalGeneration(nn.Module):
    method __init__ (line 877) | def __init__(self, prefix, config, weights):
    method vision_forward (line 898) | def vision_forward(self, pixel_values, aspect_ratio_ids, aspect_ratio_...
    method forward (line 916) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_neox_modeling.py
  class GPTNeoXConfig (line 54) | class GPTNeoXConfig(TransformersGPTNeoXConfig):
  function load_row (line 60) | def load_row(config, prefix: str, weights, bias: bool):
  function load_qkv (line 76) | def load_qkv(config, prefix: str, weights, num_heads, head_size, hidden_...
  class FlashNeoxAttention (line 101) | class FlashNeoxAttention(torch.nn.Module):
    method __init__ (line 102) | def __init__(self, config, prefix, weights, rotary_emb):
    method forward (line 138) | def forward(
  class FlashMLP (line 197) | class FlashMLP(nn.Module):
    method __init__ (line 198) | def __init__(self, config, prefix, weights):
    method forward (line 219) | def forward(self, hidden_states):
  class FlashNeoXLayer (line 226) | class FlashNeoXLayer(nn.Module):
    method __init__ (line 227) | def __init__(self, layer_id, config, weights, rotary_emb):
    method forward (line 253) | def forward(
  class FlashGPTNeoXPreTrainedModel (line 311) | class FlashGPTNeoXPreTrainedModel(PreTrainedModel):
  class FlashGPTNeoXModel (line 318) | class FlashGPTNeoXModel(FlashGPTNeoXPreTrainedModel):
    method __init__ (line 319) | def __init__(self, prefix: str, config, weights):
    method forward (line 353) | def forward(
  class FlashGPTNeoXForCausalLM (line 397) | class FlashGPTNeoXForCausalLM(FlashGPTNeoXPreTrainedModel):
    method __init__ (line 398) | def __init__(self, prefix, config, weights):
    method forward (line 412) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_pali_gemma_modeling.py
  class PaliGemmaForConditionalGeneration (line 29) | class PaliGemmaForConditionalGeneration(nn.Module):
    method __init__ (line 30) | def __init__(self, prefix, config, weights):
    method get_vision_embeds (line 67) | def get_vision_embeds(
    method get_inputs_embeds (line 83) | def get_inputs_embeds(
    method forward (line 96) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_phi_modeling.py
  class PhiConfig (line 33) | class PhiConfig(PretrainedConfig):
    method __init__ (line 34) | def __init__(
  function load_attention (line 73) | def load_attention(config, prefix, weights):
  function _load_gqa (line 86) | def _load_gqa(config, prefix: str, weights):
  class FlashPhiAttention (line 110) | class FlashPhiAttention(torch.nn.Module):
    method __init__ (line 111) | def __init__(
    method forward (line 153) | def forward(
  class PhiMLP (line 221) | class PhiMLP(nn.Module):
    method __init__ (line 222) | def __init__(self, prefix, config, weights):
    method forward (line 250) | def forward(self, hidden_states):
  class FlashPhiLayer (line 256) | class FlashPhiLayer(nn.Module):
    method __init__ (line 257) | def __init__(self, prefix: str, layer_id, config, weights, rotary_emb):
    method forward (line 274) | def forward(
  class FlashPhiModel (line 306) | class FlashPhiModel(torch.nn.Module):
    method __init__ (line 307) | def __init__(self, prefix: str, config, weights):
    method forward (line 350) | def forward(
  class FlashPhiForCausalLM (line 394) | class FlashPhiForCausalLM(torch.nn.Module):
    method __init__ (line 395) | def __init__(self, prefix: str, config, weights):
    method forward (line 410) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_phi_moe_modeling.py
  class PhiMoEConfig (line 29) | class PhiMoEConfig(PretrainedConfig):
    method __init__ (line 120) | def __init__(
    method _rope_scaling_validation (line 190) | def _rope_scaling_validation(self):

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py
  function load_attention (line 29) | def load_attention(config, prefix, weights):
  function _load_gqa (line 42) | def _load_gqa(config, prefix: str, weights):
  class Qwen2Attention (line 55) | class Qwen2Attention(torch.nn.Module):
    method __init__ (line 56) | def __init__(
    method forward (line 101) | def forward(
  class Qwen2MLP (line 161) | class Qwen2MLP(nn.Module):
    method __init__ (line 162) | def __init__(self, prefix, config, weights):
    method forward (line 193) | def forward(self, hidden_states):
  class Qwen2Layer (line 199) | class Qwen2Layer(nn.Module):
    method __init__ (line 200) | def __init__(self, prefix, layer_id, config, weights, rotary_emb):
    method forward (line 219) | def forward(
  class Qwen2Model (line 253) | class Qwen2Model(torch.nn.Module):
    method __init__ (line 254) | def __init__(self, prefix: str, config, weights):
    method forward (line 292) | def forward(
  class Qwen2ForCausalLM (line 336) | class Qwen2ForCausalLM(torch.nn.Module):
    method __init__ (line 337) | def __init__(self, prefix: str, config, weights):
    method forward (line 365) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_qwen3_modeling.py
  class Qwen3Attention (line 41) | class Qwen3Attention(nn.Module):
    method __init__ (line 44) | def __init__(self, config, prefix, weights, layer_idx, rotary_emb):
    method forward (line 112) | def forward(
  class Qwen3DecoderLayer (line 177) | class Qwen3DecoderLayer(nn.Module):
    method __init__ (line 178) | def __init__(self, config, prefix, weights, layer_idx: int, rotary_emb):
    method forward (line 198) | def forward(
  class Qwen3Model (line 235) | class Qwen3Model(nn.Module):
    method __init__ (line 236) | def __init__(self, config, prefix: str, weights):
    method forward (line 267) | def forward(
  class Qwen3ForCausalLM (line 314) | class Qwen3ForCausalLM(nn.Module):
    method __init__ (line 316) | def __init__(self, prefix: str, config, weights):
    method forward (line 336) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_qwen3_moe_modeling.py
  function rotate_half (line 47) | def rotate_half(x):
  function apply_rotary_pos_emb (line 54) | def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_di...
  class Qwen3MoeAttention (line 81) | class Qwen3MoeAttention(nn.Module):
    method __init__ (line 84) | def __init__(self, config, prefix, weights, layer_idx, rotary_emb):
    method forward (line 143) | def forward(
  class Qwen3MoE (line 202) | class Qwen3MoE(nn.Module):
    method __init__ (line 203) | def __init__(self, prefix, config, moe_layer_cls: Type[MoELayer], weig...
    method forward (line 226) | def forward(self, x: torch.Tensor) -> torch.Tensor:
  class Qwen3MoeMLP (line 237) | class Qwen3MoeMLP(nn.Module):
    method __init__ (line 238) | def __init__(self, prefix, config, weights, intermediate_size=None):
    method forward (line 267) | def forward(self, x):
  class Qwen3MoeSparseMoeBlock (line 273) | class Qwen3MoeSparseMoeBlock(nn.Module):
    method __init__ (line 274) | def __init__(self, prefix, config, weights):
    method forward (line 295) | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
  class Qwen3MoeDecoderLayer (line 343) | class Qwen3MoeDecoderLayer(nn.Module):
    method __init__ (line 344) | def __init__(self, config, prefix, weights, layer_idx: int, rotary_emb):
    method forward (line 387) | def forward(
  class Qwen3MoeModel (line 428) | class Qwen3MoeModel(nn.Module):
    method __init__ (line 429) | def __init__(self, config, prefix: str, weights):
    method forward (line 460) | def forward(
  class Qwen3MoeForCausalLM (line 502) | class Qwen3MoeForCausalLM(nn.Module):
    method __init__ (line 504) | def __init__(self, prefix: str, config, weights):
    method forward (line 524) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_rw_modeling.py
  function load_row (line 28) | def load_row(config, prefix: str, weights, bias: bool):
  class RWConfig (line 44) | class RWConfig(PretrainedConfig):
    method __init__ (line 51) | def __init__(
  class FlashRWAttention (line 131) | class FlashRWAttention(torch.nn.Module):
    method __init__ (line 132) | def __init__(
    method forward (line 176) | def forward(
  class FlashRWLargeAttention (line 236) | class FlashRWLargeAttention(torch.nn.Module):
    method __init__ (line 237) | def __init__(
    method forward (line 290) | def forward(
  class FlashMLP (line 351) | class FlashMLP(nn.Module):
    method __init__ (line 352) | def __init__(self, config, prefix: str, weights):
    method forward (line 363) | def forward(self, hidden_states):
  class FlashRWLayer (line 370) | class FlashRWLayer(nn.Module):
    method __init__ (line 371) | def __init__(
    method forward (line 420) | def forward(
  class FlashRWLayerNorm (line 477) | class FlashRWLayerNorm(nn.Module):
    method __init__ (line 478) | def __init__(self, config, prefix: str, weights):
    method forward (line 508) | def forward(
  class FlashRWLargeLayer (line 522) | class FlashRWLargeLayer(nn.Module):
    method __init__ (line 523) | def __init__(self, layer_id, prefix: str, config, weights, rotary_emb):
    method forward (line 541) | def forward(
  class FlashRWPreTrainedModel (line 579) | class FlashRWPreTrainedModel(PreTrainedModel):
  class FlashRWModel (line 583) | class FlashRWModel(FlashRWPreTrainedModel):
    method __init__ (line 584) | def __init__(self, prefix: str, config, weights):
    method forward (line 623) | def forward(
  class FlashRWForCausalLM (line 667) | class FlashRWForCausalLM(FlashRWPreTrainedModel):
    method __init__ (line 668) | def __init__(self, prefix: str, config, weights):
    method forward (line 680) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_santacoder_modeling.py
  function load_multi_mqa (line 30) | def load_multi_mqa(
  function _load_multi_mqa_gptq (line 43) | def _load_multi_mqa_gptq(
  function _load_multi_mqa (line 130) | def _load_multi_mqa(
  function load_col (line 200) | def load_col(config, prefix: str, weights, bias: bool):
  function load_row (line 213) | def load_row(config, prefix: str, weights, bias: bool):
  class FlashMQAttention (line 229) | class FlashMQAttention(torch.nn.Module):
    method __init__ (line 230) | def __init__(self, prefix, config, weights):
    method forward (line 265) | def forward(
  class MLP (line 319) | class MLP(nn.Module):
    method __init__ (line 320) | def __init__(self, prefix, config, weights):
    method forward (line 341) | def forward(self, hidden_states):
  class Block (line 348) | class Block(nn.Module):
    method __init__ (line 349) | def __init__(self, prefix: str, layer_id, config, weights):
    method forward (line 369) | def forward(
  class FlashSantacoderModel (line 396) | class FlashSantacoderModel(nn.Module):
    method __init__ (line 397) | def __init__(self, prefix: str, config, weights):
    method forward (line 431) | def forward(
  class FlashSantacoderForCausalLM (line 472) | class FlashSantacoderForCausalLM(nn.Module):
    method __init__ (line 473) | def __init__(self, prefix, config, weights):
    method forward (line 487) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/flash_starcoder2_modeling.py
  class Starcoder2Config (line 57) | class Starcoder2Config(PretrainedConfig):
    method __init__ (line 60) | def __init__(
  function load_attention (line 117) | def load_attention(config, prefix, weights, layer_id):
  function _load_gqa (line 144) | def _load_gqa(config, prefix: str, weights):
  class Starcoder2Attention (line 176) | class Starcoder2Attention(torch.nn.Module):
    method __init__ (line 177) | def __init__(
    method forward (line 228) | def forward(
  class Starcoder2MLP (line 291) | class Starcoder2MLP(nn.Module):
    method __init__ (line 292) | def __init__(self, prefix, config, weights, index):
    method forward (line 334) | def forward(self, hidden_states, adapter_data):
  class Starcoder2GatedMLP (line 340) | class Starcoder2GatedMLP(nn.Module):
    method __init__ (line 341) | def __init__(self, index, prefix, config, weights):
    method forward (line 390) | def forward(self, hidden_states, adapter_data):
  class Starcoder2Layer (line 409) | class Starcoder2Layer(nn.Module):
    method __init__ (line 410) | def __init__(self, layer_id, config, weights, rotary_emb):
    method forward (line 436) | def forward(
  class Starcoder2Model (line 474) | class Starcoder2Model(torch.nn.Module):
    method __init__ (line 475) | def __init__(self, prefix, config, weights):
    method forward (line 511) | def forward(
  class FlashStarcoder2ForCausalLM (line 557) | class FlashStarcoder2ForCausalLM(torch.nn.Module):
    method __init__ (line 558) | def __init__(self, prefix, config, weights):
    method forward (line 587) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/idefics2.py
  function repeat_kv (line 39) | def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
  class Idefics2VisionEmbeddings (line 53) | class Idefics2VisionEmbeddings(nn.Module):
    method __init__ (line 64) | def __init__(self, prefix, config, weights):
    method forward (line 91) | def forward(
  class Idefics2VisionAttention (line 134) | class Idefics2VisionAttention(nn.Module):
    method __init__ (line 135) | def __init__(self, prefix, config, weights):
    method forward (line 164) | def forward(
  class Idefics2VisionMLP (line 232) | class Idefics2VisionMLP(nn.Module):
    method __init__ (line 233) | def __init__(self, prefix, config, weights):
    method forward (line 244) | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
  class Idefics2EncoderLayer (line 251) | class Idefics2EncoderLayer(nn.Module):
    method __init__ (line 252) | def __init__(self, prefix, config, weights):
    method forward (line 269) | def forward(
  class Idefics2Encoder (line 291) | class Idefics2Encoder(nn.Module):
    method __init__ (line 292) | def __init__(self, prefix, config, weights):
    method forward (line 305) | def forward(
  class Idefics2VisionTransformer (line 319) | class Idefics2VisionTransformer(nn.Module):
    method __init__ (line 320) | def __init__(self, prefix, config, weights):
    method forward (line 335) | def forward(
  class Idefics2MLP (line 380) | class Idefics2MLP(nn.Module):
    method __init__ (line 381) | def __init__(self, prefix, config, weights):
    method forward (line 408) | def forward(self, hidden_states):
  class Idefics2RMSNorm (line 418) | class Idefics2RMSNorm(nn.Module):
    method __init__ (line 419) | def __init__(self, prefix, weights, eps):
    method forward (line 429) | def forward(self, hidden_states):
  class Idefics2PerceiverAttention (line 437) | class Idefics2PerceiverAttention(nn.Module):
    method __init__ (line 438) | def __init__(self, prefix, config, weights):
    method forward (line 472) | def forward(
  class Idefics2PerceiverLayer (line 544) | class Idefics2PerceiverLayer(nn.Module):
    method __init__ (line 545) | def __init__(self, prefix, config, weights):
    method forward (line 572) | def forward(
  class Idefics2PerceiverResampler (line 605) | class Idefics2PerceiverResampler(nn.Module):
    method __init__ (line 606) | def __init__(self, prefix, config, weights) -> None:
    method forward (line 632) | def forward(
  class Idefics2Connector (line 664) | class Idefics2Connector(nn.Module):
    method __init__ (line 665) | def __init__(self, prefix, config, weights):
    method forward (line 674) | def forward(self, image_hidden_states, attention_mask):
  class Idefics2ForConditionalGeneration (line 682) | class Idefics2ForConditionalGeneration(nn.Module):
    method __init__ (line 683) | def __init__(self, prefix, config, weights):
    method _merge_input_ids_with_image_features (line 723) | def _merge_input_ids_with_image_features(
    method get_vision_embeds (line 737) | def get_vision_embeds(
    method get_inputs_embeds (line 820) | def get_inputs_embeds(
    method forward (line 835) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/idefics3.py
  function repeat_kv (line 38) | def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
  class Idefics3VisionEmbeddings (line 52) | class Idefics3VisionEmbeddings(nn.Module):
    method __init__ (line 63) | def __init__(self, prefix, config, weights):
    method forward (line 90) | def forward(
  class Idefics3VisionAttention (line 133) | class Idefics3VisionAttention(nn.Module):
    method __init__ (line 134) | def __init__(self, prefix, config, weights):
    method forward (line 163) | def forward(
  class Idefics3VisionMLP (line 231) | class Idefics3VisionMLP(nn.Module):
    method __init__ (line 232) | def __init__(self, prefix, config, weights):
    method forward (line 243) | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
  class Idefics3EncoderLayer (line 250) | class Idefics3EncoderLayer(nn.Module):
    method __init__ (line 251) | def __init__(self, prefix, config, weights):
    method forward (line 268) | def forward(
  class Idefics3Encoder (line 290) | class Idefics3Encoder(nn.Module):
    method __init__ (line 291) | def __init__(self, prefix, config, weights):
    method forward (line 304) | def forward(
  class Idefics3VisionTransformer (line 318) | class Idefics3VisionTransformer(nn.Module):
    method __init__ (line 319) | def __init__(self, prefix, config, weights):
    method forward (line 334) | def forward(
  class Idefics3SimpleMLP (line 379) | class Idefics3SimpleMLP(nn.Module):
    method __init__ (line 380) | def __init__(self, prefix, config, weights):
    method forward (line 391) | def forward(self, x):
  class Idefics3Connector (line 395) | class Idefics3Connector(nn.Module):
    method __init__ (line 396) | def __init__(self, prefix, config, weights):
    method pixel_shuffle (line 401) | def pixel_shuffle(self, x, scale_factor=2):
    method forward (line 417) | def forward(self, image_hidden_states):
  class Idefics3ForConditionalGeneration (line 423) | class Idefics3ForConditionalGeneration(nn.Module):
    method __init__ (line 424) | def __init__(self, prefix, config, weights):
    method _merge_input_ids_with_image_features (line 466) | def _merge_input_ids_with_image_features(
    method get_vision_embeds (line 480) | def get_vision_embeds(
    method get_inputs_embeds (line 563) | def get_inputs_embeds(
    method forward (line 578) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/mamba_modeling.py
  class InferenceParams (line 25) | class InferenceParams:
  class MambaConfig (line 36) | class MambaConfig(PretrainedConfig):
    method __init__ (line 37) | def __init__(
  class MambaBlock (line 71) | class MambaBlock(nn.Module):
    method __init__ (line 72) | def __init__(self, prefix, config, weights, layer_id):
    method forward (line 94) | def forward(self, hidden_states: torch.Tensor, inference_params=None):
    method step (line 140) | def step(self, hidden_states, conv_state, ssm_state):
  class ResidualBlock (line 170) | class ResidualBlock(nn.Module):
    method __init__ (line 171) | def __init__(self, prefix, config, weights, layer_id):
    method forward (line 180) | def forward(
  class MambaModel (line 195) | class MambaModel(nn.Module):
    method __init__ (line 196) | def __init__(self, config, weights):
    method forward (line 218) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/qwen2_5_vl.py
  class Qwen2_5_VLVideosProcessorKwargs (line 68) | class Qwen2_5_VLVideosProcessorKwargs(VideosKwargs, total=False):
  class Qwen2_5_VLProcessorKwargs (line 72) | class Qwen2_5_VLProcessorKwargs(ProcessingKwargs, total=False):
  class Qwen2_5_VLProcessor (line 82) | class Qwen2_5_VLProcessor(ProcessorMixin):
    method __init__ (line 102) | def __init__(
    method __call__ (line 117) | def __call__(
    method batch_decode (line 237) | def batch_decode(self, *args, **kwargs):
    method decode (line 244) | def decode(self, *args, **kwargs):
    method post_process_image_text_to_text (line 251) | def post_process_image_text_to_text(self, generated_outputs):
    method model_input_names (line 270) | def model_input_names(self):
  class Qwen2_5_VLVisionConfig (line 280) | class Qwen2_5_VLVisionConfig(PretrainedConfig):
    method __init__ (line 284) | def __init__(
  class Qwen2_5_VLConfig (line 320) | class Qwen2_5_VLConfig(PretrainedConfig):
    method __init__ (line 322) | def __init__(
  class Qwen2_5VLAttention (line 384) | class Qwen2_5VLAttention(nn.Module):
    method __init__ (line 385) | def __init__(self, *, prefix, config, weights):
    method forward (line 409) | def forward(
  class Qwen2_5VLVisionMLP (line 478) | class Qwen2_5VLVisionMLP(nn.Module):
    method __init__ (line 479) | def __init__(self, *, prefix, config, weights):
    method forward (line 497) | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
  class Qwen2_5VLVisionBlock (line 505) | class Qwen2_5VLVisionBlock(nn.Module):
    method __init__ (line 506) | def __init__(self, prefix, config, weights):
    method forward (line 529) | def forward(self, hidden_states, cu_seqlens, cos, sin, max_seqlen) -> ...
  class Qwen2_5VLPatchMerger (line 539) | class Qwen2_5VLPatchMerger(nn.Module):
    method __init__ (line 540) | def __init__(self, *, prefix, config, weights):
    method forward (line 555) | def forward(self, hidden_states) -> torch.Tensor:
  class Qwen2_5VisionModel (line 564) | class Qwen2_5VisionModel(nn.Module):
    method __init__ (line 565) | def __init__(self, *, prefix, config, weights):
    method apply_class_embedding (line 612) | def apply_class_embedding(self, hidden_state: torch.Tensor) -> torch.T...
    method get_window_index (line 618) | def get_window_index(self, grid_thw):
    method forward (line 665) | def forward(
  class Qwen2_5VLForConditionalGeneration (line 774) | class Qwen2_5VLForConditionalGeneration(nn.Module):
    method __init__ (line 775) | def __init__(self, prefix, config, weights):
    method get_position_ids (line 824) | def get_position_ids(
    method get_vision_embeds (line 898) | def get_vision_embeds(
    method get_inputs_embeds (line 908) | def get_inputs_embeds(
    method forward (line 922) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/qwen2_vl.py
  class Qwen2VLAttention (line 54) | class Qwen2VLAttention(nn.Module):
    method __init__ (line 55) | def __init__(self, *, prefix, config, weights):
    method forward (line 78) | def forward(
  class Qwen2VLVisionMLP (line 147) | class Qwen2VLVisionMLP(nn.Module):
    method __init__ (line 148) | def __init__(self, *, prefix, config, weights):
    method forward (line 158) | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
  class Qwen2VLVisionBlock (line 165) | class Qwen2VLVisionBlock(nn.Module):
    method __init__ (line 166) | def __init__(self, prefix, config, weights):
    method forward (line 189) | def forward(self, hidden_states, cu_seqlens, cos, sin, max_seqlen) -> ...
  class Qwen2VLPatchMerger (line 198) | class Qwen2VLPatchMerger(nn.Module):
    method __init__ (line 199) | def __init__(self, *, prefix, config, weights):
    method forward (line 214) | def forward(self, hidden_states) -> torch.Tensor:
  class Qwen2VisionModel (line 223) | class Qwen2VisionModel(nn.Module):
    method __init__ (line 224) | def __init__(self, *, prefix, config, weights):
    method apply_class_embedding (line 266) | def apply_class_embedding(self, hidden_state: torch.Tensor) -> torch.T...
    method forward (line 272) | def forward(
  class Qwen2VLForConditionalGeneration (line 349) | class Qwen2VLForConditionalGeneration(nn.Module):
    method __init__ (line 350) | def __init__(self, prefix, config, weights):
    method get_position_ids (line 404) | def get_position_ids(
    method get_vision_embeds (line 478) | def get_vision_embeds(
    method get_inputs_embeds (line 488) | def get_inputs_embeds(
    method forward (line 502) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/siglip.py
  class SiglipVisionEmbeddings (line 21) | class SiglipVisionEmbeddings(nn.Module):
    method __init__ (line 22) | def __init__(self, prefix, config: SiglipVisionConfig, weights):
    method forward (line 52) | def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
  class SiglipAttention (line 62) | class SiglipAttention(nn.Module):
    method __init__ (line 65) | def __init__(self, prefix, config, weights):
    method _shape (line 95) | def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
    method forward (line 102) | def forward(
  class SiglipMLP (line 163) | class SiglipMLP(nn.Module):
    method __init__ (line 164) | def __init__(self, prefix, config, weights):
    method forward (line 175) | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
  class SiglipEncoderLayer (line 182) | class SiglipEncoderLayer(nn.Module):
    method __init__ (line 183) | def __init__(self, prefix, config: SiglipConfig, weights):
    method forward (line 197) | def forward(
  class SiglipMultiheadAttentionPoolingHead (line 216) | class SiglipMultiheadAttentionPoolingHead(nn.Module):
    method __init__ (line 219) | def __init__(self, prefix, config: SiglipVisionConfig, weights):
    method forward (line 229) | def forward(self, hidden_state):
  function _trunc_normal_ (line 242) | def _trunc_normal_(tensor, mean, std, a, b):
  function trunc_normal_tf_ (line 278) | def trunc_normal_tf_(
  function variance_scaling_ (line 308) | def variance_scaling_(tensor, scale=1.0, mode="fan_in", distribution="no...
  function lecun_normal_ (line 333) | def lecun_normal_(tensor):
  function default_flax_embed_init (line 337) | def default_flax_embed_init(tensor):
  class SiglipEncoder (line 341) | class SiglipEncoder(nn.Module):
    method __init__ (line 350) | def __init__(self, prefix, config: SiglipConfig, weights):
    method forward (line 362) | def forward(
  class SiglipVisionTransformer (line 377) | class SiglipVisionTransformer(nn.Module):
    method __init__ (line 378) | def __init__(self, prefix, config: SiglipVisionConfig, weights):
    method forward (line 389) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/custom_modeling/vlm.py
  function load_text_model (line 1) | def load_text_model(prefix, config, weights, name=None):
  function load_vision_model (line 42) | def load_vision_model(prefix, config, weights):

FILE: backends/gaudi/server/text_generation_server/models/flash_causal_lm.py
  function generate_block_metadata (line 84) | def generate_block_metadata(
  class FlashCausalLMBatch (line 171) | class FlashCausalLMBatch(Batch):
    method to_pb (line 261) | def to_pb(self) -> generate_pb2.CachedBatch:
    method batch_tokenized_inputs (line 275) | def batch_tokenized_inputs(
    method from_tokenized (line 295) | def from_tokenized(
    method from_pb (line 495) | def from_pb(
    method filter (line 507) | def filter(self, request_ids: List[int]) -> "FlashCausalLMBatch":
    method concatenate (line 699) | def concatenate(
    method prepare_for_decode (line 980) | def prepare_for_decode(
    method prepare_for_prefill (line 1097) | def prepare_for_prefill(
    method __len__ (line 1422) | def __len__(self):
  class FlashCausalLM (line 1438) | class FlashCausalLM(Model):
    method __init__ (line 1439) | def __init__(
    method batch_type (line 1592) | def batch_type(self) -> Type[FlashCausalLMBatch]:
    method max_past (line 1595) | def max_past(self) -> int:
    method init_kv_cache (line 1598) | def init_kv_cache(
    method warmup (line 1631) | def warmup(
    method log_warmup (line 1766) | def log_warmup(self, prefilling, i, max_i, batch_size, seq_len):
    method use_graphs (line 1782) | def use_graphs(self, prefill, seq_len, batch_size):
    method align_workers (line 1791) | def align_workers(self, value, op):
    method warmup_hpu_graph (line 1798) | def warmup_hpu_graph(self, batch):
    method warmup_prefill (line 1908) | def warmup_prefill(
    method warmup_decode (line 1964) | def warmup_decode(self, batch_size: int, block_num: int, batch: FlashC...
    method forward (line 2063) | def forward(
    method generate_token (line 2179) | def generate_token(

FILE: backends/gaudi/server/text_generation_server/models/flash_vlm_causal_lm.py
  function prompt_split_image_llama4 (line 44) | def prompt_split_image_llama4(aspect_ratio, num_patches_per_chunk):
  function _prompt_split_image (line 72) | def _prompt_split_image(
  function get_anyres_image_grid_shape (line 101) | def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
  function image_text_replacement (line 124) | def image_text_replacement(processor, image_input, config) -> str:
  function image_text_replacement_fixup (line 197) | def image_text_replacement_fixup(config, text: str) -> str:
  function preprocess_text (line 205) | def preprocess_text(config, text: str) -> str:
  function preprocess_image (line 211) | def preprocess_image(config, img):
  function get_unpadded_features (line 226) | def get_unpadded_features(
  function get_number_of_features (line 253) | def get_number_of_features(height: int, width: int, config) -> int:
  function scatter_image_embeds (line 280) | def scatter_image_embeds(
  function gather_image_embeds (line 294) | def gather_image_embeds(
  class ImagePositions (line 304) | class ImagePositions:
  class FlashVlmCausalLMBatch (line 312) | class FlashVlmCausalLMBatch(FlashCausalLMBatch):
    method concatenate (line 326) | def concatenate(cls, batches, padded_total_bs: int = 0):
    method filter (line 356) | def filter(self, request_ids: List[int]):
    method batch_tokenized_inputs (line 386) | def batch_tokenized_inputs(
    method get_image_positions (line 464) | def get_image_positions(
    method from_pb_processor (line 535) | def from_pb_processor(
    method prepare_for_prefill (line 558) | def prepare_for_prefill(
    method update_encoder_cache (line 628) | def update_encoder_cache(self, encoder_outputs, request_id, img_pos):
    method gather_vision_embeds (line 633) | def gather_vision_embeds(self):
    method free_encoder_cache (line 696) | def free_encoder_cache(self):
  class FlashVlmCausalLM (line 703) | class FlashVlmCausalLM(FlashCausalLM):
    method __init__ (line 704) | def __init__(
    method batch_type (line 736) | def batch_type(self) -> Type[FlashVlmCausalLMBatch]:
    method max_past (line 739) | def max_past(self) -> Optional[int]:
    method warmup_decode (line 742) | def warmup_decode(
    method warmup_hpu_graph (line 844) | def warmup_hpu_graph(self, batch: FlashVlmCausalLMBatch):
    method get_vision_embeds (line 908) | def get_vision_embeds(
    method get_inputs_embeds (line 923) | def get_inputs_embeds(
    method encode_images (line 933) | def encode_images(self, batch):
    method set_inputs_embeds (line 972) | def set_inputs_embeds(self, batch):
    method forward (line 986) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/globals.py
  function set_model_id (line 35) | def set_model_id(model_id: str):
  function set_adapter_to_index (line 45) | def set_adapter_to_index(adapter_to_index: Dict[str, int]):
  function get_adapter_to_index (line 50) | def get_adapter_to_index():

FILE: backends/gaudi/server/text_generation_server/models/mllama_causal_lm.py
  class FlashMllamaCausalLMBatch (line 45) | class FlashMllamaCausalLMBatch(FlashVlmCausalLMBatch):
    method prepare_for_prefill (line 51) | def prepare_for_prefill(
    method concatenate (line 60) | def concatenate(cls, batches, padded_total_bs: int = 0):
    method filter (line 83) | def filter(self, request_ids: List[int]):
    method batch_tokenized_inputs (line 115) | def batch_tokenized_inputs(
    method from_pb_processor (line 181) | def from_pb_processor(
  function generate_cross_attention_states (line 225) | def generate_cross_attention_states(
  class FlashMllamaCausalLM (line 240) | class FlashMllamaCausalLM(FlashVlmCausalLM):
    method set_inputs_embeds (line 241) | def set_inputs_embeds(self, batch):
    method warmup_decode (line 245) | def warmup_decode(
    method warmup_prefill (line 316) | def warmup_prefill(
    method warmup_hpu_graph (line 378) | def warmup_hpu_graph(self, batch: FlashMllamaCausalLMBatch):
    method forward (line 489) | def forward(

FILE: backends/gaudi/server/text_generation_server/models/model.py
  class Model (line 22) | class Model(ABC):
    method __init__ (line 23) | def __init__(
    method info (line 74) | def info(self) -> InfoResponse:
    method batch_type (line 89) | def batch_type(self) -> Type[B]:
    method generate_token (line 93) | def generate_token(
    method warmup (line 98) | def warmup(
    method decode_token (line 104) | def decode_token(
    method check_initialized (line 134) | def check_initialized(self):

FILE: backends/gaudi/server/text_generation_server/models/seq2seq_lm.py
  class Seq2SeqLMBatch (line 35) | class Seq2SeqLMBatch(Batch):
    method to_pb (line 75) | def to_pb(self) -> generate_pb2.CachedBatch:
    method from_pb (line 85) | def from_pb(
    method filter (line 179) | def filter(self, request_ids: List[int]) -> Optional["Seq2SeqLMBatch"]:
    method concatenate (line 294) | def concatenate(cls, batches: List["Seq2SeqLMBatch"]) -> "Seq2SeqLMBat...
    method __len__ (line 536) | def __len__(self):
  class Seq2SeqLM (line 540) | class Seq2SeqLM(Model):
    method __init__ (line 541) | def __init__(
    method fallback (line 609) | def fallback(
    method batch_type (line 671) | def batch_type(self) -> Type[Seq2SeqLMBatch]:
    method forward (line 674) | def forward(
    method generate_token (line 712) | def generate_token(

FILE: backends/gaudi/server/text_generation_server/models/types.py
  class Batch (line 13) | class Batch(ABC):
    method to_pb (line 15) | def to_pb(self) -> generate_pb2.CachedBatch:
    method from_pb (line 20) | def from_pb(
    method filter (line 30) | def filter(self, request_ids: List[int]) -> "Batch":
    method concatenate (line 35) | def concatenate(cls, batches: List["Batch"]) -> "Batch":
    method __len__ (line 39) | def __len__(self):
  class GeneratedText (line 44) | class GeneratedText:
    method to_pb (line 50) | def to_pb(self) -> generate_pb2.GeneratedText:
  class Tokens (line 60) | class Tokens:
    method to_pb (line 66) | def to_pb(self) -> generate_pb2.Tokens:
    method __len__ (line 74) | def __len__(self):
  class Generation (line 79) | class Generation:
    method to_pb (line 87) | def to_pb(self) -> generate_pb2.Generation:

FILE: backends/gaudi/server/text_generation_server/server.py
  class SignalHandler (line 34) | class SignalHandler:
    method __init__ (line 37) | def __init__(self):
    method exit_gracefully (line 41) | def exit_gracefully(self, signum, frame):
  class TextGenerationService (line 46) | class TextGenerationService(generate_pb2_grpc.TextGenerationServiceServi...
    method __init__ (line 47) | def __init__(
    method Info (line 65) | async def Info(self, request, context):
    method Health (line 68) | async def Health(self, request, context):
    method ServiceDiscovery (line 73) | async def ServiceDiscovery(self, request, context):
    method ClearCache (line 76) | async def ClearCache(self, request, context):
    method FilterBatch (line 83) | async def FilterBatch(self, request, context):
    method Warmup (line 92) | async def Warmup(self, request, context):
    method Prefill (line 144) | async def Prefill(self, request, context):
    method Decode (line 173) | async def Decode(self, request, context):
  function serve (line 201) | def serve(

FILE: backends/gaudi/server/text_generation_server/tracing.py
  class UDSOpenTelemetryAioServerInterceptor (line 16) | class UDSOpenTelemetryAioServerInterceptor(OpenTelemetryAioServerInterce...
    method __init__ (line 17) | def __init__(self):
    method _start_span (line 20) | def _start_span(self, handler_call_details, context, set_status_on_exc...
  function setup_tracing (line 57) | def setup_tracing(otlp_service_name: str, otlp_endpoint: str):

FILE: backends/gaudi/server/text_generation_server/utils/adapter.py
  class AdapterInfo (line 28) | class AdapterInfo:
  class AdapterParameters (line 35) | class AdapterParameters:
  class AdapterSource (line 44) | class AdapterSource:
  function parse_lora_adapters (line 50) | def parse_lora_adapters(lora_adapters: Optional[str]) -> List[AdapterInfo]:
  function load_and_merge_adapters (line 71) | def load_and_merge_adapters(
  class AdapterParametersContainer (line 99) | class AdapterParametersContainer:
    method __hash__ (line 103) | def __hash__(self) -> int:
  function _load_and_merge (line 108) | def _load_and_merge(
  function check_architectures (line 146) | def check_architectures(
  function load_module_map (line 185) | def load_module_map(
  function get_attn_weights (line 233) | def get_attn_weights(i, layer):
  function get_mlp_weights (line 256) | def get_mlp_weights(i, layer):
  function build_layer_weight_lookup (line 294) | def build_layer_weight_lookup(model):

FILE: backends/gaudi/server/text_generation_server/utils/chunks.py
  function concat_text_chunks (line 8) | def concat_text_chunks(chunks: Iterable[generate_pb2.InputChunk]) -> str:

FILE: backends/gaudi/server/text_generation_server/utils/convert.py
  function _remove_duplicate_names (line 12) | def _remove_duplicate_names(
  function convert_file (line 62) | def convert_file(pt_file: Path, sf_file: Path, discard_names: List[str]):
  function convert_files (line 96) | def convert_files(pt_files: List[Path], sf_files: List[Path], discard_na...

FILE: backends/gaudi/server/text_generation_server/utils/debug.py
  function to_gb_rounded (line 17) | def to_gb_rounded(mem: float) -> float:
  function count_hpu_graphs (line 30) | def count_hpu_graphs():
  function dbg_trace (line 34) | def dbg_trace(tag, txt):

FILE: backends/gaudi/server/text_generation_server/utils/dist.py
  class FakeBarrier (line 13) | class FakeBarrier:
    method wait (line 14) | def wait(self):
  class FakeGroup (line 18) | class FakeGroup(ProcessGroup):
    method __init__ (line 19) | def __init__(self, rank, size):
    method allreduce (line 24) | def allreduce(self, *args, **kwargs):
    method allgather (line 27) | def allgather(self, inputs, local_tensor, **kwargs):
    method barrier (line 35) | def barrier(self, *args, **kwargs):
    method size (line 38) | def size(self):
    method rank (line 41) | def rank(self):
    method _get_backend_name (line 44) | def _get_backend_name(self):
  function initialize_torch_distributed (line 48) | def initialize_torch_distributed():

FILE: backends/gaudi/server/text_generation_server/utils/hub.py
  function _cached_weight_files (line 21) | def _cached_weight_files(
  function _weight_hub_files_from_model_info (line 32) | def _weight_hub_files_from_model_info(
  function _weight_files_from_dir (line 46) | def _weight_files_from_dir(d: Path, extension: str) -> List[str]:
  function _get_cached_revision_directory (line 62) | def _get_cached_revision_directory(
  function weight_hub_files (line 97) | def weight_hub_files(
  function try_to_load_from_cache (line 119) | def try_to_load_from_cache(
  function weight_files (line 133) | def weight_files(
  function download_weights (line 188) | def download_weights(

FILE: backends/gaudi/server/text_generation_server/utils/import_utils.py
  function get_hpu_free_memory (line 4) | def get_hpu_free_memory(device, memory_fraction):
  function synchronize_hpu (line 9) | def synchronize_hpu(device):
  function noop (line 13) | def noop(*args, **kwargs):

FILE: backends/gaudi/server/text_generation_server/utils/kernels.py
  function load_kernel (line 9) | def load_kernel(*, module: str, repo_id: str):

FILE: backends/gaudi/server/text_generation_server/utils/log.py
  function log_once (line 6) | def log_once(log, msg: str, master=True):
  function log_master (line 13) | def log_master(log, msg: str):

FILE: backends/gaudi/server/text_generation_server/utils/logits_process.py
  class StaticWarper (line 26) | class StaticWarper:
    method __init__ (line 27) | def __init__(
    method __call__ (line 51) | def __call__(self, scores):
  function static_warper (line 76) | def static_warper(
  class HeterogeneousRepetitionPenaltyLogitsProcessor (line 87) | class HeterogeneousRepetitionPenaltyLogitsProcessor(LogitsProcessor):
    method __init__ (line 99) | def __init__(self, penalty: List[float], dtype: torch.dtype, device: t...
    method __call__ (line 105) | def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> t...
    method filter (line 116) | def filter(self, indices):
  class FrequencyPenaltyLogitsProcessor (line 124) | class FrequencyPenaltyLogitsProcessor(LogitsProcessor):
    method __init__ (line 133) | def __init__(self, penalty: float):
    method __call__ (line 136) | def __call__(
  class HeterogeneousFrequencyPenaltyLogitsProcessor (line 148) | class HeterogeneousFrequencyPenaltyLogitsProcessor(LogitsProcessor):
    method __init__ (line 158) | def __init__(self, penalty: List[float], dtype: torch.dtype, device: t...
    method __call__ (line 164) | def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> t...
    method filter (line 183) | def filter(self, indices):
  class HeterogeneousTemperatureLogitsWarper (line 191) | class HeterogeneousTemperatureLogitsWarper:
    method __init__ (line 202) | def __init__(
    method __call__ (line 210) | def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> t...
    method filter (line 214) | def filter(self, indices):
  class HeterogeneousTopPLogitsWarper (line 222) | class HeterogeneousTopPLogitsWarper(LogitsProcessor):
    method __init__ (line 238) | def __init__(
    method __call__ (line 253) | def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> t...
    method filter (line 273) | def filter(self, indices):
  class HeterogeneousTopKLogitsWarper (line 281) | class HeterogeneousTopKLogitsWarper(LogitsProcessor):
    method __init__ (line 296) | def __init__(
    method __call__ (line 324) | def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> t...
    method filter (line 345) | def filter(self, indices):
  class HeterogeneousTypicalLogitsWarper (line 362) | class HeterogeneousTypicalLogitsWarper(LogitsProcessor):
    method __init__ (line 378) | def __init__(
    method __call__ (line 400) | def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> t...
    method filter (line 436) | def filter(self, indices):
  class HeterogeneousProcessorWrapper (line 452) | class HeterogeneousProcessorWrapper(LogitsProcessor):
    method __init__ (line 460) | def __init__(
    method __call__ (line 466) | def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> t...
    method filter (line 471) | def filter(self, indices):
  class GrammarLogitProcessor (line 483) | class GrammarLogitProcessor(LogitsProcessor):
    method __init__ (line 487) | def __init__(self, tokenizer, device, grammar, grammar_type):
    method __call__ (line 494) | def __call__(
    method advance (line 507) | def advance(self, next_token_id, fsm_grammar_state):
    method _advance (line 513) | def _advance(next_token_id, fsm_grammar_state, fsm):
    method _cached_compile_fsm (line 521) | def _cached_compile_fsm(grammar_type, schema, tokenizer):
    method _cached_adapt_tokenizer (line 533) | def _cached_adapt_tokenizer(tokenizer):
  class HeterogeneousGrammarLogitProcessor (line 561) | class HeterogeneousGrammarLogitProcessor(LogitsProcessor):
    method __init__ (line 562) | def __init__(self, tokenizer, device, grammars, grammar_types):
    method __call__ (line 575) | def __call__(
    method advance_batch (line 590) | def advance_batch(self, next_token_ids, fsm_grammar_states):
    method advance_at_index (line 598) | def advance_at_index(self, next_token_id, fsm_grammar_state, index):
    method filter (line 605) | def filter(self, indices):

FILE: backends/gaudi/server/text_generation_server/utils/merges/strategies.py
  class AdapterParameters (line 17) | class AdapterParameters:
    method __init__ (line 18) | def __init__(
  function _apply_weights (line 28) | def _apply_weights(
  class MergeStrategy (line 44) | class MergeStrategy(ABC):
    method merge (line 45) | def merge(
  class LinearMerge (line 51) | class LinearMerge(MergeStrategy):
    method __init__ (line 52) | def __init__(self, **kwargs):
    method merge (line 55) | def merge(
  class TiesMerge (line 62) | class TiesMerge(MergeStrategy):
    method __init__ (line 63) | def __init__(self, density: float, majority_sign_method: str = "total"...
    method merge (line 67) | def merge(
  class DareLinearMerge (line 86) | class DareLinearMerge(MergeStrategy):
    method __init__ (line 87) | def __init__(self, density: float, **kwargs):
    method merge (line 90) | def merge(
  class DareTiesMerge (line 102) | class DareTiesMerge(MergeStrategy):
    method __init__ (line 103) | def __init__(self, density: float, majority_sign_method: str = "total"...
    method merge (line 107) | def merge(
  function merge_adapters (line 136) | def merge_adapters(
  function _validate_lora_configs (line 193) | def _validate_lora_configs(lora_configs: List["LoraConfig"]):
  function _merge_lora_configs (line 207) | def _merge_lora_configs(lora_configs: List["LoraConfig"]) -> "LoraConfig":

FILE: backends/gaudi/server/text_generation_server/utils/merges/utils.py
  function magnitude_based_pruning (line 23) | def magnitude_based_pruning(tensor: torch.Tensor, density: float) -> tor...
  function random_pruning (line 39) | def random_pruning(tensor: torch.Tensor, density: float, rescale: bool) ...
  function prune (line 56) | def prune(
  function calculate_majority_sign_mask (line 83) | def calculate_majority_sign_mask(
  function disjoint_merge (line 105) | def disjoint_merge(task_tensors, majority_sign_mask):

FILE: backends/gaudi/server/text_generation_server/utils/peft.py
  function download_and_unload_peft (line 10) | def download_and_unload_peft(model_id, revision, trust_remote_code):
  function download_peft (line 48) | def download_peft(

FILE: backends/gaudi/server/text_generation_server/utils/prefill_chunking.py
  function set_support_chunking (line 7) | def set_support_chunking(support_chunking: bool):
  function get_support_chunking (line 12) | def get_support_chunking() -> bool:
  function set_max_prefill_tokens (line 17) | def set_max_prefill_tokens(max_prefill_tokens: int):
  function get_max_prefill_tokens (line 22) | def get_max_prefill_tokens() -> int:

FILE: backends/gaudi/server/text_generation_server/utils/quantization.py
  class _QuantizerConfig (line 14) | class _QuantizerConfig:
  class _FP8QuantizerConfig (line 26) | class _FP8QuantizerConfig:
  function _get_config_json (line 30) | def _get_config_json(model_id: str, revision: Optional[str], filename: s...
  function _get_quantizer_config (line 45) | def _get_quantizer_config(model_id, revision):
  function get_loader (line 122) | def get_loader(

FILE: backends/gaudi/server/text_generation_server/utils/segments.py
  function find_segments (line 10) | def find_segments(
  class SegmentConcatBuilder (line 35) | class SegmentConcatBuilder:
    method __init__ (line 36) | def __init__(self):
    method concat (line 40) | def concat(self, adapter_segments: torch.Tensor, segment_indices: List...
    method build (line 65) | def build(self) -> Tuple[torch.Tensor, List[int]]:

FILE: backends/gaudi/server/text_generation_server/utils/sgmv.py
  function has_sgmv (line 30) | def has_sgmv() -> bool:
  function pad_rank (line 34) | def pad_rank(t: torch.Tensor, dim: int, world_size: int) -> torch.Tensor:
  function use_cutlass_shrink (line 64) | def use_cutlass_shrink(lora_rank: int) -> bool:
  function orient_for_rank (line 68) | def orient_for_rank(t: torch.Tensor, rank: int) -> torch.Tensor:
  function add_lora_sgmv_cutlass (line 75) | def add_lora_sgmv_cutlass(
  function _add_lora_sgmv_cutlass_legacy (line 115) | def _add_lora_sgmv_cutlass_legacy(
  function get_tmp_tensor (line 133) | def get_tmp_tensor(device: torch.device) -> torch.Tensor:
  function get_tmp_tensor_for_size (line 138) | def get_tmp_tensor_for_size(size: int, device: torch.device) -> torch.Te...
  function get_tmp_tensor_for_size_no_kernels (line 143) | def get_tmp_tensor_for_size_no_kernels(size: int, device: torch.device) ...
  function get_tmp_expand_size (line 147) | def get_tmp_expand_size(size: int) -> int:
  function get_tmp_tensors (line 151) | def get_tmp_tensors(
  function lora_a_sgmv_cutlass (line 167) | def lora_a_sgmv_cutlass(
  function lora_b_sgmv_cutlass (line 184) | def lora_b_sgmv_cutlass(
  function add_lora_a_bgmv (line 217) | def add_lora_a_bgmv(
  function add_lora_b_bgmv (line 227) | def add_lora_b_bgmv(
  function segmented_matmul (line 237) | def segmented_matmul(

FILE: backends/gaudi/server/text_generation_server/utils/speculate.py
  function get_speculate (line 4) | def get_speculate() -> int:
  function set_speculate (line 9) | def set_speculate(speculate: int):

FILE: backends/gaudi/server/text_generation_server/utils/tokens.py
  class NextTokenChooser (line 27) | class NextTokenChooser:
    method __init__ (line 28) | def __init__(
    method __call__ (line 84) | def __call__(self, input_ids, scores):
    method advance_grammar (line 103) | def advance_grammar(self, next_id: int):
    method from_pb (line 111) | def from_pb(
  class StopSequenceCriteria (line 134) | class StopSequenceCriteria:
    method __init__ (line 135) | def __init__(self, stop_sequence: str):
    method __call__ (line 139) | def __call__(self, output: str) -> bool:
  class StoppingCriteria (line 145) | class StoppingCriteria:
    method __init__ (line 146) | def __init__(
    method __call__ (line 174) | def __call__(self, last_token: int, last_output: str) -> Tuple[bool, O...
    method from_pb (line 198) | def from_pb(
  function create_n_gram_speculation (line 216) | def create_n_gram_speculation(
  class HeterogeneousNextTokenChooser (line 240) | class HeterogeneousNextTokenChooser:
    method __init__ (line 241) | def __init__(
    method __call__ (line 335) | def __call__(
    method advance_grammar (line 424) | def advance_grammar(self, next_ids: List[int]):
    method advance_grammar_single (line 432) | def advance_grammar_single(self, grammar_state_index: int, next_id: int):
    method advance_grammar_single_with_past_state (line 443) | def advance_grammar_single_with_past_state(
    method filter (line 457) | def filter(self, indices):
    method from_pb (line 500) | def from_pb(
  function pad_next_token_chooser_parameters (line 531) | def pad_next_token_chooser_parameters(
  class Sampling (line 553) | class Sampling:
    method __init__ (line 554) | def __init__(self, seed: int, device: str = "cpu"):
    method __call__ (line 564) | def __call__(self, logits):
  class Greedy (line 572) | class Greedy:
    method __call__ (line 573) | def __call__(self, logits):
  class HeterogeneousSampling (line 577) | class HeterogeneousSampling:
    method __init__ (line 582) | def __init__(self, do_sample: List[bool], seeds: List[int], device: to...
    method __call__ (line 595) | def __call__(self, logits):
    method filter (line 605) | def filter(self, indices):
  function batch_top_tokens (line 619) | def batch_top_tokens(
  function make_tokenizer_optional (line 700) | def make_tokenizer_optional(tokenizer):
  function is_tokenizer_transparent (line 766) | def is_tokenizer_transparent(tokenizer):

FILE: backends/gaudi/server/text_generation_server/utils/version.py
  function get_driver_version (line 6) | def get_driver_version():
  function is_driver_compatible (line 32) | def is_driver_compatible():

FILE: backends/gaudi/server/text_generation_server/utils/watermark.py
  class WatermarkLogitsProcessor (line 26) | class WatermarkLogitsProcessor(LogitsProcessor):
    method __init__ (line 27) | def __init__(
    method _seed_rng (line 40) | def _seed_rng(self, input_ids: Union[List[int], torch.LongTensor]):
    method _get_greenlist_ids (line 55) | def _get_greenlist_ids(
    method _calc_greenlist_mask (line 70) | def _calc_greenlist_mask(
    method _bias_greenlist_logits (line 79) | def _bias_greenlist_logits(
    method __call__ (line 85) | def __call__(

FILE: backends/gaudi/server/text_generation_server/utils/weights.py
  class WeightsLoader (line 11) | class WeightsLoader(ABC):
    method get_weights (line 23) | def get_weights(self, weights: "Weights", prefix: str):
    method get_weights_col_packed (line 30) | def get_weights_col_packed(
    method get_weights_col (line 50) | def get_weights_col(self, weights: "Weights", prefix: str):
    method get_multi_weights_col (line 58) | def get_multi_weights_col(self, weights: "Weights", prefixes: List[str...
    method get_multi_weights (line 66) | def get_multi_weights(self, weights: "Weights", prefixes: List[str], d...
    method get_weights_row (line 74) | def get_weights_row(self, weights: "Weights", prefix: str):
  class Weight (line 82) | class Weight(ABC):
    method get_linear (line 87) | def get_linear(self, bias: torch.Tensor):
  class UnquantizedWeight (line 93) | class UnquantizedWeight(Weight):
    method get_linear (line 96) | def get_linear(self, bias: torch.Tensor):
  class DefaultWeightsLoader (line 102) | class DefaultWeightsLoader(WeightsLoader):
    method __init__ (line 105) | def __init__(self, weight_class: Type[UnquantizedWeight]):
    method get_weights (line 117) | def get_weights(self, weights: "Weights", prefix: str):
    method get_weights_col_packed (line 120) | def get_weights_col_packed(
    method get_multi_weights_col (line 132) | def get_multi_weights_col(self, weights: "Weights", prefixes: List[str...
    method get_weights_row (line 136) | def get_weights_row(self, weights: "Weights", prefix: str):
    method get_multi_weights (line 141) | def get_multi_weights(self, weights: "Weights", prefixes: List[str], d...
  class Weights (line 146) | class Weights:
    method __init__ (line 147) | def __init__(
    method _get_handle (line 177) | def _get_handle(self, filename):
    method get_filename (line 184) | def get_filename(self, tensor_name: str) -> (str, str):
    method _get_slice (line 201) | def _get_slice(self, tensor_name: str):
    method has_tensor (line 207) | def has_tensor(self, tensor_name: str):
    method get_shape (line 214) | def get_shape(self, tensor_name: str):
    method get_tensor (line 217) | def get_tensor(
    method get_partial_sharded (line 242) | def get_partial_sharded(
    method get_sharded (line 275) | def get_sharded(self, tensor_name: str, dim: int, to_device=True, to_d...
    method get_packed_sharded (line 288) | def get_packed_sharded(
    method get_weights (line 357) | def get_weights(self, prefix: str):
    method get_weights_col_packed_qkv (line 360) | def get_weights_col_packed_qkv(
    method get_weights_col_packed_gate_up (line 370) | def get_weights_col_packed_gate_up(self, prefix: str):
    method get_weights_col_packed (line 373) | def get_weights_col_packed(self, prefix: str, block_sizes: Union[int, ...
    method get_weights_col (line 383) | def get_weights_col(self, prefix: str):
    method get_multi_weights_col (line 386) | def get_multi_weights_col(self, prefixes: List[str], dim: int):
    method get_tensor_shard (line 389) | def get_tensor_shard(self, var, dim):
    method get_weights_row (line 405) | def get_weights_row(self, prefix: str):
    method get_multi_weights (line 408) | def get_multi_weights(self, prefixes: List[str], dim: int):
    method use_loader (line 412) | def use_loader(self, weights_loader: WeightsLoader):
    method loader (line 426) | def loader(self):
  function _blocks_to_block_sizes (line 430) | def _blocks_to_block_sizes(total_size: int, blocks: Union[int, List[int]...

FILE: backends/grpc-metadata/src/lib.rs
  type MetadataInjector (line 9) | struct MetadataInjector<'a>(pub &'a mut tonic::metadata::MetadataMap);
  method set (line 13) | fn set(&mut self, key: &str, value: String) {
  function inject (line 23) | fn inject(metadata: &mut tonic::metadata::MetadataMap) {
  type InjectTelemetryContext (line 32) | pub trait InjectTelemetryContext {
    method inject_context (line 33) | fn inject_context(self) -> Self;
    method inject_context (line 37) | fn inject_context(mut self) -> Self {

FILE: backends/llamacpp/build.rs
  type PrefixStripper (line 6) | struct PrefixStripper;
  method generated_name_override (line 9) | fn generated_name_override(&self, item_info: ItemInfo<'_>) -> Option<Str...
  function main (line 14) | fn main() {

FILE: backends/llamacpp/src/backend.rs
  type LlamacppSplitMode (line 22) | pub enum LlamacppSplitMode {
  type Err (line 29) | type Err = String;
  method from_str (line 30) | fn from_str(s: &str) -> Result<Self, Self::Err> {
  type LlamacppNuma (line 43) | pub enum LlamacppNuma {
  type LlamacppGGMLType (line 53) | pub enum LlamacppGGMLType {
    method to_ggml_type (line 89) | fn to_ggml_type(self) -> llamacpp::ggml_type {
  type LlamacppConfig (line 126) | pub struct LlamacppConfig {
  type LlamacppRequest (line 147) | struct LlamacppRequest {
    method new (line 170) | fn new(
  type LlamacppBackend (line 164) | pub struct LlamacppBackend {
    method new (line 427) | pub fn new(
  type Llamacpp (line 193) | struct Llamacpp {
    method new (line 219) | fn new(conf: LlamacppConfig) -> Result<Self, BackendError> {
    method decode (line 284) | fn decode(&mut self) -> i32 {
    method clear_kv_cache (line 288) | fn clear_kv_cache(&mut self, seq_id: llamacpp::llama_seq_id) {
    method batch_push (line 294) | fn batch_push(
  function llamacpp_log_callback (line 201) | extern "C" fn llamacpp_log_callback(
  method drop (line 315) | fn drop(&mut self) {
  type LlamacppSampler (line 326) | struct LlamacppSampler {
    method new (line 331) | fn new(req: &LlamacppRequest) -> Option<Self> {
    method sample (line 381) | fn sample(&self, llamacpp: &mut Llamacpp, idx: usize) -> (llamacpp::ll...
  method drop (line 406) | fn drop(&mut self) {
  type LlamacppSeq (line 413) | struct LlamacppSeq {
  method schedule (line 644) | fn schedule(
  method health (line 659) | async fn health(&self, _: bool) -> bool {
  method name (line 663) | fn name(&self) -> &'static str {
  type BackendError (line 669) | pub enum BackendError {

FILE: backends/llamacpp/src/main.rs
  type Args (line 25) | struct Args {
  function main (line 167) | async fn main() -> Result<(), RouterError> {
  type RouterError (line 335) | enum RouterError {

FILE: backends/llamacpp/src/quantize.rs
  type QuantizeType (line 7) | pub enum QuantizeType {
  function model (line 11) | pub fn model(

FILE: backends/neuron/server/text_generation_server/cli.py
  function serve (line 12) | def serve(
  function download_weights (line 75) | def download_weights(

FILE: backends/neuron/server/text_generation_server/generator.py
  class Generator (line 35) | class Generator(ABC):
    method info (line 43) | def info(self) -> InfoResponse:
    method warmup (line 47) | def warmup(self, batch: Batch) -> int:
    method prefill (line 59) | def prefill(self, batch: Batch) -> Tuple[List[Generation], CachedBatch]:
    method decode (line 74) | def decode(self, batches: List[Batch]) -> Tuple[List[Generation], Cach...
    method filter (line 78) | def filter(self, batch_id: int, request_ids: List[int]) -> CachedBatch:
    method clear (line 82) | def clear(self):
    method from_pretrained (line 87) | def from_pretrained(cls, model_id: str, revision: Optional[str]):
  class Slot (line 92) | class Slot:
    class State (line 95) | class State(Enum):
    method __init__ (line 100) | def __init__(self, id: int, tokenizer: PreTrainedTokenizerBase):
    method clear (line 105) | def clear(self):
    method id (line 123) | def id(self) -> int:
    method state (line 127) | def state(self) -> "Slot.State":
    method batch_id (line 131) | def batch_id(self) -> int:
    method request_id (line 135) | def request_id(self) -> int:
    method cached_text (line 139) | def cached_text(self) -> str:
    method generation_config (line 143) | def generation_config(self) -> GenerationConfig:
    method generated_tokens (line 147) | def generated_tokens(self) -> int:
    method assign (line 150) | def assign(
    method reset (line 198) | def reset(
    method pause (line 221) | def pause(self):
    method resume (line 228) | def resume(self):
    method _decode_next_tokens (line 232) | def _decode_next_tokens(
    method append (line 259) | def append(self, next_token: int) -> str:
    method select (line 284) | def select(
    method stopped (line 301) | def stopped(self) -> bool:
    method generated_text (line 307) | def generated_text(self) -> str:
    method next_token (line 311) | def next_token(self) -> int:
    method attention_mask (line 315) | def attention_mask(self) -> torch.LongTensor:
    method max_token (line 319) | def max_token(self) -> int:
    method max_new_tokens (line 323) | def max_new_tokens(self) -> int:
    method truncate (line 329) | def truncate(self) -> int:
  class NeuronGenerator (line 333) | class NeuronGenerator(Generator):
    method __init__ (line 336) | def __init__(
    method on_device_sampling (line 363) | def on_device_sampling(self) -> bool:
    method info (line 367) | def info(self) -> InfoResponse:
    method warmup (line 376) | def warmup(self, batch: Batch) -> int:
    method max_prefill_length (line 399) | def max_prefill_length(self) -> int:
    method prefill (line 404) | def prefill(self, batch: Batch) -> Tuple[List[Generation], CachedBatch]:
    method decode (line 517) | def decode(
    method _generate_token (line 585) | def _generate_token(
    method _cached_batch (line 652) | def _cached_batch(self, batch_id: int, request_ids: List):
    method filter (line 659) | def filter(self, batch_id: int, keep_request_ids: List[int]) -> Cached...
    method clear (line 677) | def clear(self, batch_id: Optional[int] = None):
    method _clear (line 684) | def _clear(self, keep_slot_ids: List):
    method from_pretrained (line 691) | def from_pretrained(cls, model_id: str, revision: str = None):

FILE: backends/neuron/server/text_generation_server/interceptor.py
  class ExceptionInterceptor (line 10) | class ExceptionInterceptor(AsyncServerInterceptor):
    method intercept (line 11) | async def intercept(

FILE: backends/neuron/server/text_generation_server/model.py
  function get_export_kwargs_from_env (line 17) | def get_export_kwargs_from_env():
  function is_cached (line 36) | def is_cached(model_id):
  function log_cache_size (line 50) | def log_cache_size():
  function fetch_model (line 62) | def fetch_model(

FILE: backends/neuron/server/text_generation_server/server.py
  class TextGenerationService (line 14) | class TextGenerationService(generate_pb2_grpc.TextGenerationServiceServi...
    method __init__ (line 15) | def __init__(self, generator: Generator, server_urls: List[str]):
    method Info (line 19) | async def Info(self, request, context):
    method Health (line 22) | async def Health(self, request, context):
    method ServiceDiscovery (line 25) | async def ServiceDiscovery(self, request, context):
    method ClearCache (line 28) | async def ClearCache(self, request, context):
    method FilterBatch (line 35) | async def FilterBatch(self, request, context):
    method Warmup (line 39) | async def Warmup(self, request, context):
    method Prefill (line 43) | async def Prefill(self, request, context):
    method Decode (line 47) | async def Decode(self, request, context):
  function serve (line 52) | def serve(

FILE: backends/neuron/server/text_generation_server/tgi_env.py
  function parse_cmdline_and_set_env (line 34) | def parse_cmdline_and_set_env(argv: List[str] = None) -> argparse.Namesp...
  function neuron_config_to_env (line 88) | def neuron_config_to_env(neuron_config):
  function sort_neuron_configs (line 114) | def sort_neuron_configs(dictionary):
  function lookup_compatible_cached_model (line 118) | def lookup_compatible_cached_model(
  function check_env_and_neuron_config_compatibility (line 158) | def check_env_and_neuron_config_compatibility(
  function get_env_dict (line 245) | def get_env_dict() -> Dict[str, str]:
  function get_neuron_config_for_model (line 252) | def get_neuron_config_for_model(

FILE: backends/neuron/tests/fixtures/model.py
  function export_model (line 58) | def export_model(model_id, export_kwargs, neuron_model_path):
  function neuron_model_config (line 80) | def neuron_model_config(request):
  function neuron_model_path (line 117) | def neuron_model_path(neuron_model_config):

FILE: backends/neuron/tests/prune_test_models.py
  function main (line 5) | def main():

FILE: backends/neuron/tests/server/helpers.py
  function create_request (line 10) | def create_request(
  function check_prefill (line 40) | def check_prefill(
  function check_decode_single (line 80) | def check_decode_single(
  function check_decode_multiple (line 106) | def check_decode_multiple(model_path):

FILE: backends/neuron/tests/server/test_cached_model.py
  function cached_model_id (line 9) | def cached_model_id(neuron_model_config) -> str:
  function test_model_is_cached (line 26) | def test_model_is_cached(cached_model_id):
  function test_fetch_cached_model (line 30) | def test_fetch_cached_model(cached_model_id: str):
  function test_generator_from_cached_model (line 38) | def test_generator_from_cached_model(cached_model_id: str):

FILE: backends/neuron/tests/server/test_continuous_batching.py
  function test_continuous_batching_two_requests (line 6) | def test_continuous_batching_two_requests(neuron_model_config):

FILE: backends/neuron/tests/server/test_decode.py
  function test_decode (line 6) | def test_decode(neuron_model_config):
  function _test_decode (line 25) | def _test_decode(config_name, generator, do_sample):

FILE: backends/neuron/tests/server/test_generator_slot.py
  function tokenizer (line 12) | def tokenizer(request):
  function test_decode_streaming (line 33) | def test_decode_streaming(tokenizer, input_text, generated_text):

FILE: backends/neuron/tests/server/test_info.py
  function test_info (line 4) | def test_info(neuron_model_path):

FILE: backends/neuron/tests/server/test_prefill.py
  function test_prefill (line 6) | def test_prefill(neuron_model_config):
  function _test_prefill (line 21) | def _test_prefill(config_name, generator, batch_size, do_sample):
  function test_prefill_truncate (line 60) | def test_prefill_truncate(neuron_model_config):

FILE: backends/neuron/tests/test_entry_point.py
  function test_get_neuron_config_for_model (line 15) | def test_get_neuron_config_for_model(neuron_model_config):
  function test_lookup_compatible_cached_model (line 38) | def test_lookup_compatible_cached_model(model_id: str):
  function test_neuron_config_to_env (line 43) | def test_neuron_config_to_env(neuron_model_config) -> None:

FILE: backends/neuron/tgi_entry_point.py
  function main (line 22) | def main():

FILE: backends/trtllm/build.rs
  constant ADDITIONAL_BACKEND_LINK_LIBRARIES (line 8) | const ADDITIONAL_BACKEND_LINK_LIBRARIES: [&str; 1] = ["spdlog"];
  constant CUDA_ARCH_LIST (line 9) | const CUDA_ARCH_LIST: Option<&str> = option_env!("CUDA_ARCH_LIST");
  constant CUDA_REQUIRED_VERSION (line 10) | const CUDA_REQUIRED_VERSION: &str = "12.8";
  constant MPI_REQUIRED_VERSION (line 11) | const MPI_REQUIRED_VERSION: &str = "4.1";
  constant INSTALL_PREFIX (line 12) | const INSTALL_PREFIX: Option<&str> = option_env!("CMAKE_INSTALL_PREFIX");
  constant TENSORRT_ROOT_DIR (line 13) | const TENSORRT_ROOT_DIR: Option<&str> = option_env!("TENSORRT_ROOT_DIR");
  constant NCCL_ROOT_DIR (line 14) | const NCCL_ROOT_DIR: Option<&str> = option_env!("NCCL_ROOT_DIR");
  constant IS_GHA_BUILD (line 16) | const IS_GHA_BUILD: LazyLock<bool> = LazyLock::new(|| {
  constant BACKEND_DEPS (line 26) | const BACKEND_DEPS: &str = "tgi_trtllm_backend_impl";
  constant CUDA_TRANSITIVE_DEPS (line 27) | const CUDA_TRANSITIVE_DEPS: [&str; 4] = ["cuda", "cudart", "cublas", "nv...
  constant TENSORRT_LLM_TRANSITIVE_DEPS (line 28) | const TENSORRT_LLM_TRANSITIVE_DEPS: [(&str, &str); 5] = [
  function get_compiler_flag (line 45) | fn get_compiler_flag(
  function get_library_architecture (line 56) | fn get_library_architecture() -> &'static str {
  function build_backend (line 87) | fn build_backend(is_debug: bool, opt_level: &str, out_dir: &PathBuf) -> ...
  function build_ffi_layer (line 178) | fn build_ffi_layer(deps_folder: &PathBuf, is_debug: bool) {
  function main (line 206) | fn main() {

FILE: backends/trtllm/csrc/backend.cpp
  type huggingface::tgi::backends::trtllm (line 8) | namespace huggingface::tgi::backends::trtllm {

FILE: backends/trtllm/csrc/backend.hpp
  type huggingface::tgi::backends::trtllm (line 17) | namespace huggingface::tgi::backends::trtllm {
    type generation_params_t (line 26) | struct generation_params_t {
    type sampling_params_t (line 33) | struct sampling_params_t {
    type generation_config_t (line 65) | struct generation_config_t {
      method generation_config_t (line 70) | constexpr explicit generation_config_t(const json &config) :
    class backend_workspace_t (line 87) | class backend_workspace_t {
      method backend_workspace_t (line 100) | backend_workspace_t(std::filesystem::path &engines_folder, std::file...
      method backend_workspace_t (line 106) | backend_workspace_t(std::filesystem::path &&engines_folder, std::fil...
      method engines_folder (line 116) | [[nodiscard]] constexpr std::filesystem::path engines_folder() const...
      method generation_config_t (line 123) | [[nodiscard]] constexpr const generation_config_t &generation_config...
    type backend_error_t (line 143) | enum backend_error_t {
    class backend_t (line 155) | class backend_t {
      method backend_t (line 163) | backend_t(std::filesystem::path &&engines_folder, std::filesystem::p...
  type fmt::formatter<huggingface::tgi::backends::trtllm::generation_params_t> (line 212) | struct fmt::formatter<huggingface::tgi::backends::trtllm::generation_par...
    method format (line 213) | auto format(huggingface::tgi::backends::trtllm::generation_params_t co...
  type fmt::formatter<huggingface::tgi::backends::trtllm::sampling_params_t> (line 220) | struct fmt::formatter<huggingface::tgi::backends::trtllm::sampling_param...
    method format (line 221) | auto format(huggingface::tgi::backends::trtllm::sampling_params_t cons...

FILE: backends/trtllm/csrc/ffi.hpp
  type rust::behavior (line 16) | namespace rust::behavior {
    function trycatch (line 18) | static void trycatch(Try &&func, Fail &&fail) noexcept try {
  type huggingface::tgi::backends::trtllm (line 25) | namespace huggingface::tgi::backends::trtllm {
    class tensorrt_llm_backend_t (line 26) | class tensorrt_llm_backend_t
      method tensorrt_llm_backend_t (line 83) | tensorrt_llm_backend_t(std::filesystem::path &&engine_folder, std::f...
      method num_tokens_ready (line 86) | size_t num_tokens_ready() const noexcept { return inner_.num_tokens_...
      method request_id_t (line 88) | request_id_t submit(
      method pull_tokens (line 118) | std::unique_ptr<std::vector<generation_step_t>> pull_tokens() noexce...
      method cancel (line 139) | void cancel(request_id_t request_id) noexcept {
    function finish_reason_t (line 35) | constexpr finish_reason_t as_finish_reason_t(const tle::FinishReason r...
    class tensorrt_llm_backend_t (line 78) | class tensorrt_llm_backend_t {
      method tensorrt_llm_backend_t (line 83) | tensorrt_llm_backend_t(std::filesystem::path &&engine_folder, std::f...
      method num_tokens_ready (line 86) | size_t num_tokens_ready() const noexcept { return inner_.num_tokens_...
      method request_id_t (line 88) | request_id_t submit(
      method pull_tokens (line 118) | std::unique_ptr<std::vector<generation_step_t>> pull_tokens() noexce...
      method cancel (line 139) | void cancel(request_id_t request_id) noexcept {
    function initialize_logging (line 145) | void initialize_logging() {
    function initialize_tensorrt_llm_backend (line 163) | void initialize_tensorrt_llm_backend() {
    function create_backend_from_engine_folder (line 180) | std::unique_ptr<tensorrt_llm_backend_t>
  type huggingface::tgi::backends::trtllm (line 32) | namespace huggingface::tgi::backends::trtllm {

FILE: backends/trtllm/csrc/hardware.hpp
  type huggingface::tgi::hardware::cuda (line 8) | namespace huggingface::tgi::hardware::cuda {
    function get_device_count (line 19) | inline std::optional<size_t> get_device_count() {
    type compute_capabilities_t (line 30) | struct compute_capabilities_t {
      method compute_capabilities_t (line 34) | compute_capabilities_t(): compute_capabilities_t(0) {}
      method compute_capabilities_t (line 35) | explicit compute_capabilities_t(size_t device_idx): major(-1), minor...
      method compute_capabilities_t (line 41) | compute_capabilities_t(int32_t major, int32_t minor): major(major), ...
      method is_at_least (line 48) | [[nodiscard]] constexpr auto is_at_least(std::tuple<uint32_t, uint32...
      method is_at_least_volta (line 54) | [[nodiscard]] constexpr bool is_at_least_volta() const { return is_a...
      method is_at_least_turing (line 60) | [[nodiscard]] constexpr bool is_at_least_turing() const { return is_...
      method is_at_least_ampere (line 66) | [[nodiscard]] constexpr bool is_at_least_ampere() const { return is_...
      method is_at_least_ada_lovelace (line 72) | [[nodiscard]] constexpr bool is_at_least_ada_lovelace() const { retu...
      method is_at_least_hopper (line 78) | [[nodiscard]] constexpr bool is_at_least_hopper() const { return is_...

FILE: backends/trtllm/scripts/setup_sccache.py
  function setup_sccache_locally (line 14) | def setup_sccache_locally():
  function setup_sccache_for_s3 (line 25) | def setup_sccache_for_s3():

FILE: backends/trtllm/src/errors.rs
  type TensorRtLlmBackendError (line 7) | pub enum TensorRtLlmBackendError {

FILE: backends/trtllm/src/lib.rs
  type FinishReason (line 11) | pub enum FinishReason {
  type GenerationStep (line 33) | pub struct GenerationStep {
  function create_backend_from_engine_folder (line 64) | fn create_backend_from_engine_folder(
  function num_tokens_ready (line 69) | fn num_tokens_ready(self: &TensorRtLlmBackendImpl) -> usize;
  function submit (line 71) | fn submit(
  function pull_tokens (line 83) | fn pull_tokens(
  function cancel (line 87) | fn cancel(self: Pin<&mut TensorRtLlmBackendImpl>, request_id: u64);
  method from (line 95) | fn from(reason: FinishReason) -> Self {

FILE: backends/trtllm/src/looper.rs
  type InferResult (line 29) | type InferResult<T> = Result<T, InferError>;
  type GenerationContext (line 32) | struct GenerationContext {
  type DecodedToken (line 41) | struct DecodedToken {
    type Error (line 49) | type Error = InferError;
    method try_from (line 51) | fn try_from(step: &'step GenerationStep) -> Result<Self, Self::Error> {
  function executor_status_looper (line 65) | fn executor_status_looper(
  function post_process_decoded_token (line 170) | fn post_process_decoded_token(
  function ensure_paths_exist (line 218) | fn ensure_paths_exist<P: AsRef<Path>, PP: AsRef<Path>>(
  type TensorRtLlmBackendV2 (line 259) | pub struct TensorRtLlmBackendV2(UnboundedSender<GenerationContext>);
    method new (line 262) | pub fn new<P: AsRef<Path> + Send, PP: AsRef<Path> + Send>(
    method validate (line 286) | fn validate(request: &ValidGenerateRequest) -> InferResult<()> {
  method schedule (line 315) | fn schedule(
  method health (line 340) | async fn health(&self, _: bool) -> bool {
  method name (line 344) | fn name(&self) -> &'static str {

FILE: backends/trtllm/src/main.rs
  type Args (line 19) | struct Args {
  function get_tokenizer (line 74) | async fn get_tokenizer(tokenizer_name: &str, revision: Option<&str>) -> ...
  function main (line 219) | async fn main() -> Result<(), TensorRtLlmBackendError> {

FILE: backends/trtllm/src/utils.rs
  function first_line (line 20) | pub(crate) fn first_line(s: &str, fail: &str) -> String {

FILE: backends/v2/build.rs
  function main (line 3) | fn main() -> Result<(), Box<dyn std::error::Error>> {

FILE: backends/v2/src/backend.rs
  type BackendV2 (line 16) | pub struct BackendV2 {
    method new (line 27) | pub(crate) fn new(
  method schedule (line 73) | fn schedule(
  method health (line 98) | async fn health(&self, current_health: bool) -> bool {
  method start_health (line 108) | fn start_health(&self) -> bool {
  method name (line 112) | fn name(&self) -> &'static str {
  function batching_task (line 122) | pub(crate) async fn batching_task(
  function prefill (line 240) | async fn prefill(
  function decode (line 280) | async fn decode(
  function filter_batch (line 327) | async fn filter_batch(
  function filter_send_generations (line 361) | fn filter_send_generations(generations: Vec<Generation>, entries: &mut I...
  function send_responses (line 386) | fn send_responses(
  function send_errors (line 478) | fn send_errors(error: ClientError, entries: &mut IntMap<u64, Entry>) {
  method from (line 495) | fn from(value: crate::client::GeneratedText) -> Self {

FILE: backends/v2/src/client/grpc_client.rs
  type Client (line 14) | pub struct Client {
    method connect (line 21) | pub async fn connect(uri: Uri) -> Result<Self> {
    method connect_uds (line 30) | pub async fn connect_uds(path: String) -> Result<Self> {
    method service_discovery (line 45) | pub async fn service_discovery(&mut self) -> Result<Vec<String>> {
    method info (line 65) | pub async fn info(&mut self) -> Result<InfoResponse> {
    method health (line 73) | pub async fn health(&mut self) -> Result<HealthResponse> {
    method clear_cache (line 81) | pub async fn clear_cache(&mut self, batch_id: Option<u64>) -> Result<(...
    method filter_batch (line 89) | pub async fn filter_batch(
    method warmup (line 107) | pub async fn warmup(
    method prefill (line 188) | pub async fn prefill(
    method decode (line 206) | pub async fn decode(
  type PrefillTimings (line 225) | pub struct PrefillTimings {
    method new (line 232) | fn new(forward_ns: u64, decode_ns: u64, total_ns: u64) -> Self {
  type DecodeTimings (line 241) | pub struct DecodeTimings {
    method new (line 249) | fn new(concat_ns: Option<u64>, forward_ns: u64, decode_ns: u64, total_...

FILE: backends/v2/src/client/mod.rs
  type Health (line 22) | pub trait Health {
    method device_health (line 24) | async fn device_health(&self) -> Result<()>;
    method model_health (line 28) | async fn model_health(&self) -> Result<()>;
  type ShardInfo (line 32) | pub struct ShardInfo {
  type ClientError (line 41) | pub enum ClientError {
    method from (line 51) | fn from(err: Status) -> Self {
    method from (line 59) | fn from(err: transport::Error) -> Self {
  type Result (line 68) | pub type Result<T> = std::result::Result<T, ClientError>;

FILE: backends/v2/src/client/sharded_client.rs
  type ShardedClient (line 18) | pub struct ShardedClient {
    method new (line 23) | fn new(clients: Vec<Client>) -> Self {
    method from_master_client (line 29) | async fn from_master_client(mut master_client: Client) -> Result<Self> {
    method connect (line 39) | pub async fn connect(uri: Uri) -> Result<Self> {
    method connect_uds (line 45) | pub async fn connect_uds(path: String) -> Result<Self> {
    method info (line 52) | pub async fn info(&mut self) -> Result<ShardInfo> {
    method health (line 63) | pub async fn health(&mut self) -> Result<HealthResponse> {
    method clear_cache (line 74) | pub async fn clear_cache(&mut self, batch_id: Option<u64>) -> Result<(...
    method filter_batch (line 85) | pub async fn filter_batch(
    method warmup (line 103) | pub async fn warmup(
    method prefill (line 135) | pub async fn prefill(
    method decode (line 168) | pub async fn decode(
  method from (line 198) | fn from(value: InfoResponse) -> Self {
  method device_health (line 211) | async fn device_health(&self) -> Result<()> {
  method model_health (line 216) | async fn model_health(&self) -> Result<()> {

FILE: backends/v2/src/lib.rs
  type BackendInfo (line 12) | pub struct BackendInfo {
  function connect_backend (line 33) | pub async fn connect_backend(
  type V2Error (line 130) | pub enum V2Error {

FILE: backends/v2/src/main.rs
  type Args (line 9) | struct Args {
  type Commands (line 82) | enum Commands {
  function main (line 87) | async fn main() -> Result<(), RouterError> {
  type RouterError (line 215) | enum RouterError {

FILE: backends/v2/src/queue.rs
  type Entry (line 18) | pub(crate) struct Entry {
  type Queue (line 35) | pub(crate) struct Queue {
    method new (line 41) | pub(crate) fn new(
    method append (line 63) | pub(crate) fn append(&self, entry: Entry) {
    method next_batch (line 73) | pub(crate) async fn next_batch(
  function queue_task (line 101) | async fn queue_task(
  type State (line 135) | struct State {
    method new (line 159) | fn new(
    method append (line 177) | fn append(&mut self, mut entry: Entry) {
    method next_batch (line 188) | fn next_batch(
  type NextBatch (line 349) | type NextBatch = (IntMap<u64, Entry>, Batch, Span);
  type QueueCommand (line 352) | enum QueueCommand {
  method from (line 365) | fn from(value: ValidParameters) -> Self {
  method from (line 392) | fn from(value: ValidStoppingParameters) -> Self {
  function default_entry (line 407) | fn default_entry() -> (
  function test_append (line 452) | fn test_append() {
  function test_next_batch_empty (line 468) | fn test_next_batch_empty() {
  function test_next_batch_min_size (line 476) | fn test_next_batch_min_size() {
  function test_next_batch_max_size (line 508) | fn test_next_batch_max_size() {
  function test_next_batch_token_budget (line 528) | fn test_next_batch_token_budget() {
  function test_queue_append (line 561) | async fn test_queue_append() {
  function test_queue_next_batch_empty (line 568) | async fn test_queue_next_batch_empty() {
  function test_queue_next_batch_min_size (line 576) | async fn test_queue_next_batch_min_size() {
  function test_queue_next_batch_max_size (line 609) | async fn test_queue_next_batch_max_size() {
  function test_queue_next_batch_token_budget (line 625) | async fn test_queue_next_batch_token_budget() {
  function test_queue_next_batch_token_speculate (line 650) | async fn test_queue_next_batch_token_speculate() {
  function test_queue_next_batch_dropped_receiver (line 669) | async fn test_queue_next_batch_dropped_receiver() {

FILE: backends/v3/benches/prefix_cache.rs
  function prefix_cache_benchmark (line 9) | fn prefix_cache_benchmark(c: &mut Criterion) {

FILE: backends/v3/build.rs
  function main (line 3) | fn main() -> Result<(), Box<dyn std::error::Error>> {

FILE: backends/v3/src/backend.rs
  type BackendV3 (line 18) | pub struct BackendV3 {
    method new (line 29) | pub(crate) fn new(
  method schedule (line 79) | fn schedule(
  method health (line 105) | async fn health(&self, current_health: bool) -> bool {
  method start_health (line 115) | fn start_health(&self) -> bool {
  method name (line 119) | fn name(&self) -> &'static str {
  function batching_task (line 129) | pub(crate) async fn batching_task(
  function prefill (line 297) | async fn prefill(
  function decode (line 342) | async fn decode(
  function filter_batch (line 389) | async fn filter_batch(
  function filter_send_generations (line 423) | fn filter_send_generations(generations: Vec<Generation>, entries: &mut I...
  function send_responses (line 448) | fn send_responses(
  function send_errors (line 540) | fn send_errors(error: ClientError, entries: &mut IntMap<u64, Entry>) {
  method from (line 557) | fn from(value: crate::client::GeneratedText) -> Self {

FILE: backends/v3/src/block_allocator.rs
  type BlockAllocation (line 7) | pub struct BlockAllocation {
  method drop (line 20) | fn drop(&mut self) {
  type BlockAllocator (line 28) | pub struct BlockAllocator {
    method new (line 34) | pub(crate) fn new(
    method allocate (line 57) | pub(crate) async fn allocate(
    method free (line 77) | pub(crate) fn free(&self, blocks: Vec<u32>, allocation_id: u64) {
  function block_allocator_task (line 87) | async fn block_allocator_task(
  type BlockAllocatorCommand (line 119) | enum BlockAllocatorCommand {
  type Allocator (line 131) | pub trait Allocator {
    method allocate (line 132) | fn allocate(
    method free (line 138) | fn free(&mut self, blocks: Vec<u32>, allocation_id: u64);
    method allocate (line 160) | fn allocate(
    method free (line 218) | fn free(&mut self, blocks: Vec<u32>, _allocation_id: u64) {
  type SimpleAllocator (line 140) | pub struct SimpleAllocator {
    method new (line 148) | fn new(blocks: u32, block_size: u32, window_size: Option<u32>) -> Self {

FILE: backends/v3/src/client/grpc_client.rs
  type Client (line 16) | pub struct Client {
    method connect (line 23) | pub async fn connect(uri: Uri) -> Result<Self> {
    method connect_uds (line 32) | pub async fn connect_uds(path: String) -> Result<Self> {
    method service_discovery (line 47) | pub async fn service_discovery(&mut self) -> Result<Vec<String>> {
    method info (line 67) | pub async fn info(&mut self) -> Result<InfoResponse> {
    method health (line 75) | pub async fn health(&mut self) -> Result<HealthResponse> {
    method clear_cache (line 83) | pub async fn clear_cache(&mut self, batch_id: Option<u64>) -> Result<(...
    method filter_batch (line 91) | pub async fn filter_batch(
    method warmup (line 109) | pub async fn warmup(
    method prefill (line 230) | pub async fn prefill(
    method decode (line 258) | pub async fn decode(
  type PrefillTimings (line 277) | pub struct PrefillTimings {
    method new (line 285) | fn new(concat_ns: Option<u64>, forward_ns: u64, decode_ns: u64, total_...
  type DecodeTimings (line 295) | pub struct DecodeTimings {
    method new (line 303) | fn new(concat_ns: Option<u64>, forward_ns: u64, decode_ns: u64, total_...

FILE: backends/v3/src/client/mod.rs
  type Health (line 23) | pub trait Health {
    method device_health (line 25) | async fn device_health(&self) -> Result<()>;
    method model_health (line 29) | async fn model_health(&self) -> Result<()>;
  type ClientError (line 33) | pub enum ClientError {
    method from (line 43) | fn from(err: Status) -> Self {
    method from (line 51) | fn from(err: transport::Error) -> Self {
  method from (line 60) | fn from(chunk: Chunk) -> Self {
  type Result (line 67) | pub type Result<T> = std::result::Result<T, ClientError>;

FILE: backends/v3/src/client/sharded_client.rs
  type ShardedClient (line 18) | pub struct ShardedClient {
    method new (line 23) | fn new(clients: Vec<Client>) -> Self {
    method from_master_client (line 29) | async fn from_master_client(mut master_client: Client) -> Result<Self> {
    method connect (line 39) | pub async fn connect(uri: Uri) -> Result<Self> {
    method connect_uds (line 45) | pub async fn connect_uds(path: String) -> Result<Self> {
    method info (line 52) | pub async fn info(&mut self) -> Result<InfoResponse> {
    method health (line 63) | pub async fn health(&mut self) -> Result<HealthResponse> {
    method clear_cache (line 74) | pub async fn clear_cache(&mut self, batch_id: Option<u64>) -> Result<(...
    method filter_batch (line 85) | pub async fn filter_batch(
    method warmup (line 103) | pub async fn warmup(
    method prefill (line 142) | pub async fn prefill(
    method decode (line 176) | pub async fn decode(
  method device_health (line 207) | async fn device_health(&self) -> Result<()> {
  method model_health (line 212) | async fn model_health(&self) -> Result<()> {

FILE: backends/v3/src/lib.rs
  type BackendInfo (line 14) | pub struct BackendInfo {
  function connect_backend (line 48) | pub async fn connect_backend(
  type V3Error (line 172) | pub enum V3Error {

FILE: backends/v3/src/main.rs
  type Args (line 9) | struct Args {
  type Commands (line 82) | enum Commands {
  function main (line 87) | async fn main() -> Result<(), RouterError> {
  type RouterError (line 231) | enum RouterError {

FILE: backends/v3/src/queue.rs
  type Entry (line 22) | pub(crate) struct Entry {
  type Queue (line 41) | pub(crate) struct Queue {
    method new (line 47) | pub(crate) fn new(
    method append (line 76) | pub(crate) fn append(&self, entry: Entry) {
    method next_batch (line 86) | pub(crate) async fn next_batch(
  function queue_task (line 119) | async fn queue_task(
  type State (line 166) | struct State {
    method new (line 195) | fn new(
    method append (line 226) | fn append(&mut self, mut entry: Entry) {
    method next_batch (line 237) | async fn next_batch(
  type NextBatch (line 507) | type NextBatch = (IntMap<u64, Entry>, Batch, Span);
  type QueueCommand (line 510) | enum QueueCommand {
  method from (line 523) | fn from(value: ValidParameters) -> Self {
  method from (line 550) | fn from(value: ValidStoppingParameters) -> Self {
  function default_entry (line 566) | fn default_entry() -> (
  function test_append (line 612) | async fn test_append() {
  function test_next_batch_empty (line 628) | async fn test_next_batch_empty() {
  function test_next_batch_min_size (line 636) | async fn test_next_batch_min_size() {
  function test_next_batch_max_size (line 668) | async fn test_next_batch_max_size() {
  function test_next_batch_token_budget (line 688) | async fn test_next_batch_token_budget() {
  function test_queue_append (line 721) | async fn test_queue_append() {
  function test_queue_next_batch_empty (line 728) | async fn test_queue_next_batch_empty() {
  function test_queue_next_batch_min_size (line 736) | async fn test_queue_next_batch_min_size() {
  function test_queue_next_batch_max_size (line 769) | async fn test_queue_next_batch_max_size() {
  function test_queue_next_batch_token_budget (line 785) | async fn test_queue_next_batch_token_budget() {
  function test_queue_next_batch_token_speculate (line 810) | async fn test_queue_next_batch_token_speculate() {
  function test_queue_next_batch_dropped_receiver (line 829) | async fn test_queue_next_batch_dropped_receiver() {

FILE: backends/v3/src/radix.rs
  function hash (line 9) | fn hash(slice: &[u32]) -> u64 {
  type RadixAllocator (line 20) | pub struct RadixAllocator {
    method new (line 39) | pub fn new(block_size: u32, n_blocks: u32, window_size: Option<u32>) -...
    method alloc_or_reclaim (line 52) | fn alloc_or_reclaim(&mut self, n_blocks_needed: usize) -> Option<Vec<u...
  method allocate (line 82) | fn allocate(
  method free (line 157) | fn free(&mut self, blocks: Vec<u32>, allocation_id: u64) {
  type RadixAllocation (line 211) | struct RadixAllocation {
  type TrieError (line 230) | pub enum TrieError {
  type NodeId (line 235) | pub type NodeId = DefaultKey;
  type RadixTrie (line 238) | pub struct RadixTrie {
    method new (line 258) | pub fn new(block_size: usize) -> Self {
    method find (line 280) | pub fn find(&mut self, key: &[u32], blocks: &mut Vec<u32>) -> NodeId {
    method find_ (line 286) | fn find_(&mut self, node_id: NodeId, key: &[u32], blocks: &mut Vec<u32...
    method decref (line 313) | pub fn decref(&mut self, node_id: NodeId) -> Result<(), TrieError> {
    method incref (line 342) | pub fn incref(&mut self, node_id: NodeId) -> Result<(), TrieError> {
    method evict (line 363) | pub fn evict(&mut self, n_blocks: usize) -> Vec<u32> {
    method insert (line 416) | pub fn insert(&mut self, tokens: &[u32], blocks: &[u32]) -> Result<usi...
    method insert_ (line 423) | fn insert_(
    method split_node (line 473) | fn split_node(&mut self, node_id: NodeId, prefix_len: usize) -> NodeId {
    method add_node (line 509) | fn add_node(
    method add_node_to_parent (line 529) | fn add_node_to_parent(&mut self, parent_id: NodeId, hash: u64, child_i...
    method remove_node (line 540) | fn remove_node(&mut self, node_id: NodeId) -> TrieNode {
    method update_access_time (line 558) | fn update_access_time(&mut self, node_id: NodeId) {
    method print_debug (line 575) | pub fn print_debug(&self) {
    method print_debug_ (line 579) | fn print_debug_(&self, node_id: NodeId, indent: usize) {
    method root_id (line 597) | pub(crate) fn root_id(&self) -> DefaultKey {
  type TrieNode (line 604) | struct TrieNode {
    method new (line 614) | fn new(key: Vec<u32>, blocks: Vec<u32>, last_accessed: u64, parent: Op...
  function shared_prefix (line 626) | fn shared_prefix(left: &[u32], right: &[u32], block_size: usize) -> usize {
  function allocator_block_size (line 647) | fn allocator_block_size() {
  function allocator_block_size_non_aligned (line 662) | fn allocator_block_size_non_aligned() {
  function allocator_reuses_prefixes (line 677) | fn allocator_reuses_prefixes() {
  function allocator_collects_older_prefixes_first (line 691) | fn allocator_collects_older_prefixes_first() {
  function allocator_frees_fully_overlapping_prefills (line 711) | fn allocator_frees_fully_overlapping_prefills() {
  function allocator_frees_partially_overlapping_prefills (line 727) | fn allocator_frees_partially_overlapping_prefills() {
  function trie_insertions_have_correct_prefix_len (line 769) | fn trie_insertions_have_correct_prefix_len() {
  function trie_insertions_block_size (line 792) | fn trie_insertions_block_size() {
  function trie_get_returns_correct_blocks (line 816) | fn trie_get_returns_correct_blocks() {
  function trie_evict_removes_correct_blocks (line 850) | fn trie_evict_removes_correct_blocks() {
  function full_match_returns_correct_node (line 888) | fn full_match_returns_correct_node() {
  function partial_match_does_not_recurse (line 899) | fn partial_match_does_not_recurse() {
  type AllocationWithInfo (line 910) | struct AllocationWithInfo {
  function invariants_hold_on_many_operations_remove_all (line 919) | fn invariants_hold_on_many_operations_remove_all() {
  function invariants_hold_on_many_operations_remove_subset (line 924) | fn invariants_hold_on_many_operations_remove_subset() {
  function invariants_hold_on_many_insertions (line 928) | fn invariants_hold_on_many_insertions(remove_all: bool) {
  function check_allocation_invariants (line 1014) | fn check_allocation_invariants(allocations: &[AllocationWithInfo]) {

FILE: benchmark/src/app.rs
  type App (line 15) | pub(crate) struct App {
    method new (line 33) | pub(crate) fn new(
    method handle_key_event (line 69) | pub(crate) fn handle_key_event(&mut self, key_event: KeyEvent) {
    method tick (line 125) | pub(crate) fn tick(&mut self) {
    method render (line 155) | pub fn render(&mut self, f: &mut Frame) {
  type Data (line 367) | pub(crate) struct Data {
    method new (line 379) | fn new(n_run: usize, batch_size: Vec<u32>) -> Self {
    method push_prefill (line 406) | fn push_prefill(&mut self, prefill: Prefill, batch_idx: usize) {
    method push_decode (line 412) | fn push_decode(&mut self, decode: Decode, batch_idx: usize) {
    method end_batch (line 420) | fn end_batch(&mut self, batch_idx: usize) {
  function progress_gauge (line 437) | fn progress_gauge(title: &str, label: String, progress: f64, color: Colo...
  function throughput_paragraph (line 446) | fn throughput_paragraph<'a>(throughput: &[f64], name: &'static str) -> P...
  function latency_paragraph (line 459) | fn latency_paragraph<'a>(latency: &mut [f64], name: &'static str) -> Par...
  function statis_spans (line 485) | fn statis_spans<'a>(data: &[f64], unit: &'static str) -> Vec<Line<'a>> {
  function latency_histogram_data (line 516) | fn latency_histogram_data(latency: &[f64], bins: usize) -> Vec<(String, ...
  function latency_histogram (line 529) | fn latency_histogram<'a>(
  function latency_throughput_chart (line 544) | fn latency_throughput_chart<'a>(
  function color_vec (line 674) | fn color_vec() -> Vec<Color> {

FILE: benchmark/src/event.rs
  type Event (line 8) | pub(crate) enum Event {
  function terminal_event_task (line 17) | pub(crate) async fn terminal_event_task(
  function event_loop (line 33) | async fn event_loop(fps: u32, event_sender: mpsc::Sender<Event>) {

FILE: benchmark/src/generation.rs
  constant LOREM_IPSUM (line 10) | const LOREM_IPSUM: &str = "Lorem ipsum dolor sit amet, consectetur adipi...
  type Prefill (line 13) | pub(crate) struct Prefill {
  type Decode (line 19) | pub(crate) struct Decode {
  type Message (line 26) | pub(crate) enum Message {
  function generation_task (line 36) | pub(crate) async fn generation_task(
  function generate_runs (line 64) | async fn generate_runs(
  function prefill (line 132) | async fn prefill(
  function decode (line 197) | async fn decode(batch: CachedBatch, client: &mut ShardedClient) -> Resul...
  function create_sequence (line 227) | fn create_sequence(sequence_length: u32, tokenizer: Tokenizer) -> String {

FILE: benchmark/src/lib.rs
  function run (line 19) | pub async fn run(

FILE: benchmark/src/main.rs
  type Args (line 16) | struct Args {
  function main (line 108) | fn main() -> Result<(), Box<dyn std::error::Error>> {
  function init_logging (line 211) | fn init_logging() {

FILE: benchmark/src/table.rs
  function parameters_table (line 6) | pub(crate) fn parameters_table(
  function latency_table (line 46) | pub(crate) fn latency_table(data: &Data) -> Table {
  function throughput_table (line 84) | pub(crate) fn throughput_table(data: &Data) -> Table {
  function add_latencies (line 107) | fn add_latencies(
  function add_throuhgputs (line 132) | fn add_throuhgputs(
  function avg_min_max (line 154) | fn avg_min_max(data: &[f64]) -> (f64, f64, f64) {
  function px (line 167) | fn px(data: &[f64], p: u32) -> f64 {
  function format_value (line 172) | fn format_value(value: f64, unit: &'static str) -> String {

FILE: benchmark/src/utils.rs
  function histogram (line 16) | pub(crate) fn histogram(values: &[f64], bins: usize) -> Vec<(f64, usize)> {
  function percentiles (line 35) | pub(crate) fn percentiles(values: &[f64], pecents: &[i32]) -> BTreeMap<S...

FILE: clients/python/tests/conftest.py
  function flan_t5_xxl (line 8) | def flan_t5_xxl():
  function llama_7b (line 13) | def llama_7b():
  function fake_model (line 18) | def fake_model():
  function unsupported_model (line 23) | def unsupported_model():
  function base_url (line 28) | def base_url():
  function bloom_url (line 33) | def bloom_url(base_url, bloom_model):
  function flan_t5_xxl_url (line 38) | def flan_t5_xxl_url(base_url, flan_t5_xxl):
  function llama_7b_url (line 43) | def llama_7b_url(base_url, llama_7b):
  function fake_url (line 48) | def fake_url(base_url, fake_model):
  function unsupported_url (line 53) | def unsupported_url(base_url, unsupported_model):
  function hf_headers (line 58) | def hf_headers():

FILE: clients/python/tests/test_client.py
  function test_generate (line 8) | def test_generate(llama_7b_url, hf_headers):
  function test_generate_best_of (line 24) | def test_generate_best_of(llama_7b_url, hf_headers):
  function test_generate_not_found (line 36) | def test_generate_not_found(fake_url, hf_headers):
  function test_generate_validation_error (line 42) | def test_generate_validation_error(llama_7b_url, hf_headers):
  function test_generate_stream (line 48) | def test_generate_stream(llama_7b_url, hf_headers):
  function test_generate_stream_not_found (line 63) | def test_generate_stream_not_found(fake_url, hf_headers):
  function test_generate_stream_validation_error (line 69) | def test_generate_stream_validation_error(llama_7b_url, hf_headers):
  function test_generate_async (line 76) | async def test_generate_async(llama_7b_url, hf_headers):
  function test_generate_async_best_of (line 98) | async def test_generate_async_best_of(llama_7b_url, hf_headers):
  function test_generate_async_not_found (line 111) | async def test_generate_async_not_found(fake_url, hf_headers):
  function test_generate_async_validation_error (line 118) | async def test_generate_async_validation_error(llama_7b_url, hf_headers):
  function test_generate_stream_async (line 125) | async def test_generate_stream_async(llama_7b_url, hf_headers):
  function test_generate_stream_async_not_found (line 141) | async def test_generate_stream_async_not_found(fake_url, hf_headers):
  function test_generate_stream_async_validation_error (line 149) | async def test_generate_stream_async_validation_error(llama_7b_url, hf_h...

FILE: clients/python/tests/test_errors.py
  function test_generation_error (line 16) | def test_generation_error():
  function test_incomplete_generation_error (line 21) | def test_incomplete_generation_error():
  function test_overloaded_error (line 26) | def test_overloaded_error():
  function test_validation_error (line 31) | def test_validation_error():
  function test_bad_request_error (line 36) | def test_bad_request_error():
  function test_shard_not_ready_error (line 41) | def test_shard_not_ready_error():
  function test_shard_timeout_error (line 47) | def test_shard_timeout_error():
  function test_not_found_error (line 52) | def test_not_found_error():
  function test_rate_limit_exceeded_error (line 57) | def test_rate_limit_exceeded_error():
  function test_unknown_error (line 62) | def test_unknown_error():

FILE: clients/python/tests/test_types.py
  function test_parameters_validation (line 7) | def test_parameters_validation():
  function test_request_validation (line 72) | def test_request_validation():

FILE: clients/python/text_generation/client.py
  class Client (line 31) | class Client:
    method __init__ (line 52) | def __init__(
    method completion (line 76) | def completion(
    method _completion_stream_response (line 142) | def _completion_stream_response(self, request):
    method chat (line 164) | def chat(
    method _chat_stream_response (line 264) | def _chat_stream_response(self, request):
    method generate (line 286) | def generate(
    method generate_stream (line 392) | def generate_stream(
  class AsyncClient (line 513) | class AsyncClient:
    method __init__ (line 535) | def __init__(
    method completion (line 559) | async def completion(
    method _completion_single_response (line 615) | async def _completion_single_response(self, request):
    method _completion_stream_response (line 627) | async def _completion_stream_response(self, request):
    method chat (line 646) | async def chat(
    method _chat_single_response (line 736) | async def _chat_single_response(self, request):
    method _chat_stream_response (line 748) | async def _chat_stream_response(self, request):
    method generate (line 772) | async def generate(
    method generate_stream (line 877) | async def generate_stream(

FILE: clients/python/text_generation/errors.py
  class ValidationError (line 5) | class ValidationError(Exception):
    method __init__ (line 6) | def __init__(self, message: str):
  class GenerationError (line 10) | class GenerationError(Exception):
    method __init__ (line 11) | def __init__(self, message: str):
  class OverloadedError (line 15) | class OverloadedError(Exception):
    method __init__ (line 16) | def __init__(self, message: str):
  class IncompleteGenerationError (line 20) | class IncompleteGenerationError(Exception):
    method __init__ (line 21) | def __init__(self, message: str):
  class BadRequestError (line 26) | class BadRequestError(Exception):
    method __init__ (line 27) | def __init__(self, message: str):
  class ShardNotReadyError (line 31) | class ShardNotReadyError(Exception):
    method __init__ (line 32) | def __init__(self, message: str):
  class ShardTimeoutError (line 36) | class ShardTimeoutError(Exception):
    method __init__ (line 37) | def __init__(self, message: str):
  class NotFoundError (line 41) | class NotFoundError(Exception):
    method __init__ (line 42) | def __init__(self, message: str):
  class RateLimitExceededError (line 46) | class RateLimitExceededError(Exception):
    method __init__ (line 47) | def __init__(self, message: str):
  class NotSupportedError (line 51) | class NotSupportedError(Exception):
    method __init__ (line 52) | def __init__(self, model_id: str):
  class UnknownError (line 61) | class UnknownError(Exception):
    method __init__ (line 62) | def __init__(self, message: str):
  function parse_error (line 66) | def parse_error(status_code: int, payload: Dict[str, str]) -> Exception:

FILE: clients/python/text_generation/inference_api.py
  function deployed_models (line 16) | def deployed_models(headers: Optional[Dict] = None) -> List[DeployedModel]:
  function check_model_support (line 37) | def check_model_support(repo_id: str, headers: Optional[Dict] = None) ->...
  class InferenceAPIClient (line 59) | class InferenceAPIClient(Client):
    method __init__ (line 83) | def __init__(self, repo_id: str, token: Optional[str] = None, timeout:...
  class InferenceAPIAsyncClient (line 115) | class InferenceAPIAsyncClient(AsyncClient):
    method __init__ (line 140) | def __init__(self, repo_id: str, token: Optional[str] = None, timeout:...

FILE: clients/python/text_generation/types.py
  class GrammarType (line 9) | class GrammarType(str, Enum):
  class Grammar (line 15) | class Grammar(BaseModel):
  class ToolCall (line 22) | class ToolCall(BaseModel):
  class Chunk (line 31) | class Chunk(BaseModel):
  class Message (line 37) | class Message(BaseModel):
  class Tool (line 48) | class Tool(BaseModel):
  class Function (line 55) | class Function(BaseModel):
  class ChoiceDeltaToolCall (line 60) | class ChoiceDeltaToolCall(BaseModel):
  class ChoiceDelta (line 67) | class ChoiceDelta(BaseModel):
  class Choice (line 73) | class Choice(BaseModel):
  class CompletionRequest (line 80) | class CompletionRequest(BaseModel):
  class CompletionComplete (line 106) | class CompletionComplete(BaseModel):
  class Completion (line 117) | class Completion(BaseModel):
  class ChatRequest (line 127) | class ChatRequest(BaseModel):
  class ChatCompletionComplete (line 169) | class ChatCompletionComplete(BaseModel):
  class ChatComplete (line 182) | class ChatComplete(BaseModel):
  class ChatCompletionChunk (line 193) | class ChatCompletionChunk(BaseModel):
  class Parameters (line 203) | class Parameters(BaseModel):
    method valid_best_of (line 247) | def valid_best_of(cls, field_value, values):
    method valid_repetition_penalty (line 266) | def valid_repetition_penalty(cls, v):
    method valid_frequency_penalty (line 272) | def valid_frequency_penalty(cls, v):
    method valid_seed (line 278) | def valid_seed(cls, v):
    method valid_temp (line 284) | def valid_temp(cls, v):
    method valid_top_k (line 290) | def valid_top_k(cls, v):
    method valid_top_p (line 296) | def valid_top_p(cls, v):
    method valid_truncate (line 302) | def valid_truncate(cls, v):
    method valid_typical_p (line 308) | def valid_typical_p(cls, v):
    method valid_top_n_tokens (line 314) | def valid_top_n_tokens(cls, v):
    method valid_grammar (line 320) | def valid_grammar(cls, v):
  class Request (line 329) | class Request(BaseModel):
    method valid_input (line 338) | def valid_input(cls, v):
    method valid_best_of_stream (line 344) | def valid_best_of_stream(cls, field_value, values):
  class InputToken (line 359) | class InputToken(BaseModel):
  class Token (line 370) | class Token(BaseModel):
  class FinishReason (line 383) | class FinishReason(str, Enum):
  class BestOfSequence (line 393) | class BestOfSequence(BaseModel):
  class Details (line 411) | class Details(BaseModel):
  class Response (line 429) | class Response(BaseModel):
  class StreamDetails (line 437) | class StreamDetails(BaseModel):
  class StreamResponse (line 447) | class StreamResponse(BaseModel):
  class DeployedModel (line 461) | class DeployedModel(BaseModel):

FILE: integration-tests/conftest.py
  class SessionTimeoutFix (line 19) | class SessionTimeoutFix(requests.Session):
    method request (line 20) | def request(self, *args, **kwargs):
  function pytest_addoption (line 68) | def pytest_addoption(parser):
  function pytest_configure (line 86) | def pytest_configure(config):
  function pytest_collection_modifyitems (line 91) | def pytest_collection_modifyitems(config, items):
  function container_log (line 139) | def container_log(request: SubRequest):
  class ResponseComparator (line 151) | class ResponseComparator(JSONSnapshotExtension):
    method _serialize (line 155) | def _serialize(
    method serialize (line 181) | def serialize(
    method matches (line 201) | def matches(
  class GenerousResponseComparator (line 385) | class GenerousResponseComparator(ResponseComparator):
  class IgnoreLogProbResponseComparator (line 390) | class IgnoreLogProbResponseComparator(ResponseComparator):
  class LauncherHandle (line 394) | class LauncherHandle:
    method __init__ (line 395) | def __init__(self, port: int, error_log):
    method _inner_health (line 400) | def _inner_health(self):
    method health (line 403) | async def health(self, timeout: int = 60):
  class ContainerLauncherHandle (line 421) | class ContainerLauncherHandle(LauncherHandle):
    method __init__ (line 422) | def __init__(self, docker_client, container_name, port: int, error_log):
    method _inner_health (line 427) | def _inner_health(self) -> bool:
  class ProcessLauncherHandle (line 432) | class ProcessLauncherHandle(LauncherHandle):
    method __init__ (line 433) | def __init__(self, process, port: int, error_log):
    method _inner_health (line 437) | def _inner_health(self) -> bool:
  function response_snapshot (line 442) | def response_snapshot(snapshot):
  function generous_response_snapshot (line 447) | def generous_response_snapshot(snapshot):
  function ignore_logprob_response_snapshot (line 452) | def ignore_logprob_response_snapshot(snapshot):
  function error_log (line 457) | def error_log():
  function launcher (line 463) | async def launcher(error_log):
  function generate_load (line 734) | def generate_load():
  function generate_multi (line 762) | def generate_multi():
  function chicken (line 797) | def chicken():
  function cow_beach (line 806) | def cow_beach():

FILE: integration-tests/fixtures/gaudi/service.py
  function stream_container_logs (line 58) | def stream_container_logs(container, test_name):
  class TestClient (line 72) | class TestClient(AsyncInferenceClient):
    method __init__ (line 73) | def __init__(self, service_name: str, base_url: str):
  class LauncherHandle (line 78) | class LauncherHandle:
    method __init__ (line 79) | def __init__(self, service_name: str, port: int):
    method _inner_health (line 82) | def _inner_health(self):
    method health (line 85) | async def health(self, timeout: int = 60):
  class ContainerLauncherHandle (line 118) | class ContainerLauncherHandle(LauncherHandle):
    method __init__ (line 119) | def __init__(self, docker_client, container_name, port: int):
    method _inner_health (line 125) | def _inner_health(self) -> bool:
  class ProcessLauncherHandle (line 140) | class ProcessLauncherHandle(LauncherHandle):
    method __init__ (line 141) | def __init__(self, process, port: int):
    method _inner_health (line 146) | def _inner_health(self) -> bool:
  function data_volume (line 151) | def data_volume():
  function gaudi_launcher (line 162) | def gaudi_launcher():
  function gaudi_generate_load (line 292) | def gaudi_generate_load():

FILE: integration-tests/fixtures/neuron/export_models.py
  function get_neuron_backend_hash (line 79) | def get_neuron_backend_hash():
  function get_neuron_model_name (line 104) | def get_neuron_model_name(config_name: str):
  function get_tgi_docker_image (line 108) | def get_tgi_docker_image():
  function maybe_export_model (line 121) | def maybe_export_model(config_name, model_config):
  function maybe_export_models (line 218) | def maybe_export_models():
  function neuron_model_config (line 224) | def neuron_model_config(request):
  function neuron_model_path (line 269) | def neuron_model_path(neuron_model_config):

FILE: integration-tests/fixtures/neuron/service.py
  function get_tgi_docker_image (line 24) | def get_tgi_docker_image():
  class TestClient (line 45) | class TestClient(AsyncInferenceClient):
    method __init__ (line 46) | def __init__(self, service_name: str, base_url: str):
  class LauncherHandle (line 51) | class LauncherHandle:
    method __init__ (line 52) | def __init__(self, service_name: str, port: int):
    method _inner_health (line 55) | def _inner_health(self):
    method health (line 58) | async def health(self, timeout: int = 60):
  class ContainerLauncherHandle (line 75) | class ContainerLauncherHandle(LauncherHandle):
    method __init__ (line 76) | def __init__(self, service_name, docker_client, container_name, port: ...
    method _inner_health (line 82) | def _inner_health(self) -> bool:
  function event_loop (line 92) | def event_loop():
  function neuron_launcher (line 99) | def neuron_launcher(event_loop):
  function neuron_generate_load (line 239) | def neuron_generate_load():

FILE: integration-tests/gaudi/capture_expected_outputs.py
  function test_config (line 17) | def test_config(request) -> Dict[str, Any]:
  function test_name (line 25) | def test_name(test_config):
  function tgi_service (line 30) | def tgi_service(launcher, test_config, test_name) -> Generator:
  function test_capture_expected_outputs (line 37) | async def test_capture_expected_outputs(tgi_service, test_config, test_n...

FILE: integration-tests/gaudi/test_gaudi_generate.py
  function pytest_configure (line 7) | def pytest_configure(config):
  function pytest_generate_tests (line 193) | def pytest_generate_tests(metafunc):
  function test_config (line 208) | def test_config(request: SubRequest) -> Dict[str, Any]:
  function model_id (line 217) | def model_id(test_config: Dict[str, Any]) -> Generator[str, None, None]:
  function test_name (line 222) | def test_name(test_config: Dict[str, Any]) -> Generator[str, None, None]:
  function expected_outputs (line 227) | def expected_outputs(test_config: Dict[str, Any]) -> Dict[str, str]:
  function input (line 235) | def input(test_config: Dict[str, Any]) -> str:
  function tgi_service (line 240) | def tgi_service(
  function tgi_client (line 253) | async def tgi_client(tgi_service) -> AsyncInferenceClient:
  function test_model_single_request (line 260) | async def test_model_single_request(
  function test_model_multiple_requests (line 276) | async def test_model_multiple_requests(

FILE: integration-tests/models/test_bloom_560m.py
  function bloom_560_handle (line 5) | def bloom_560_handle(launcher):
  function bloom_560 (line 11) | async def bloom_560(bloom_560_handle):
  function test_bloom_560m (line 18) | async def test_bloom_560m(bloom_560, response_snapshot):
  function test_bloom_560m_all_params (line 33) | async def test_bloom_560m_all_params(bloom_560, response_snapshot):
  function test_bloom_560m_load (line 56) | async def test_bloom_560m_load(bloom_560, generate_load, response_snapsh...

FILE: integration-tests/models/test_bloom_560m_sharded.py
  function bloom_560m_sharded_handle (line 5) | def bloom_560m_sharded_handle(launcher):
  function bloom_560m_sharded (line 11) | async def bloom_560m_sharded(bloom_560m_sharded_handle):
  function test_bloom_560m_sharded (line 18) | async def test_bloom_560m_sharded(bloom_560m_sharded, response_snapshot):
  function test_bloom_560m_sharded_load (line 33) | async def test_bloom_560m_sharded_load(

FILE: integration-tests/models/test_chat_llama.py
  function flash_llama_chat_handle (line 5) | def flash_llama_chat_handle(launcher):
  function flash_llama_chat (line 13) | async def flash_llama_chat(flash_llama_chat_handle):
  function test_flash_llama_simple (line 19) | async def test_flash_llama_simple(flash_llama_chat, response_snapshot):

FILE: integration-tests/models/test_chat_stream_options.py
  function chat_handle (line 5) | def chat_handle(launcher):
  function chat_client (line 13) | async def chat_client(chat_handle):

FILE: integration-tests/models/test_completion_prompts.py
  function flash_llama_completion_handle (line 8) | def flash_llama_completion_handle(launcher):
  function flash_llama_completion (line 16) | async def flash_llama_completion(flash_llama_completion_handle):
  function test_flash_llama_completion_single_prompt (line 26) | def test_flash_llama_completion_single_prompt(
  function test_flash_llama_completion_stream_usage (line 50) | async def test_flash_llama_completion_stream_usage(
  function test_flash_llama_completion_many_prompts (line 118) | def test_flash_llama_completion_many_prompts(flash_llama_completion, res...
  function test_flash_llama_completion_many_prompts_stream (line 154) | async def test_flash_llama_completion_many_prompts_stream(
  function test_chat_openai_usage (line 190) | async def test_chat_openai_usage(flash_llama_completion, response_snapsh...
  function test_chat_openai_nousage (line 214) | async def test_chat_openai_nousage(flash_llama_completion, response_snap...
  function test_chat_hfhub_usage (line 235) | async def test_chat_hfhub_usage(flash_llama_completion, response_snapshot):
  function test_chat_hfhub_nousage (line 259) | async def test_chat_hfhub_nousage(flash_llama_completion, response_snaps...

FILE: integration-tests/models/test_compressed_tensors_w8a8_int.py
  function compressed_tensors_w8a8_int_handle (line 5) | def compressed_tensors_w8a8_int_handle(launcher):
  function compressed_tensors_w8a8_int (line 15) | async def compressed_tensors_w8a8_int(compressed_tensors_w8a8_int_handle):
  function test_compressed_tensors_w8a8_int (line 23) | async def test_compressed_tensors_w8a8_int(
  function test_compressed_tensors_w8a8_int_all_params (line 43) | async def test_compressed_tensors_w8a8_int_all_params(
  function test_compressed_tensors_w8a8_int_load (line 73) | async def test_compressed_tensors_w8a8_int_load(

FILE: integration-tests/models/test_compressed_tensors_w8a8_int_dynamic_weight.py
  function compressed_tensors_w8a8_int_dynamic_weight_handle (line 5) | def compressed_tensors_w8a8_int_dynamic_weight_handle(launcher):
  function compressed_tensors_w8a8_int_dynamic_weight (line 15) | async def compressed_tensors_w8a8_int_dynamic_weight(
  function test_compressed_tensors_w8a8_int_dynamic_weight (line 25) | async def test_compressed_tensors_w8a8_int_dynamic_weight(
  function test_compressed_tensors_w8a8_int_dynamic_weight_all_params (line 46) | async def test_compressed_tensors_w8a8_int_dynamic_weight_all_params(
  function test_compressed_tensors_w8a8_int_dynamic_weight_load (line 76) | async def test_compressed_tensors_w8a8_int_dynamic_weight_load(

FILE: integration-tests/models/test_compressed_tensors_w8an_fp.py
  function compressed_tensors_w8an_handle (line 5) | def compressed_tensors_w8an_handle(launcher):
  function compressed_tensors_w8an (line 15) | async def compressed_tensors_w8an(compressed_tensors_w8an_handle):
  function test_compressed_tensors_w8an (line 23) | async def test_compressed_tensors_w8an(compressed_tensors_w8an, response...
  function test_compressed_tensors_w8an_all_params (line 39) | async def test_compressed_tensors_w8an_all_params(
  function test_compressed_tensors_w8an_load (line 69) | async def test_compressed_tensors_w8an_load(

FILE: integration-tests/models/test_compressed_tensors_wna16_int.py
  function compressed_tensors_wna16_handle (line 5) | def compressed_tensors_wna16_handle(launcher):
  function compressed_tensors_wna16 (line 15) | async def compressed_tensors_wna16(compressed_tensors_wna16_handle):
  function test_compressed_tensors_wna16 (line 23) | async def test_compressed_tensors_wna16(compressed_tensors_wna16, respon...
  function test_compressed_tensors_wna16_all_params (line 39) | async def test_compressed_tensors_wna16_all_params(
  function test_compressed_tensors_wna16_load (line 69) | async def test_compressed_tensors_wna16_load(

FILE: integration-tests/models/test_compressed_tensors_wna16_int_24.py
  function compressed_tensors_wna16_int_24_handle (line 5) | def compressed_tensors_wna16_int_24_handle(launcher):
  function compressed_tensors_wna16_int_24 (line 15) | async def compressed_tensors_wna16_int_24(compressed_tensors_wna16_int_2...
  function test_compressed_tensors_wna16_int_24 (line 23) | async def test_compressed_tensors_wna16_int_24(
  function test_compressed_tensors_wna16_int_24_all_params (line 43) | async def test_compressed_tensors_wna16_int_24_all_params(
  function test_compressed_tensors_wna16_int_24_load (line 73) | async def test_compressed_tensors_wna16_int_24_load(

FILE: integration-tests/models/test_continue_final_message.py
  function llama_continue_final_message_handle (line 6) | def llama_continue_final_message_handle(launcher):
  function llama_continue_final_message (line 12) | async def llama_continue_final_message(llama_continue_final_message_hand...
  function test_llama_completion_single_prompt (line 17) | def test_llama_completion_single_prompt(
  function test_llama_completion_single_prompt_continue (line 46) | def test_llama_completion_single_prompt_continue(

FILE: integration-tests/models/test_flash_awq.py
  function flash_llama_awq_handle (line 5) | def flash_llama_awq_handle(launcher):
  function flash_llama_awq (line 15) | async def flash_llama_awq(flash_llama_awq_handle):
  function test_flash_llama_awq (line 22) | async def test_flash_llama_awq(flash_llama_awq, response_snapshot):
  function test_flash_llama_awq_all_params (line 37) | async def test_flash_llama_awq_all_params(flash_llama_awq, response_snap...
  function test_flash_llama_awq_load (line 59) | async def test_flash_llama_awq_load(flash_llama_awq, generate_load, resp...

FILE: integration-tests/models/test_flash_awq_sharded.py
  function flash_llama_awq_handle_sharded (line 5) | def flash_llama_awq_handle_sharded(launcher):
  function flash_llama_awq_sharded (line 15) | async def flash_llama_awq_sharded(flash_llama_awq_handle_sharded):
  function test_flash_llama_awq_sharded (line 22) | async def test_flash_llama_awq_sharded(flash_llama_awq_sharded, response...
  function test_flash_llama_awq_load_sharded (line 37) | async def test_flash_llama_awq_load_sharded(

FILE: integration-tests/models/test_flash_deepseek_v2.py
  function flash_deepseek_v2_handle (line 5) | def flash_deepseek_v2_handle(launcher):
  function flash_deepseek_v2 (line 11) | async def flash_deepseek_v2(flash_deepseek_v2_handle):
  function test_flash_deepseek_v2 (line 19) | async def test_flash_deepseek_v2(flash_deepseek_v2, response_snapshot):
  function test_flash_deepseek_v2_all_params (line 30) | async def test_flash_deepseek_v2_all_params(flash_deepseek_v2, response_...
  function test_flash_deepseek_v2_load (line 53) | async def test_flash_deepseek_v2_load(

FILE: integration-tests/models/test_flash_falcon.py
  function flash_falcon_handle (line 5) | def flash_falcon_handle(launcher):
  function flash_falcon (line 11) | async def flash_falcon(flash_falcon_handle):
  function test_flash_falcon (line 19) | async def test_flash_falcon(flash_falcon, response_snapshot):
  function test_flash_falcon_all_params (line 33) | async def test_flash_falcon_all_params(flash_falcon, response_snapshot):
  function test_flash_falcon_load (line 57) | async def test_flash_falcon_load(flash_falcon, generate_load, response_s...

FILE: integration-tests/models/test_flash_gemma.py
  function flash_gemma_handle (line 5) | def flash_gemma_handle(launcher):
  function flash_gemma (line 11) | async def flash_gemma(flash_gemma_handle):
  function test_flash_gemma_simple (line 19) | async def test_flash_gemma_simple(flash_gemma, response_snapshot):
  function test_flash_gemma_all_params (line 31) | async def test_flash_gemma_all_params(flash_gemma, response_snapshot):
  function test_flash_gemma_load (line 55) | async def test_flash_gemma_load(flash_gemma, generate_load, response_sna...

FILE: integration-tests/models/test_flash_gemma2.py
  function flash_gemma2_handle (line 5) | def flash_gemma2_handle(launcher):
  function flash_gemma2 (line 11) | async def flash_gemma2(flash_gemma2_handle):
  function test_flash_gemma2 (line 19) | async def test_flash_gemma2(flash_gemma2, response_snapshot):
  function test_flash_gemma2_load (line 34) | async def test_flash_gemma2_load(flash_gemma2, generate_load, response_s...

FILE: integration-tests/models/test_flash_gemma3.py
  function flash_gemma3_handle (line 9) | def flash_gemma3_handle(launcher):
  function flash_gemma3 (line 15) | async def flash_gemma3(flash_gemma3_handle):
  function test_flash_gemma3 (line 20) | async def test_flash_gemma3(flash_gemma3, response_snapshot):
  function test_flash_gemma3_image_cow_dog (line 35) | async def test_flash_gemma3_image_cow_dog(flash_gemma3, response_snapshot):
  function test_flash_gemma3_image_cow (line 62) | async def test_flash_gemma3_image_cow(flash_gemma3, response_snapshot):
  function test_exceed_window (line 85) | async def test_exceed_window(flash_gemma3, response_snapshot):
  function image_to_data_url (line 101) | def image_to_data_url(img: Image.Image, fmt: str) -> str:
  function test_flash_gemma3_image_base64_rgba (line 110) | async def test_flash_gemma3_image_base64_rgba(flash_gemma3, response_sna...
  function test_flash_gemma3_image_base64_rgb_png (line 133) | async def test_flash_gemma3_image_base64_rgb_png(flash_gemma3, response_...
  function test_flash_gemma3_image_base64_rgb_jpg (line 153) | async def test_flash_gemma3_image_base64_rgb_jpg(flash_gemma3, response_...

FILE: integration-tests/models/test_flash_gemma_gptq.py
  function flash_gemma_gptq_handle (line 5) | def flash_gemma_gptq_handle(launcher):
  function flash_gemma_gptq (line 11) | async def flash_gemma_gptq(flash_gemma_gptq_handle):
  function test_flash_gemma_gptq (line 19) | async def test_flash_gemma_gptq(flash_gemma_gptq, ignore_logprob_respons...
  function test_flash_gemma_gptq_all_params (line 31) | async def test_flash_gemma_gptq_all_params(
  function test_flash_gemma_gptq_load (line 57) | async def test_flash_gemma_gptq_load(

FILE: integration-tests/models/test_flash_gpt2.py
  function flash_gpt2_handle (line 5) | def flash_gpt2_handle(launcher):
  function flash_gpt2 (line 11) | a
Condensed preview: 864 files, each showing path, character count, and a content snippet (full structured content: 7,009K chars).
[
  {
    "path": ".dockerignore",
    "chars": 106,
    "preview": "aml\ntarget\nserver/transformers\nserver/flash-attention\ncmake-build-debug/\ncmake-build-release/\nDockerfile*\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug-report.yml",
    "chars": 2375,
    "preview": "name: \"\\U0001F41B Bug Report\"\ndescription: Submit a bug report to help us improve text-generation-inference\nbody:\n  - ty"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/config.yml",
    "chars": 40,
    "preview": "blank_issues_enabled: true\nversion: 2.1\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature-request.yml",
    "chars": 1118,
    "preview": "name: \"\\U0001F680 Feature request\"\ndescription: Submit a proposal/request for a new text-generation-inference feature\nla"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/new-model-addition.yml",
    "chars": 1077,
    "preview": "name: \"\\U0001F31F New model addition\"\ndescription: Submit a proposal/request to implement a new model\nlabels: [ \"New mod"
  },
  {
    "path": ".github/PULL_REQUEST_TEMPLATE.md",
    "chars": 2029,
    "preview": "# What does this PR do?\n\n<!--\nCongratulations! You've made it this far! You're not quite done yet though.\n\nOnce merged, "
  },
  {
    "path": ".github/workflows/autodocs.yaml",
    "chars": 980,
    "preview": "name: Automatic Documentation for Launcher\n\non:\n  pull_request:\n\njobs:\n  update_docs:\n    runs-on: ubuntu-latest\n\n    st"
  },
  {
    "path": ".github/workflows/build.yaml",
    "chars": 14587,
    "preview": "name: Build and push docker image to internal registry\n\non:\n  workflow_call:\n    inputs:\n      hardware:\n        type: s"
  },
  {
    "path": ".github/workflows/build_documentation.yaml",
    "chars": 437,
    "preview": "name: Build documentation\n\non:\n  push:\n    paths:\n      - \"docs/source/**\"\n    branches:\n      - main\n      - doc-builde"
  },
  {
    "path": ".github/workflows/build_pr_documentation.yaml",
    "chars": 497,
    "preview": "name: Build PR Documentation\n\non:\n  pull_request:\n    paths:\n      - \"docs/source/**\"\n\nconcurrency:\n  group: ${{ github."
  },
  {
    "path": ".github/workflows/ci_build.yaml",
    "chars": 1265,
    "preview": "name: CI build\n\non:\n  push:\n    branches:\n      - 'main'\n    tags:\n      - 'v*'\n  pull_request:\n    paths:\n      - \".git"
  },
  {
    "path": ".github/workflows/client-tests.yaml",
    "chars": 585,
    "preview": "name: Python Client Tests\n\non:\n  pull_request:\n    paths:\n      - \".github/workflows/client-tests.yaml\"\n      - \"clients"
  },
  {
    "path": ".github/workflows/codeql.yml",
    "chars": 660,
    "preview": "---\nname: CodeQL Security Analysis For Github Actions\n\non:\n  push:\n    branches: [\"main\"]\n  workflow_dispatch:\n  # pull_"
  },
  {
    "path": ".github/workflows/integration_tests.yaml",
    "chars": 1195,
    "preview": "name: Integration tests\n\non:\n  workflow_call:\n    inputs:\n      docker_image:\n        type: string\n        description: "
  },
  {
    "path": ".github/workflows/load_test.yaml",
    "chars": 1357,
    "preview": "name: Nightly load test\n\non:\n  schedule:\n    - cron: '0 0 * * 1-5'\n  workflow_call:\n  workflow_dispatch:\n\n  pull_request"
  },
  {
    "path": ".github/workflows/nix_build.yaml",
    "chars": 1780,
    "preview": "name: \"Nix Build Docker image\"\non:\n  pull_request:\n  push:\n    branches:\n      - 'main'\n    tags:\n      - 'v*'\nconcurren"
  },
  {
    "path": ".github/workflows/nix_cache.yaml",
    "chars": 1037,
    "preview": "name: \"Cache devshells\"\non:\n  pull_request:\n    paths:\n      - \"flake.nix\"\n      - \"flake.lock\"\n      - \"nix/**\"\nconcurr"
  },
  {
    "path": ".github/workflows/nix_tests.yaml",
    "chars": 1259,
    "preview": "name: \"Nix Tests\"\non:\n  pull_request:\n    paths:\n      - \".github/workflows/nix_tests.yaml\"\n      - \"server/**\"\n      - "
  },
  {
    "path": ".github/workflows/stale.yaml",
    "chars": 406,
    "preview": "name: 'Close stale issues and PRs'\non:\n  schedule:\n    - cron: '30 1 * * *'\n\njobs:\n  stale:\n    runs-on: ubuntu-latest\n "
  },
  {
    "path": ".github/workflows/tests.yaml",
    "chars": 1869,
    "preview": "name: Server Tests\n\non:\n  pull_request:\n    paths:\n      - \".github/workflows/tests.yaml\"\n      - \"server/**\"\n      - \"p"
  },
  {
    "path": ".github/workflows/trufflehog.yaml",
    "chars": 536,
    "preview": "on:\n  push:\n\nname: Secret Leaks\n\npermissions:\n  contents: read\n\njobs:\n  trufflehog:\n    runs-on: ubuntu-latest\n    steps"
  },
  {
    "path": ".github/workflows/upload_pr_documentation.yaml",
    "chars": 399,
    "preview": "name: Upload PR Documentation\n\non:\n  workflow_run:\n    workflows: [\"Build PR Documentation\"]\n    types:\n      - complete"
  },
  {
    "path": ".gitignore",
    "chars": 554,
    "preview": ".idea\ntarget\nrouter/tokenizer.json\n*__pycache__*\n\nbackends/v2/src/client/pb\nbackends/v3/src/client/pb\nbackends/client/sr"
  },
  {
    "path": ".pre-commit-config.yaml",
    "chars": 629,
    "preview": "repos:\n-   repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: v4.5.0\n    hooks:\n    -   id: check-yaml\n    - "
  },
  {
    "path": ".redocly.lint-ignore.yaml",
    "chars": 4210,
    "preview": "# This file instructs Redocly's linter to ignore the rules contained for specific parts of your API.\n# See https://redoc"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "chars": 5489,
    "preview": "\n# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nWe as members, contributors, and leaders pledge to make particip"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 5615,
    "preview": "<!---\nCopyright 2024 The HuggingFace Team. All rights reserved.\n\nLicensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "Cargo.toml",
    "chars": 1212,
    "preview": "[workspace]\nmembers = [\n    \"benchmark\",\n    \"backends/v2\",\n    \"backends/v3\",\n    \"backends/grpc-metadata\",\n    \"backen"
  },
  {
    "path": "Dockerfile",
    "chars": 9117,
    "preview": "# Rust builder\nFROM lukemathwalker/cargo-chef:latest-rust-1.85.1 AS chef\nWORKDIR /usr/src\n\nARG CARGO_REGISTRIES_CRATES_I"
  },
  {
    "path": "Dockerfile.neuron",
    "chars": 5097,
    "preview": "# Fetch and extract the TGI sources\nFROM alpine AS tgi\nRUN mkdir -p /tgi\n\n# Fetch the optimum-neuron sources directly to"
  },
  {
    "path": "Dockerfile.nix",
    "chars": 677,
    "preview": "# Build the image and get out the docker file:\n#\n# docker build -t tgi-nix-builder -f Dockerfile.nix\n# docker run --log-"
  },
  {
    "path": "Dockerfile_amd",
    "chars": 11261,
    "preview": "# Rust builder\nFROM lukemathwalker/cargo-chef:latest-rust-1.85.1 AS chef\nWORKDIR /usr/src\n\nARG CARGO_REGISTRIES_CRATES_I"
  },
  {
    "path": "Dockerfile_gaudi",
    "chars": 3942,
    "preview": "# Those arguments are required to build the image\nARG HABANA_VERSION=1.21.0\nARG PYTORCH_VERSION=2.6.0\n\n# Rust builder\nFR"
  },
  {
    "path": "Dockerfile_intel",
    "chars": 8579,
    "preview": "ARG PLATFORM=xpu\n\nFROM lukemathwalker/cargo-chef:latest-rust-1.85.1 AS chef\nWORKDIR /usr/src\n\nARG CARGO_REGISTRIES_CRATE"
  },
  {
    "path": "Dockerfile_llamacpp",
    "chars": 2568,
    "preview": "FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04 AS deps\n\nARG llamacpp_version=b4827\nARG llamacpp_cuda=OFF\nARG llamacpp_n"
  },
  {
    "path": "Dockerfile_trtllm",
    "chars": 5674,
    "preview": "ARG cuda_arch_list=\"75-real;80-real;86-real;89-real;90-real;100-real;120-real\"\nARG cuda_base=12.8.0\nARG build_type=relea"
  },
  {
    "path": "LICENSE",
    "chars": 11342,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "Makefile",
    "chars": 1401,
    "preview": "install-server:\n\tcd server && make install\n\ninstall-server-cpu:\n\tcd server && make install-server\n\ninstall-router:\n\tcarg"
  },
  {
    "path": "README.md",
    "chars": 14691,
    "preview": "> [!CAUTION]\n> text-generation-inference is now in maintenance mode. Going forward, we will accept pull requests for min"
  },
  {
    "path": "assets/tgi_grafana.json",
    "chars": 100690,
    "preview": "{\n  \"__inputs\": [\n    {\n      \"name\": \"DS_PROMETHEUS_EKS API INFERENCE PROD\",\n      \"label\": \"Prometheus EKS API Inferen"
  },
  {
    "path": "backends/client/Cargo.toml",
    "chars": 475,
    "preview": "[package]\nname = \"text-generation-client\"\nversion.workspace = true\nedition.workspace = true\nauthors.workspace = true\nhom"
  },
  {
    "path": "backends/client/build.rs",
    "chars": 1394,
    "preview": "use std::fs;\n\nfn main() -> Result<(), Box<dyn std::error::Error>> {\n    println!(\"cargo:rerun-if-changed=../../proto/\");"
  },
  {
    "path": "backends/client/src/lib.rs",
    "chars": 3283,
    "preview": "//! Text Generation gRPC client library\n\npub mod v2;\npub mod v3;\n\nuse async_trait::async_trait;\nuse base64::{engine::gen"
  },
  {
    "path": "backends/client/src/v2/client.rs",
    "chars": 8772,
    "preview": "/// Single shard Client\nuse crate::v2::pb;\nuse crate::{ClientError, Result};\n\nuse crate::WARMUP_IMAGE_BASE64;\nuse grpc_m"
  },
  {
    "path": "backends/client/src/v2/mod.rs",
    "chars": 394,
    "preview": "#[allow(clippy::derive_partial_eq_without_eq)]\nmod pb;\n\nmod client;\nmod sharded_client;\n\npub use client::Client;\npub use"
  },
  {
    "path": "backends/client/src/v2/sharded_client.rs",
    "chars": 8549,
    "preview": "/// Multi shard Client\nuse crate::{v2, Health, ShardInfo};\nuse crate::{ClientError, Result};\n\nuse crate::v2::InfoRespons"
  },
  {
    "path": "backends/client/src/v3/client.rs",
    "chars": 10510,
    "preview": "use crate::v3::{pb, Chunk};\nuse crate::{ClientError, Result, WARMUP_IMAGE_BASE64};\n/// Single shard Client\nuse base64::e"
  },
  {
    "path": "backends/client/src/v3/mod.rs",
    "chars": 418,
    "preview": "#[allow(clippy::derive_partial_eq_without_eq)]\nmod pb;\n\nmod client;\nmod sharded_client;\n\npub use client::Client;\npub use"
  },
  {
    "path": "backends/client/src/v3/sharded_client.rs",
    "chars": 9309,
    "preview": "/// Multi shard Client\nuse crate::{v3, Health, ShardInfo};\nuse crate::{ClientError, Result};\n\nuse crate::v3::{Chunk, Inf"
  },
  {
    "path": "backends/gaudi/Makefile",
    "chars": 2590,
    "preview": "mkfile_path := $(abspath $(lastword $(MAKEFILE_LIST)))\nmkfile_dir := $(dir $(mkfile_path))\nroot_dir := ${mkfile_dir}/../"
  },
  {
    "path": "backends/gaudi/README.md",
    "chars": 4168,
    "preview": "# Text-generation-inference - Gaudi backend\n\n## Description\n\nThis is the TGI backend for Intel Gaudi. This backend is co"
  },
  {
    "path": "backends/gaudi/examples/docker_commands/docker_commands.md",
    "chars": 3701,
    "preview": "# Examples of Docker Commands for Gaudi Backend\n\nThis page gives a list of examples of docker run commands for some of t"
  },
  {
    "path": "backends/gaudi/server/.gitignore",
    "chars": 2865,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\ntext_generation_server/__pycache__/\ntext_generation_server/pb/__pyc"
  },
  {
    "path": "backends/gaudi/server/Makefile",
    "chars": 1218,
    "preview": "include Makefile-flash-att\ninclude Makefile-flash-att-v2\ninclude Makefile-vllm\ninclude Makefile-awq\ninclude Makefile-eet"
  },
  {
    "path": "backends/gaudi/server/Makefile-awq",
    "chars": 463,
    "preview": "# Fork that adds only the correct stream to this kernel in order\n# to make cuda graphs work.\nawq_commit := bd1dc2d525434"
  },
  {
    "path": "backends/gaudi/server/Makefile-eetq",
    "chars": 370,
    "preview": "eetq_commit := 1657b1504faa359e2ce0ac02999439d7ac8c74c0\n\neetq:\n    # Clone eetq\n\tpip install packaging\n\tgit clone https:"
  },
  {
    "path": "backends/gaudi/server/Makefile-fbgemm",
    "chars": 749,
    "preview": "fbgemm_commit := v0.8.0\n\nbuild-fbgemm:\n\t@if [ ! -d \"fbgemm\" ]; then \\\n\t\tgit clone https://github.com/pytorch/FBGEMM.git "
  },
  {
    "path": "backends/gaudi/server/Makefile-flash-att",
    "chars": 679,
    "preview": "flash_att_commit := 3a9bfd076f98746c73362328958dbc68d145fbec\n\nbuild-flash-attention:\n\tif [ ! -d 'flash-attention' ]; the"
  },
  {
    "path": "backends/gaudi/server/Makefile-flash-att-v2",
    "chars": 924,
    "preview": "flash_att_v2_commit_cuda := v2.6.1\nflash_att_v2_commit_rocm := 2092111b9f975b3347c652ff7fabd431130256c4\n\nbuild-flash-att"
  },
  {
    "path": "backends/gaudi/server/Makefile-selective-scan",
    "chars": 937,
    "preview": "selective_scan_commit := 2a3704fd47ba817b415627b06fd796b971fdc137\n\ncausal-conv1d:\n\trm -rf causal-conv1d\n\tgit clone https"
  },
  {
    "path": "backends/gaudi/server/Makefile-vllm",
    "chars": 896,
    "preview": "commit_cuda := d243e9dc7e2c9c2e36a4150ec8e64809cb55c01b\ncommit_rocm := 4e0929e6e4fa0a3d09d358715c288020ea9dc247\nbuild-vl"
  },
  {
    "path": "backends/gaudi/server/README.md",
    "chars": 173,
    "preview": "# Text Generation Inference Python gRPC Server\n\nA Python gRPC server for Text Generation Inference\n\n## Install\n\n```shell"
  },
  {
    "path": "backends/gaudi/server/dill-0.3.7-patch.sh",
    "chars": 3719,
    "preview": "#!/bin/bash\ngit clone -b dill-0.3.7 https://github.com/uqfoundation/dill.git\npushd dill\ncat <<EOF > dill-0.3.7.patch\ndif"
  },
  {
    "path": "backends/gaudi/server/dill-0.3.8-patch.sh",
    "chars": 3714,
    "preview": "#!/bin/bash\ngit clone -b 0.3.8 https://github.com/uqfoundation/dill.git\npushd dill\ncat <<EOF > dill-0.3.8.patch\ndiff --g"
  },
  {
    "path": "backends/gaudi/server/pyproject.toml",
    "chars": 1164,
    "preview": "[tool.poetry]\nname = \"text-generation-server\"\nversion = \"2.0.4\"\ndescription = \"Text Generation Inference Python gRPC Ser"
  },
  {
    "path": "backends/gaudi/server/requirements.txt",
    "chars": 6630,
    "preview": "accelerate==1.7.0 ; python_version >= \"3.9\" and python_version < \"3.13\"\nannotated-types==0.7.0 ; python_version >= \"3.9\""
  },
  {
    "path": "backends/gaudi/server/text_generation_server/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "backends/gaudi/server/text_generation_server/adapters/__init__.py",
    "chars": 331,
    "preview": "# Origin:   https://github.com/predibase/lorax\n# Path:     lorax/server/lorax_server/adapters/__init__.py\n# License:  Ap"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/adapters/config.py",
    "chars": 727,
    "preview": "# Origin:   https://github.com/predibase/lorax\n# Path:     lorax/server/lorax_server/adapters/config.py\n# License:  Apac"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/adapters/lora.py",
    "chars": 15841,
    "preview": "# Origin:   https://github.com/predibase/lorax\n# Path:     lorax/server/lorax_server/adapters/lora.py\n# License:  Apache"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/adapters/weights.py",
    "chars": 4244,
    "preview": "# Origin:   https://github.com/predibase/lorax\n# Path:     lorax/server/lorax_server/adapters/weights.py\n# License:  Apa"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/cache.py",
    "chars": 796,
    "preview": "import torch\n\nfrom typing import Dict, Optional, TypeVar\n\nfrom text_generation_server.models.types import Batch\n\nB = Typ"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/cli.py",
    "chars": 12453,
    "preview": "import os\nimport sys\nimport typer\n\nfrom pathlib import Path\nfrom loguru import logger\nfrom typing import Optional\nfrom e"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/interceptor.py",
    "chars": 1417,
    "preview": "# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company.\n\nimport torch\nimport grpc\n\nfrom google.rpc import status_pb2, c"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/__init__.py",
    "chars": 992,
    "preview": "from text_generation_server.layers.tensor_parallel import (\n    TensorParallelColumnLinear,\n    TensorParallelRowLinear,"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/attention/__init__.py",
    "chars": 737,
    "preview": "from .common import (\n    Seqlen,\n    HPUPagedAttentionMetadata,\n    trim_attn_metadata,\n    trim_seqlen_metadata,\n    _"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/attention/common.py",
    "chars": 6511,
    "preview": "from dataclasses import dataclass\nimport torch\nfrom typing import Optional, List, Dict\nimport collections\nimport torch.n"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/attention/hpu.py",
    "chars": 7736,
    "preview": "import torch\nfrom text_generation_server.layers.attention import Seqlen, HPUPagedAttentionMetadata\nfrom typing import Op"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/attention/kv_cache.py",
    "chars": 5565,
    "preview": "from typing import Tuple\nfrom dataclasses import dataclass, field\n\nimport torch\n\nfrom text_generation_server.models.glob"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/awq/conversion_utils.py",
    "chars": 3330,
    "preview": "import torch\nfrom typing import List\n\n\nAWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]\nREVERSE_AWQ_PACK_ORDER = [0, 4, 1, 5, 2"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/awq/quantize/__init__.py",
    "chars": 50,
    "preview": "from .hpu import WQLinear\n\n__all__ = [\"WQLinear\"]\n"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/awq/quantize/hpu.py",
    "chars": 4170,
    "preview": "from typing import Optional\nimport torch\nimport torch.nn as nn\n\ntry:\n    import habana_frameworks.torch.hpu  # noqa: F40"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/bnb.py",
    "chars": 4040,
    "preview": "from dataclasses import dataclass\n\nimport bitsandbytes as bnb\nimport torch\nfrom bitsandbytes.nn import Int8Params, Param"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/compressed_tensors/__init__.py",
    "chars": 83,
    "preview": "from .loader import CompressedTensorsLoader\n\n__all__ = [\"CompressedTensorsLoader\"]\n"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/compressed_tensors/loader.py",
    "chars": 6367,
    "preview": "from typing import Any, Dict, List, Union\n\nfrom compressed_tensors import QuantizationConfig, QuantizationStatus\nfrom co"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/compressed_tensors/w8an_fp.py",
    "chars": 8999,
    "preview": "from typing import List, Optional, Union\n\nimport torch\nfrom compressed_tensors.quantization import QuantizationArgs, Qua"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/conv.py",
    "chars": 1117,
    "preview": "from accelerate import init_empty_weights\nimport torch\n\n\n@classmethod\ndef load_conv2d(cls, prefix, weights, in_channels,"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/exl2.py",
    "chars": 2457,
    "preview": "from dataclasses import dataclass\nfrom typing import List, Union\n\nimport torch\nfrom text_generation_server.utils.weights"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/fp8.py",
    "chars": 22436,
    "preview": "from dataclasses import dataclass\nfrom typing import Optional, Tuple, Type, Union, List\n\nimport torch\n\nfrom text_generat"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/gptq/__init__.py",
    "chars": 15250,
    "preview": "from dataclasses import dataclass\nfrom typing import List, Optional, Union\n\nimport torch\nfrom loguru import logger\nfrom "
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/gptq/hpu.py",
    "chars": 7627,
    "preview": "import math\nimport numpy as np\nimport torch\nimport torch.nn as nn\n\ntry:\n\n    convert_from_uint4 = torch.ops.hpu.convert_"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/gptq/quantize.py",
    "chars": 32293,
    "preview": "import time\nimport torch.nn as nn\nimport math\nimport json\nimport os\nimport torch\nimport transformers\n\nfrom texttable imp"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/gptq/utils.py",
    "chars": 1698,
    "preview": "import torch\n\n\n# copied from https://github.com/openppl-public/ppq/blob/master/ppq/quantization/measure/norm.py\ndef torc"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/layernorm.py",
    "chars": 1836,
    "preview": "import torch\nfrom torch import nn\nfrom accelerate import init_empty_weights\n\n\n# Monkey patching\n@classmethod\ndef load_la"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/linear.py",
    "chars": 1103,
    "preview": "import torch\nfrom torch.nn import functional as F\n\n\nclass FastLinear(torch.nn.Module):\n    def __init__(\n        self,\n "
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/lora.py",
    "chars": 10986,
    "preview": "from typing import TYPE_CHECKING, Optional, List\n\nimport torch\nimport torch.distributed\nfrom torch import nn\nfrom torch."
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/medusa.py",
    "chars": 6380,
    "preview": "import torch\nfrom torch import nn\nfrom typing import Tuple, Optional\nfrom text_generation_server.utils.speculate import "
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/mlp.py",
    "chars": 10125,
    "preview": "import torch\nimport math\nfrom torch import nn\nfrom torch.nn import functional as F\nfrom typing import Optional, Tuple\nfr"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/moe/__init__.py",
    "chars": 8117,
    "preview": "from typing import Optional, Protocol, runtime_checkable\n\nimport torch\nimport torch.nn as nn\nfrom loguru import logger\nf"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/moe/fp8.py",
    "chars": 9272,
    "preview": "from typing import Optional\n\nimport torch\nimport torch.nn as nn\nimport os\n\nfrom text_generation_server.utils.weights imp"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/moe/fused_moe.py",
    "chars": 4654,
    "preview": "# coding=utf-8\n# Copyright 2023, 2024 DeepSeek-AI and The HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/moe/unquantized.py",
    "chars": 5538,
    "preview": "from typing import Optional\n\nimport torch\nimport torch.nn as nn\n\nfrom text_generation_server.utils.weights import Unquan"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/rotary.py",
    "chars": 24014,
    "preview": "import os\nimport math\nimport torch\nfrom torch import nn\nfrom habana_frameworks.torch.hpex.kernels import (\n    RotaryPos"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/speculative.py",
    "chars": 1874,
    "preview": "import torch\nimport json\nfrom typing import Tuple, Optional\nfrom text_generation_server.layers.tensor_parallel import Te"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/layers/tensor_parallel.py",
    "chars": 8876,
    "preview": "import torch\nfrom torch.nn import functional as F\nfrom typing import Iterable, List\nfrom text_generation_server.layers.l"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/__init__.py",
    "chars": 38397,
    "preview": "# ruff: noqa: F821\n# the above line disables the `undefined-name` rule for the model type variables\nimport torch\nimport "
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/bloom_modeling.py",
    "chars": 35411,
    "preview": "# coding=utf-8\n# Copyright 2022 HuggingFace Inc. team and BigScience workshop.\n#\n# Licensed under the Apache License, Ve"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/clip.py",
    "chars": 31123,
    "preview": "from typing import Optional, Tuple\n\nimport torch\nfrom torch import nn\n\nfrom transformers.activations import ACT2FN\nfrom "
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_cohere_modeling.py",
    "chars": 16997,
    "preview": "# coding=utf-8\n# Copyright 2024 Cohere team. All rights reserved.\n#\n# This code is based on EleutherAI's GPT-NeoX librar"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_dbrx_modeling.py",
    "chars": 24210,
    "preview": "# coding=utf-8\n# Copyright 2022 HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under the Apache License, Versi"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py",
    "chars": 24650,
    "preview": "# coding=utf-8\n# Copyright 2023, 2024 DeepSeek-AI and The HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_deepseek_v3_modeling.py",
    "chars": 25017,
    "preview": "# coding=utf-8\n# Copyright 2023, 2024 DeepSeek-AI and The HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py",
    "chars": 19255,
    "preview": "# coding=utf-8\n# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.\n#\n# This code is based on"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py",
    "chars": 25557,
    "preview": "# coding=utf-8\n# Copyright 2024 HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under the Apache License, Versi"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_gemma_modeling.py",
    "chars": 16186,
    "preview": "# coding=utf-8\n# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.\n#\n# This code is based on"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_gpt2_modeling.py",
    "chars": 14560,
    "preview": "# coding=utf-8\n# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.\n#\n# This code is based on"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_gptj_modeling.py",
    "chars": 12718,
    "preview": "# coding=utf-8\n# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.\n#\n# This code is based on"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_llama4_modeling.py",
    "chars": 54743,
    "preview": "# coding=utf-8\n# Copyright 2025 The LLAMA4 and HuggingFace Inc. team. All rights reserved.\n#\n#\n# Licensed under the Apac"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py",
    "chars": 22486,
    "preview": "# coding=utf-8\n# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.\n#\n# This code is based on"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_llava_next.py",
    "chars": 12344,
    "preview": "# coding=utf-8\n# Copyright 2024 the HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under the Apache License, V"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py",
    "chars": 16280,
    "preview": "# coding=utf-8\n# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.\n#\n# This code is based on"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py",
    "chars": 17321,
    "preview": "# coding=utf-8\n# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.\n#\n# This code is based on"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_mllama.py",
    "chars": 34909,
    "preview": "# coding=utf-8\n# Copyright 2024 the HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under the Apache License, V"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_neox_modeling.py",
    "chars": 13889,
    "preview": "# coding=utf-8\n# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.\n#\n# This code is based on"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_pali_gemma_modeling.py",
    "chars": 4723,
    "preview": "# coding=utf-8\n# Copyright 2024 HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under the Apache License, Versi"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_phi_modeling.py",
    "chars": 13887,
    "preview": "import torch\nimport torch.distributed\n\nfrom torch import nn\nfrom transformers.activations import ACT2FN\nfrom transformer"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_phi_moe_modeling.py",
    "chars": 12381,
    "preview": "# coding=utf-8\n# Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under the Apa"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py",
    "chars": 12049,
    "preview": "import torch\nimport torch.distributed\n\nfrom torch import nn\nfrom transformers.activations import ACT2FN\nfrom typing impo"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_qwen3_modeling.py",
    "chars": 11984,
    "preview": "# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with "
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_qwen3_moe_modeling.py",
    "chars": 19514,
    "preview": "# coding=utf-8\n# Copyright 5 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.\n#\n# Licens"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_rw_modeling.py",
    "chars": 21658,
    "preview": "from typing import List, Optional, Tuple\n\nimport torch\nimport torch.distributed\nfrom torch import nn\nfrom transformers.c"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_santacoder_modeling.py",
    "chars": 17118,
    "preview": "import torch\nimport torch.distributed\n\nfrom torch import nn\nfrom transformers.activations import ACT2FN\nfrom typing impo"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/flash_starcoder2_modeling.py",
    "chars": 19604,
    "preview": "# coding=utf-8\n# Copyright 2024 Starcoder2 AI and the HuggingFace Inc. team. All rights reserved.\n#\n# This code is based"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/idefics2.py",
    "chars": 32512,
    "preview": "# coding=utf-8\n# Copyright 2024 the HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under the Apache License, V"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/idefics3.py",
    "chars": 23243,
    "preview": "# coding=utf-8\n# Copyright 2024 the HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under the Apache License, V"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/mamba_modeling.py",
    "chars": 8825,
    "preview": "import torch\nimport torch.distributed\n\nfrom mamba_ssm.ops.triton.selective_state_update import selective_state_update\nfr"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/qwen2_5_vl.py",
    "chars": 38213,
    "preview": "# coding=utf-8\n# Copyright 2025 the HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under the Apache License, V"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/qwen2_vl.py",
    "chars": 20466,
    "preview": "# coding=utf-8\n# Copyright 2024 the HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under the Apache License, V"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/siglip.py",
    "chars": 15179,
    "preview": "from typing import Optional, Tuple\nimport warnings\nimport math\nimport torch\nfrom torch import nn\n\nfrom transformers.acti"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/custom_modeling/vlm.py",
    "chars": 2604,
    "preview": "def load_text_model(prefix, config, weights, name=None):\n    if config.model_type == \"llama\":\n        from text_generati"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/flash_causal_lm.py",
    "chars": 106889,
    "preview": "import math\nimport os\nimport time\nimport torch\nimport torch.distributed\n\nimport numpy as np\n\nfrom loguru import logger\nf"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/flash_vlm_causal_lm.py",
    "chars": 41037,
    "preview": "import torch\nfrom PIL import Image\nfrom io import BytesIO\nfrom dataclasses import dataclass\nfrom opentelemetry import tr"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/globals.py",
    "chars": 1370,
    "preview": "import os\nfrom typing import Dict, Optional\nfrom loguru import logger\nfrom text_generation_server.utils.log import log_m"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/mllama_causal_lm.py",
    "chars": 25379,
    "preview": "import torch\n\nimport numpy as np\n\nfrom typing import Iterable, Optional, Tuple, List, Dict\nfrom text_generation_server.p"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/model.py",
    "chars": 4987,
    "preview": "import inspect\nimport torch\n\nfrom abc import ABC, abstractmethod\nfrom typing import List, Tuple, Optional, TypeVar, Type"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/seq2seq_lm.py",
    "chars": 34256,
    "preview": "import torch\nimport torch.distributed\nimport time\nfrom dataclasses import dataclass\nfrom opentelemetry import trace\nfrom"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/models/types.py",
    "chars": 2676,
    "preview": "import torch\n\nfrom abc import ABC, abstractmethod\nfrom dataclasses import dataclass\nfrom typing import List, Optional\n\nf"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/pb/.gitignore",
    "chars": 18,
    "preview": "*.py\n*.pyi\n*.py-e\n"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/server.py",
    "chars": 11152,
    "preview": "# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company.\n\nimport asyncio\nimport os\nimport torch\nimport time\nimport signa"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/tracing.py",
    "chars": 2356,
    "preview": "import grpc\n\nfrom opentelemetry import trace\nfrom opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanE"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/__init__.py",
    "chars": 1388,
    "preview": "# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company.\n\nfrom text_generation_server.utils.convert import convert_file,"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/adapter.py",
    "chars": 10253,
    "preview": "# Origin:   https://github.com/predibase/lorax\n# Path:     lorax/server/lorax_server/utils/adapter.py\n# License:  Apache"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/chunks.py",
    "chars": 819,
    "preview": "from typing import Iterable\n\nfrom loguru import logger\n\nfrom text_generation_server.pb import generate_pb2\n\n\ndef concat_"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/convert.py",
    "chars": 4320,
    "preview": "import datetime\nimport torch\nimport os\n\nfrom loguru import logger\nfrom pathlib import Path\nfrom safetensors.torch import"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/debug.py",
    "chars": 1304,
    "preview": "# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company.\n\nimport os\nimport glob\nimport time\n\nimport habana_frameworks.to"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/dist.py",
    "chars": 1849,
    "preview": "import os\nimport torch\nfrom torch.distributed import ProcessGroup\nfrom datetime import timedelta\nfrom loguru import logg"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/hub.py",
    "chars": 7808,
    "preview": "import time\nimport os\n\nfrom datetime import timedelta\nfrom loguru import logger\nfrom pathlib import Path\nfrom typing imp"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/import_utils.py",
    "chars": 328,
    "preview": "import torch\n\n\ndef get_hpu_free_memory(device, memory_fraction):\n    free_hpu_memory, _ = torch.hpu.mem_get_info()\n    r"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/kernels.py",
    "chars": 593,
    "preview": "import importlib\n\nfrom loguru import logger\nfrom hf_kernels import load_kernel as hf_load_kernel\n\nfrom text_generation_s"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/log.py",
    "chars": 281,
    "preview": "from functools import lru_cache\nfrom text_generation_server.utils.dist import RANK\n\n\n@lru_cache(10)\ndef log_once(log, ms"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/logits_process.py",
    "chars": 21947,
    "preview": "import math\nimport torch\nimport habana_frameworks.torch.core as htcore\n\nfrom loguru import logger\nfrom typing import Dic"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/merges/strategies.py",
    "chars": 7351,
    "preview": "import copy\nfrom abc import ABC\nfrom collections import defaultdict\nfrom typing import TYPE_CHECKING, Dict, List, Tuple,"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/merges/utils.py",
    "chars": 3973,
    "preview": "# coding=utf-8\n# From: https://github.com/huggingface/peft/pull/1364\n# Copyright 2024-present the HuggingFace Inc. team."
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/peft.py",
    "chars": 2184,
    "preview": "import os\nfrom typing import Union\nfrom loguru import logger\nimport torch\n\nfrom transformers import AutoTokenizer\nfrom p"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/prefill_chunking.py",
    "chars": 552,
    "preview": "from typing import Optional\n\nSUPPORT_CHUNKING: Optional[bool] = None\nMAX_PREFILL_TOKENS: Optional[int] = None\n\n\ndef set_"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/quantization.py",
    "chars": 5895,
    "preview": "import json\nimport os\nfrom dataclasses import dataclass\nfrom typing import Optional, List\n\nfrom huggingface_hub import h"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/segments.py",
    "chars": 2646,
    "preview": "# Origin:   https://github.com/predibase/lorax\n# Path:     lorax/server/lorax_server/utils/segments.py\n# License:  Apach"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/sgmv.py",
    "chars": 7829,
    "preview": "# Origin:   https://github.com/predibase/lorax\n# Path:     lorax/server/lorax_server/utils/sgmv.py\n# License:  Apache Li"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/speculate.py",
    "chars": 173,
    "preview": "SPECULATE = None\n\n\ndef get_speculate() -> int:\n    global SPECULATE\n    return SPECULATE\n\n\ndef set_speculate(speculate: "
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/tokens.py",
    "chars": 27880,
    "preview": "# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company.\n\nimport re\nfrom typing import List, Optional, Tuple, Set, Union"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/version.py",
    "chars": 901,
    "preview": "from packaging.version import Version\nfrom packaging import version\nimport subprocess\n\n\ndef get_driver_version():\n    \"\""
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/watermark.py",
    "chars": 3536,
    "preview": "# coding=utf-8\n# Copyright 2023 Authors of \"A Watermark for Large Language Models\"\n# available at https://arxiv.org/abs/"
  },
  {
    "path": "backends/gaudi/server/text_generation_server/utils/weights.py",
    "chars": 15886,
    "preview": "import torch\n\nfrom abc import ABC, abstractmethod\nfrom contextlib import contextmanager\nfrom pathlib import Path\nfrom ty"
  },
  {
    "path": "backends/gaudi/tgi-entrypoint.sh",
    "chars": 352,
    "preview": "#!/bin/bash\n\nldconfig 2>/dev/null || echo 'unable to refresh ld cache, not a big deal in most cases'\n\n# Check if --shard"
  },
  {
    "path": "backends/grpc-metadata/Cargo.toml",
    "chars": 173,
    "preview": "[package]\nname = \"grpc-metadata\"\nversion = \"0.1.0\"\nedition = \"2021\"\n\n[dependencies]\nopentelemetry = \"^0.20\"\ntonic = \"^0."
  },
  {
    "path": "backends/grpc-metadata/src/lib.rs",
    "chars": 1394,
    "preview": "//! A crate to extract and inject a OpenTelemetry context from and to a gRPC request.\n//! Inspired by: https://github.co"
  },
  {
    "path": "backends/llamacpp/Cargo.toml",
    "chars": 512,
    "preview": "[package]\nname = \"text-generation-router-llamacpp\"\nversion.workspace = true\nedition.workspace = true\nauthors.workspace ="
  },
  {
    "path": "backends/llamacpp/README.md",
    "chars": 727,
    "preview": "# Llamacpp backend\n\nIf all your dependencies are installed at the system level, running\ncargo build should be sufficient"
  },
  {
    "path": "backends/llamacpp/build.rs",
    "chars": 1643,
    "preview": "use bindgen::callbacks::{ItemInfo, ParseCallbacks};\nuse std::env;\nuse std::path::PathBuf;\n\n#[derive(Debug)]\nstruct Prefi"
  },
  {
    "path": "backends/llamacpp/requirements.txt",
    "chars": 75,
    "preview": "transformers==4.49\nhuggingface-hub==0.28.1\nhf-transfer==0.1.9\ntorch==2.6.0\n"
  },
  {
    "path": "backends/llamacpp/src/backend.rs",
    "chars": 24001,
    "preview": "use crate::llamacpp;\n\nuse async_trait::async_trait;\nuse std::ffi::CString;\nuse std::mem::replace;\nuse std::str::FromStr;"
  },
  {
    "path": "backends/llamacpp/src/llamacpp.rs",
    "chars": 165,
    "preview": "#![allow(non_upper_case_globals)]\n#![allow(non_camel_case_types)]\n#![allow(non_snake_case)]\n#![allow(dead_code)]\ninclude"
  },
  {
    "path": "backends/llamacpp/src/main.rs",
    "chars": 11114,
    "preview": "mod backend;\nmod llamacpp;\nmod quantize;\n\nuse quantize::QuantizeType;\n\nuse backend::{\n    BackendError, LlamacppBackend,"
  },
  {
    "path": "backends/llamacpp/src/quantize.rs",
    "chars": 946,
    "preview": "use crate::llamacpp;\n\nuse std::ffi::CString;\n\n#[repr(u32)]\n#[derive(Debug, Clone, Copy)]\npub enum QuantizeType {\n    Mos"
  },
  {
    "path": "backends/neuron/Cargo.toml",
    "chars": 1046,
    "preview": "[workspace]\nmembers = [\n  \"backends/v2\",\n  \"backends/grpc-metadata\",\n  \"launcher\",\n  \"router\"\n]\ndefault-members = [\n  \"b"
  },
  {
    "path": "backends/neuron/Makefile",
    "chars": 1413,
    "preview": "#  Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n#  Licensed under the Apache License, Version 2.0 (the \"L"
  },
  {
    "path": "backends/neuron/README.md",
    "chars": 774,
    "preview": "# Text-generation-inference - Neuron backend for AWS Trainium and inferentia2\n\n## Description\n\nThis is the TGI backend f"
  },
  {
    "path": "backends/neuron/server/.gitignore",
    "chars": 6,
    "preview": "build\n"
  },
  {
    "path": "backends/neuron/server/Makefile",
    "chars": 2492,
    "preview": "# Initialize base variables\nSHELL := /bin/bash\npkg_name := text_generation_server\nBUILDDIR ?= $(CURDIR)/build\nVERSION ?="
  },
  {
    "path": "backends/neuron/server/build-requirements.txt",
    "chars": 41,
    "preview": "build\ngrpcio-tools==1.53.0\nmypy-protobuf\n"
  },
  {
    "path": "backends/neuron/server/pyproject.toml",
    "chars": 734,
    "preview": "[build-system]\nrequires = [\"setuptools>=78.1\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"text-generatio"
  },
  {
    "path": "backends/neuron/server/text_generation_server/cli.py",
    "chars": 3816,
    "preview": "import sys\nfrom typing import Optional\n\nimport typer\nfrom loguru import logger\n\n\napp = typer.Typer()\n\n\n@app.command()\nde"
  },
  {
    "path": "backends/neuron/server/text_generation_server/generator.py",
    "chars": 29253,
    "preview": "import copy\nimport logging\nimport time\nfrom abc import ABC\nfrom enum import Enum\nfrom typing import List, Optional, Tupl"
  },
  {
    "path": "backends/neuron/server/text_generation_server/interceptor.py",
    "chars": 907,
    "preview": "from typing import Any, Callable\n\nimport grpc\nfrom google.rpc import code_pb2, status_pb2\nfrom grpc_interceptor.server i"
  },
  {
    "path": "backends/neuron/server/text_generation_server/model.py",
    "chars": 4606,
    "preview": "import os\nimport shutil\nimport time\nfrom typing import Optional\n\nfrom huggingface_hub import snapshot_download\nfrom hugg"
  },
  {
    "path": "backends/neuron/server/text_generation_server/server.py",
    "chars": 3136,
    "preview": "import asyncio\nfrom pathlib import Path\nfrom typing import List\n\nfrom grpc import aio\nfrom grpc_reflection.v1alpha impor"
  },
  {
    "path": "backends/neuron/server/text_generation_server/tgi_env.py",
    "chars": 9966,
    "preview": "#!/usr/bin/env python\n\nimport argparse\nimport logging\nimport os\nimport sys\nfrom typing import Any, Dict, List, Optional\n"
  },
  {
    "path": "backends/neuron/tests/conftest.py",
    "chars": 36,
    "preview": "pytest_plugins = [\"fixtures.model\"]\n"
  },
  {
    "path": "backends/neuron/tests/fixtures/model.py",
    "chars": 3750,
    "preview": "import copy\nimport logging\nimport subprocess\nimport sys\nfrom tempfile import TemporaryDirectory\n\nimport os\nimport pytest"
  }
]

// ... and 664 more files (download for full content)

About this extraction

This page contains the full source code of the huggingface/text-generation-inference GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 864 files (6.3 MB), approximately 1.7M tokens, and a symbol index with 5123 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
