Full Code of InternLM/lmdeploy for AI

2.1M tokens
7894 symbols
Repository: InternLM/lmdeploy
Branch: main
Commit: 764f35a85b0e
Files: 1274
Total size: 7.7 MB

Directory structure:
gitextract_4p86pot8/

├── .clang-format
├── .claude/
│   └── skills/
│       ├── check-env/
│       │   └── SKILL.md
│       ├── code-navigation/
│       │   └── SKILL.md
│       ├── resolve-review/
│       │   └── SKILL.md
│       ├── submit-pr/
│       │   └── SKILL.md
│       └── support-new-model/
│           └── SKILL.md
├── .github/
│   ├── CONTRIBUTING.md
│   ├── ISSUE_TEMPLATE/
│   │   ├── 1-bug-report.yml
│   │   ├── 2-feature-request.yml
│   │   └── 3-documentation.yml
│   ├── pull_request_template.md
│   ├── release.yml
│   ├── scripts/
│   │   ├── action_tools.py
│   │   ├── check_lmdeploy.py
│   │   ├── doc_link_checker.py
│   │   ├── eval_base_config.py
│   │   ├── eval_chat_config.py
│   │   ├── eval_regression_base_models.py
│   │   ├── eval_regression_chat_models.py
│   │   ├── eval_stable_object_config.py
│   │   └── eval_stable_subject_config.py
│   └── workflows/
│       ├── api_eval.yml
│       ├── benchmark.yml
│       ├── cuda12.8_whl_release.yml
│       ├── daily_ete_test.yml
│       ├── daily_ete_test_3090.yml
│       ├── daily_ete_test_5080.yml
│       ├── docker.yml
│       ├── docker_dev.yml
│       ├── evaluate.yml
│       ├── lint.yml
│       ├── linux_x64_gpu.yml
│       ├── mllm_api_eval.yml
│       ├── pr_ete_test.yml
│       ├── pypi.yml
│       ├── stable.yml
│       ├── stale.yml
│       ├── test_docker.yml
│       ├── unit_test.yml
│       └── windows_x64_gpu.yml
├── .gitignore
├── .pre-commit-config.yaml
├── .pylintrc
├── CLAUDE.md
├── CMakeLists.txt
├── LICENSE
├── MANIFEST.in
├── README.md
├── README_ja.md
├── README_zh-CN.md
├── autotest/
│   ├── benchmark/
│   │   ├── test_apiserver_performance.py
│   │   ├── test_longtext_performance.py
│   │   ├── test_mllm_apiserver_performance.py
│   │   ├── test_prefixcache_performance.py
│   │   └── test_throughput_performance.py
│   ├── chat_prompt_case.yml
│   ├── config.yml
│   ├── config_3090.yml
│   ├── config_3090_legacy.yml
│   ├── config_5080.yml
│   ├── config_5080_legacy.yml
│   ├── config_ascend.yml
│   ├── config_h.yml
│   ├── config_h800.yml
│   ├── config_h_legacy.yml
│   ├── config_legacy.yml
│   ├── config_test.yml
│   ├── config_testascend.yml
│   ├── conftest.py
│   ├── evaluate/
│   │   ├── eval_config_chat.py
│   │   ├── test_api_evaluate.py
│   │   └── test_mllm_api_evaluate.py
│   ├── interface/
│   │   ├── pipeline/
│   │   │   ├── test_pipeline_func.py
│   │   │   └── test_pipeline_longtext_func.py
│   │   └── restful/
│   │       ├── test_restful_chat_completions_v1.py
│   │       ├── test_restful_completions_v1.py
│   │       └── test_restful_generate.py
│   ├── prompt_case.yml
│   ├── pytest.ini
│   ├── template.json
│   ├── toolchain/
│   │   └── test_lagent.py
│   ├── tools/
│   │   ├── chat/
│   │   │   ├── test_command_chat_hf_pytorch.py
│   │   │   └── test_command_chat_hf_turbomind.py
│   │   ├── common_case_config.py
│   │   ├── pipeline/
│   │   │   ├── llm_case.py
│   │   │   ├── mllm_case.py
│   │   │   ├── test_pipeline_chat_pytorch_llm.py
│   │   │   ├── test_pipeline_chat_pytorch_mllm.py
│   │   │   ├── test_pipeline_chat_turbomind_llm.py
│   │   │   └── test_pipeline_chat_turbomind_mllm.py
│   │   ├── quantization/
│   │   │   ├── test_quantization_awq.py
│   │   │   └── test_quantization_w8a8.py
│   │   └── restful/
│   │       ├── test_restful_chat_hf_pytorch_llm.py
│   │       ├── test_restful_chat_hf_pytorch_mllm.py
│   │       ├── test_restful_chat_hf_turbomind_llm.py
│   │       └── test_restful_chat_hf_turbomind_mllm.py
│   └── utils/
│       ├── benchmark_utils.py
│       ├── common_utils.py
│       ├── config_utils.py
│       ├── constant.py
│       ├── evaluate_utils.py
│       ├── get_run_config.py
│       ├── mp_log_utils.py
│       ├── pipeline_chat.py
│       ├── proxy_distributed_utils.py
│       ├── quantization_utils.py
│       ├── ray_distributed_utils.py
│       ├── restful_return_check.py
│       ├── rule_condition_assert.py
│       ├── run_client_chat.py
│       ├── run_restful_chat.py
│       └── toolkit.py
├── benchmark/
│   ├── README.md
│   ├── benchmark_decode.py
│   ├── benchmark_pipeline.py
│   ├── benchmark_serving.py
│   ├── benchmark_throughput.py
│   ├── lmdeploy.yml
│   ├── profile_pipeline_api.py
│   ├── profile_restful_api.py
│   └── profile_throughput.py
├── builder/
│   ├── manywheel/
│   │   ├── Dockerfile_2014
│   │   ├── README.md
│   │   ├── build_all_lmdeploy_builders.sh
│   │   ├── build_all_wheel.sh
│   │   ├── build_lmdeploy_builder.sh
│   │   ├── build_wheel.sh
│   │   ├── entrypoint_build.sh
│   │   └── scripts/
│   │       ├── install_conda.sh
│   │       ├── install_cuda.sh
│   │       └── install_openmpi.sh
│   └── windows/
│       ├── README.md
│       ├── generate.ps1
│       └── setup_cuda.ps1
├── cmake/
│   ├── Modules/
│   │   └── FindNCCL.cmake
│   ├── TritonTurboMindBackendConfig.cmake.in
│   ├── TurboMindConfig.cmake.in
│   └── yaml-cpp_cmake_policy.patch
├── debug.sh
├── docker/
│   ├── Dockerfile
│   ├── Dockerfile.jetson
│   ├── Dockerfile_ascend_a2_300i
│   ├── Dockerfile_ascend_a3
│   ├── Dockerfile_dev
│   ├── InternVL_Dockerfile
│   ├── Qwen2VL_Dockerfile
│   ├── build.sh
│   ├── install.sh
│   └── prepare_wheel.sh
├── docs/
│   ├── en/
│   │   ├── .readthedocs.yaml
│   │   ├── Makefile
│   │   ├── _static/
│   │   │   └── css/
│   │   │       └── readthedocs.css
│   │   ├── advance/
│   │   │   ├── chat_template.md
│   │   │   ├── context_parallel.md
│   │   │   ├── debug_turbomind.md
│   │   │   ├── long_context.md
│   │   │   ├── metrics.md
│   │   │   ├── pytorch_multinodes.md
│   │   │   ├── pytorch_multithread.md
│   │   │   ├── pytorch_new_model.md
│   │   │   ├── pytorch_profiling.md
│   │   │   ├── spec_decoding.md
│   │   │   ├── structed_output.md
│   │   │   └── update_weights.md
│   │   ├── api/
│   │   │   ├── cli.rst
│   │   │   ├── openapi.rst
│   │   │   └── pipeline.rst
│   │   ├── benchmark/
│   │   │   ├── a100_fp16.md
│   │   │   ├── benchmark.md
│   │   │   ├── evaluate_with_opencompass.md
│   │   │   └── evaluate_with_vlmevalkit.md
│   │   ├── conf.py
│   │   ├── faq.md
│   │   ├── get_started/
│   │   │   ├── ascend/
│   │   │   │   └── get_started.md
│   │   │   ├── camb/
│   │   │   │   └── get_started.md
│   │   │   ├── get_started.md
│   │   │   ├── index.rst
│   │   │   ├── installation.md
│   │   │   └── maca/
│   │   │       └── get_started.md
│   │   ├── index.rst
│   │   ├── inference/
│   │   │   ├── load_hf.md
│   │   │   ├── pytorch.md
│   │   │   ├── turbomind.md
│   │   │   └── turbomind_config.md
│   │   ├── llm/
│   │   │   ├── api_server.md
│   │   │   ├── api_server_lora.md
│   │   │   ├── api_server_reasoning.md
│   │   │   ├── api_server_tools.md
│   │   │   ├── codellama.md
│   │   │   ├── pipeline.md
│   │   │   └── proxy_server.md
│   │   ├── make.bat
│   │   ├── multi_modal/
│   │   │   ├── api_server_vl.md
│   │   │   ├── cogvlm.md
│   │   │   ├── deepseek_vl2.md
│   │   │   ├── gemma3.md
│   │   │   ├── index.rst
│   │   │   ├── internvl.md
│   │   │   ├── llava.md
│   │   │   ├── minicpmv.md
│   │   │   ├── molmo.md
│   │   │   ├── phi3.md
│   │   │   ├── qwen2_5_vl.md
│   │   │   ├── qwen2_vl.md
│   │   │   ├── vl_pipeline.md
│   │   │   └── xcomposer2d5.md
│   │   ├── quantization/
│   │   │   ├── kv_quant.md
│   │   │   ├── llm_compressor.md
│   │   │   ├── w4a16.md
│   │   │   └── w8a8.md
│   │   └── supported_models/
│   │       ├── reward_models.md
│   │       └── supported_models.md
│   └── zh_cn/
│       ├── .readthedocs.yaml
│       ├── Makefile
│       ├── _static/
│       │   └── css/
│       │       └── readthedocs.css
│       ├── advance/
│       │   ├── chat_template.md
│       │   ├── context_parallel.md
│       │   ├── debug_turbomind.md
│       │   ├── long_context.md
│       │   ├── metrics.md
│       │   ├── pytorch_multinodes.md
│       │   ├── pytorch_multithread.md
│       │   ├── pytorch_new_model.md
│       │   ├── pytorch_profiling.md
│       │   ├── spec_decoding.md
│       │   ├── structed_output.md
│       │   └── update_weights.md
│       ├── api/
│       │   ├── cli.rst
│       │   ├── openapi.rst
│       │   └── pipeline.rst
│       ├── benchmark/
│       │   ├── benchmark.md
│       │   ├── evaluate_with_opencompass.md
│       │   └── evaluate_with_vlmevalkit.md
│       ├── conf.py
│       ├── faq.md
│       ├── get_started/
│       │   ├── ascend/
│       │   │   └── get_started.md
│       │   ├── camb/
│       │   │   └── get_started.md
│       │   ├── get_started.md
│       │   ├── index.rst
│       │   ├── installation.md
│       │   └── maca/
│       │       └── get_started.md
│       ├── index.rst
│       ├── inference/
│       │   ├── load_hf.md
│       │   ├── pytorch.md
│       │   ├── turbomind.md
│       │   └── turbomind_config.md
│       ├── llm/
│       │   ├── api_server.md
│       │   ├── api_server_lora.md
│       │   ├── api_server_reasoning.md
│       │   ├── api_server_tools.md
│       │   ├── codellama.md
│       │   ├── pipeline.md
│       │   └── proxy_server.md
│       ├── make.bat
│       ├── multi_modal/
│       │   ├── api_server_vl.md
│       │   ├── cogvlm.md
│       │   ├── deepseek_vl2.md
│       │   ├── gemma3.md
│       │   ├── index.rst
│       │   ├── internvl.md
│       │   ├── llava.md
│       │   ├── minicpmv.md
│       │   ├── molmo.md
│       │   ├── phi3.md
│       │   ├── qwen2_5_vl.md
│       │   ├── qwen2_vl.md
│       │   ├── vl_pipeline.md
│       │   └── xcomposer2d5.md
│       ├── quantization/
│       │   ├── kv_quant.md
│       │   ├── llm_compressor.md
│       │   ├── w4a16.md
│       │   └── w8a8.md
│       └── supported_models/
│           ├── reward_models.md
│           └── supported_models.md
├── eval/
│   ├── config.py
│   └── eval.py
├── examples/
│   └── lite/
│       ├── qwen3_30b_a3b_awq.py
│       └── qwen3_30b_a3b_gptq.py
├── generate.sh
├── k8s/
│   ├── deployment.yaml
│   └── service.yaml
├── lmdeploy/
│   ├── __init__.py
│   ├── __main__.py
│   ├── api.py
│   ├── archs.py
│   ├── cli/
│   │   ├── __init__.py
│   │   ├── chat.py
│   │   ├── cli.py
│   │   ├── entrypoint.py
│   │   ├── lite.py
│   │   ├── serve.py
│   │   └── utils.py
│   ├── lite/
│   │   ├── __init__.py
│   │   ├── apis/
│   │   │   ├── __init__.py
│   │   │   ├── auto_awq.py
│   │   │   ├── calibrate.py
│   │   │   ├── get_small_sharded_hf.py
│   │   │   ├── gptq.py
│   │   │   └── smooth_quant.py
│   │   ├── defaults.py
│   │   ├── modeling/
│   │   │   ├── __init__.py
│   │   │   ├── internlm2_gptq.py
│   │   │   └── internlm3_gptq.py
│   │   ├── quantization/
│   │   │   ├── __init__.py
│   │   │   ├── activation/
│   │   │   │   ├── __init__.py
│   │   │   │   └── observer.py
│   │   │   ├── awq.py
│   │   │   ├── calibration.py
│   │   │   ├── modules/
│   │   │   │   ├── __init__.py
│   │   │   │   └── linear.py
│   │   │   └── weight/
│   │   │       ├── __init__.py
│   │   │       ├── quant_utils.py
│   │   │       └── quantizer.py
│   │   └── utils/
│   │       ├── __init__.py
│   │       ├── batch_split.py
│   │       ├── cal_qparams.py
│   │       ├── calib_dataloader.py
│   │       ├── collect.py
│   │       ├── global_avail.py
│   │       ├── load.py
│   │       └── memory_efficient.py
│   ├── logger.py
│   ├── messages.py
│   ├── metrics/
│   │   ├── __init__.py
│   │   ├── loggers.py
│   │   ├── metrics_processor.py
│   │   └── stats.py
│   ├── model.py
│   ├── monitoring/
│   │   ├── docker-compose.yaml
│   │   ├── grafana/
│   │   │   ├── dashboards/
│   │   │   │   ├── config/
│   │   │   │   │   └── dashboard.yaml
│   │   │   │   └── json/
│   │   │   │       └── lmdeploy-dashboard.json
│   │   │   └── datasources/
│   │   │       └── datasource.yaml
│   │   └── prometheus.yaml
│   ├── pipeline.py
│   ├── profiler.py
│   ├── pytorch/
│   │   ├── __init__.py
│   │   ├── adapter/
│   │   │   ├── __init__.py
│   │   │   └── adapter.py
│   │   ├── backends/
│   │   │   ├── __init__.py
│   │   │   ├── activation.py
│   │   │   ├── apply_rotary_emb.py
│   │   │   ├── attention.py
│   │   │   ├── awq_modules.py
│   │   │   ├── base.py
│   │   │   ├── blockedf8_modules.py
│   │   │   ├── causal_conv1d.py
│   │   │   ├── cuda/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── activation.py
│   │   │   │   ├── apply_rotary_emb.py
│   │   │   │   ├── attention/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── default.py
│   │   │   │   │   ├── fa3.py
│   │   │   │   │   └── mla.py
│   │   │   │   ├── awq_modules.py
│   │   │   │   ├── blockedf8_modules.py
│   │   │   │   ├── causal_conv1d.py
│   │   │   │   ├── flash_attention.py
│   │   │   │   ├── gated_delta_rule.py
│   │   │   │   ├── graph_runner.py
│   │   │   │   ├── lora.py
│   │   │   │   ├── moe/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── blocked_fp8.py
│   │   │   │   │   ├── default.py
│   │   │   │   │   ├── ep_utils.py
│   │   │   │   │   └── w8a8.py
│   │   │   │   ├── moe_router.py
│   │   │   │   ├── multinomial_sampling.py
│   │   │   │   ├── norm.py
│   │   │   │   ├── nsa.py
│   │   │   │   ├── op_backend.py
│   │   │   │   ├── qmodules.py
│   │   │   │   ├── token_dispatcher.py
│   │   │   │   ├── utils.py
│   │   │   │   └── warmup_manager.py
│   │   │   ├── deepep_moe_checker.py
│   │   │   ├── default/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── activation.py
│   │   │   │   ├── apply_rotary_emb.py
│   │   │   │   ├── awq_modules.py
│   │   │   │   ├── embedding.py
│   │   │   │   ├── linear.py
│   │   │   │   ├── moe.py
│   │   │   │   ├── moe_router.py
│   │   │   │   ├── multinomial_sampling.py
│   │   │   │   ├── norm.py
│   │   │   │   ├── op_backend.py
│   │   │   │   ├── rotary_embedding.py
│   │   │   │   └── token_dispatcher.py
│   │   │   ├── dlinfer/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── activation.py
│   │   │   │   ├── apply_rotary_emb.py
│   │   │   │   ├── ascend/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── op_backend.py
│   │   │   │   │   └── utils.py
│   │   │   │   ├── attention.py
│   │   │   │   ├── awq_modules.py
│   │   │   │   ├── camb/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   └── op_backend.py
│   │   │   │   ├── flash_attention.py
│   │   │   │   ├── linear.py
│   │   │   │   ├── maca/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   └── op_backend.py
│   │   │   │   ├── moe.py
│   │   │   │   ├── norm.py
│   │   │   │   ├── op_backend.py
│   │   │   │   ├── qmodules.py
│   │   │   │   └── rotary_embedding.py
│   │   │   ├── embedding.py
│   │   │   ├── flash_attention.py
│   │   │   ├── gated_delta_rule.py
│   │   │   ├── graph_runner.py
│   │   │   ├── linear.py
│   │   │   ├── lora.py
│   │   │   ├── moe.py
│   │   │   ├── moe_router.py
│   │   │   ├── multinomial_sampling.py
│   │   │   ├── norm.py
│   │   │   ├── nsa.py
│   │   │   ├── qmodules.py
│   │   │   ├── rotary_embedding.py
│   │   │   ├── selector.py
│   │   │   └── token_dispatcher.py
│   │   ├── block.py
│   │   ├── check_env/
│   │   │   ├── __init__.py
│   │   │   ├── adapter.py
│   │   │   ├── base.py
│   │   │   ├── cuda.py
│   │   │   ├── deeplink.py
│   │   │   ├── dist.py
│   │   │   ├── model.py
│   │   │   ├── torch.py
│   │   │   ├── transformers.py
│   │   │   ├── triton.py
│   │   │   └── triton_custom_add.py
│   │   ├── config.py
│   │   ├── configurations/
│   │   │   ├── __init__.py
│   │   │   ├── builder.py
│   │   │   ├── chatglm.py
│   │   │   ├── cogvlm.py
│   │   │   ├── deepseek_v2.py
│   │   │   ├── deepseek_v32.py
│   │   │   ├── deepseek_vl2.py
│   │   │   ├── default.py
│   │   │   ├── gemma.py
│   │   │   ├── glm4.py
│   │   │   ├── gpt_oss.py
│   │   │   ├── interns1_pro.py
│   │   │   ├── internvl.py
│   │   │   ├── internvl3_hf.py
│   │   │   ├── llama.py
│   │   │   ├── llama4.py
│   │   │   ├── llava_hf.py
│   │   │   ├── minicpm3.py
│   │   │   ├── qwen.py
│   │   │   ├── qwen3_5.py
│   │   │   ├── qwen3_next.py
│   │   │   ├── qwen3_vl.py
│   │   │   ├── sdar.py
│   │   │   └── utils.py
│   │   ├── consts.py
│   │   ├── devices/
│   │   │   ├── __init__.py
│   │   │   └── device_manager.py
│   │   ├── disagg/
│   │   │   ├── README.md
│   │   │   ├── __init__.py
│   │   │   ├── backend/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── backend.py
│   │   │   │   ├── base.py
│   │   │   │   ├── dlslime.py
│   │   │   │   └── mooncake.py
│   │   │   ├── config.py
│   │   │   ├── conn/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── engine_conn.py
│   │   │   │   ├── protocol.py
│   │   │   │   └── proxy_conn.py
│   │   │   └── messages.py
│   │   ├── distributed.py
│   │   ├── engine/
│   │   │   ├── __init__.py
│   │   │   ├── base.py
│   │   │   ├── cache_engine.py
│   │   │   ├── config_builder.py
│   │   │   ├── engine.py
│   │   │   ├── engine_checker.py
│   │   │   ├── engine_instance.py
│   │   │   ├── engine_loop.py
│   │   │   ├── executor/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── base.py
│   │   │   │   ├── base_worker.py
│   │   │   │   ├── dist_utils.py
│   │   │   │   ├── mp_executor.py
│   │   │   │   ├── ray_executor.py
│   │   │   │   └── uni_executor.py
│   │   │   ├── guided_process.py
│   │   │   ├── input_process.py
│   │   │   ├── inputs_maker.py
│   │   │   ├── logits_process.py
│   │   │   ├── model_agent/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── agent.py
│   │   │   │   ├── inputs_maker.py
│   │   │   │   └── profiler.py
│   │   │   ├── mp_engine/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── base.py
│   │   │   │   ├── base_worker.py
│   │   │   │   ├── ray_engine.py
│   │   │   │   ├── zmq_engine.py
│   │   │   │   └── zmq_rpc.py
│   │   │   └── request.py
│   │   ├── envs.py
│   │   ├── kernels/
│   │   │   ├── __init__.py
│   │   │   ├── cuda/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── activation.py
│   │   │   │   ├── apply_rotary_pos_emb.py
│   │   │   │   ├── awq_kernels.py
│   │   │   │   ├── bitonic_topk.py
│   │   │   │   ├── blocked_fp8_fused_moe.py
│   │   │   │   ├── blocked_gemm_fp8.py
│   │   │   │   ├── causal_conv1d.py
│   │   │   │   ├── ds_index.py
│   │   │   │   ├── fill_kv_cache.py
│   │   │   │   ├── flashattention.py
│   │   │   │   ├── flatten_kv_cache.py
│   │   │   │   ├── fused_lora.py
│   │   │   │   ├── fused_moe.py
│   │   │   │   ├── fused_moe_ep.py
│   │   │   │   ├── fused_noaux_tc.py
│   │   │   │   ├── gated_delta_rule.py
│   │   │   │   ├── multinomial_sampling.py
│   │   │   │   ├── pagedattention.py
│   │   │   │   ├── rms_norm.py
│   │   │   │   ├── utils.py
│   │   │   │   ├── w8a8_fused_moe.py
│   │   │   │   └── w8a8_triton_kernels.py
│   │   │   ├── default/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── multinomial_sampling.py
│   │   │   │   └── w8a8_kernels.py
│   │   │   ├── dispatcher.py
│   │   │   ├── dlinfer/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── activation.py
│   │   │   │   ├── apply_rotary_pos_emb.py
│   │   │   │   ├── awq_kernels.py
│   │   │   │   ├── fill_kv_cache.py
│   │   │   │   ├── flash_attention.py
│   │   │   │   ├── fused_moe.py
│   │   │   │   ├── fused_rotary_emb.py
│   │   │   │   ├── linear.py
│   │   │   │   ├── moe_gating_topk_softmax.py
│   │   │   │   ├── pagedattention.py
│   │   │   │   ├── rms_norm.py
│   │   │   │   └── w8a8_kernels.py
│   │   │   └── w8a8_triton_kernels.py
│   │   ├── messages.py
│   │   ├── model_inputs.py
│   │   ├── models/
│   │   │   ├── __init__.py
│   │   │   ├── baichuan.py
│   │   │   ├── chatglm2.py
│   │   │   ├── cogvlm.py
│   │   │   ├── deepseek.py
│   │   │   ├── deepseek_mtp.py
│   │   │   ├── deepseek_v2.py
│   │   │   ├── deepseek_v32.py
│   │   │   ├── deepseek_vl2.py
│   │   │   ├── gemma.py
│   │   │   ├── gemma3_vl.py
│   │   │   ├── glm4.py
│   │   │   ├── glm4_1v.py
│   │   │   ├── glm4_moe.py
│   │   │   ├── glm4moe_mtp.py
│   │   │   ├── gpt_oss.py
│   │   │   ├── internlm.py
│   │   │   ├── internlm2.py
│   │   │   ├── internlm2_reward.py
│   │   │   ├── internlm2_ve.py
│   │   │   ├── internlm3.py
│   │   │   ├── interns1_pro.py
│   │   │   ├── interns1_pro_ts.py
│   │   │   ├── internvl.py
│   │   │   ├── internvl3_hf.py
│   │   │   ├── internvl_patch.py
│   │   │   ├── llama.py
│   │   │   ├── llama4.py
│   │   │   ├── llama_eagle.py
│   │   │   ├── llama_eagle3.py
│   │   │   ├── llava.py
│   │   │   ├── minicpm3.py
│   │   │   ├── minicpmv26.py
│   │   │   ├── mistral.py
│   │   │   ├── mixtral.py
│   │   │   ├── module_map.py
│   │   │   ├── patch.py
│   │   │   ├── phi3.py
│   │   │   ├── phi3_moe.py
│   │   │   ├── phi3_v.py
│   │   │   ├── q_modules.py
│   │   │   ├── qwen.py
│   │   │   ├── qwen2.py
│   │   │   ├── qwen2_5_vl.py
│   │   │   ├── qwen2_moe.py
│   │   │   ├── qwen2_reward.py
│   │   │   ├── qwen2_vl.py
│   │   │   ├── qwen3.py
│   │   │   ├── qwen3_5.py
│   │   │   ├── qwen3_5_moe.py
│   │   │   ├── qwen3_moe.py
│   │   │   ├── qwen3_next.py
│   │   │   ├── qwen3_vl.py
│   │   │   ├── qwen3_vl_moe.py
│   │   │   ├── sdar.py
│   │   │   ├── sdar_moe.py
│   │   │   ├── siglip.py
│   │   │   ├── starcoder2.py
│   │   │   ├── utils/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── cudagraph.py
│   │   │   │   ├── micro_batch.py
│   │   │   │   └── model.py
│   │   │   └── whisper.py
│   │   ├── multimodal/
│   │   │   ├── __init__.py
│   │   │   └── data_type.py
│   │   ├── nn/
│   │   │   ├── __init__.py
│   │   │   ├── activation.py
│   │   │   ├── attention.py
│   │   │   ├── embedding.py
│   │   │   ├── eplb.py
│   │   │   ├── gated_delta.py
│   │   │   ├── linear/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── awq.py
│   │   │   │   ├── base.py
│   │   │   │   ├── blocked_fp8.py
│   │   │   │   ├── default.py
│   │   │   │   ├── lora.py
│   │   │   │   ├── utils.py
│   │   │   │   └── w8a8.py
│   │   │   ├── moe/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── base.py
│   │   │   │   ├── blocked_fp8.py
│   │   │   │   ├── default.py
│   │   │   │   ├── route.py
│   │   │   │   └── w8a8.py
│   │   │   ├── multinomial_sampling.py
│   │   │   ├── norm.py
│   │   │   ├── nsa.py
│   │   │   ├── quant_utils.py
│   │   │   ├── rotary_embedding.py
│   │   │   └── utils.py
│   │   ├── paging/
│   │   │   ├── __init__.py
│   │   │   ├── block_manager/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── base_block_manager.py
│   │   │   │   ├── default_block_manager.py
│   │   │   │   └── window_block_manager.py
│   │   │   ├── block_trie.py
│   │   │   ├── eviction_helper/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── base_eviction_helper.py
│   │   │   │   └── recompute_eviction_helper.py
│   │   │   ├── scheduler.py
│   │   │   ├── seq_states/
│   │   │   │   ├── __init__.py
│   │   │   │   └── states.py
│   │   │   └── state_manager.py
│   │   ├── ray.py
│   │   ├── spec_decode/
│   │   │   ├── __init__.py
│   │   │   ├── base.py
│   │   │   ├── proposers/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── base.py
│   │   │   │   ├── deepseek_mtp.py
│   │   │   │   ├── eagle.py
│   │   │   │   └── eagle3.py
│   │   │   ├── reject_sampler.py
│   │   │   └── spec_agent.py
│   │   ├── strategies/
│   │   │   ├── __init__.py
│   │   │   ├── ar/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── cudagraph.py
│   │   │   │   ├── engine.py
│   │   │   │   ├── model_agent.py
│   │   │   │   ├── model_inputs.py
│   │   │   │   ├── sampling.py
│   │   │   │   └── sequence.py
│   │   │   ├── ar_spec/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── cudagraph.py
│   │   │   │   ├── engine.py
│   │   │   │   ├── model_agent.py
│   │   │   │   ├── model_inputs.py
│   │   │   │   ├── sampling.py
│   │   │   │   └── sequence.py
│   │   │   ├── base/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── cudagraph.py
│   │   │   │   ├── engine.py
│   │   │   │   ├── model_agent.py
│   │   │   │   ├── model_inputs.py
│   │   │   │   ├── sampling.py
│   │   │   │   └── sequence.py
│   │   │   └── dllm/
│   │   │       ├── __init__.py
│   │   │       ├── cudagraph.py
│   │   │       ├── engine.py
│   │   │       ├── model_agent.py
│   │   │       ├── model_inputs.py
│   │   │       ├── sampling.py
│   │   │       ├── sequence.py
│   │   │       └── unmasking.py
│   │   ├── third_party/
│   │   │   ├── __init__.py
│   │   │   ├── deep_gemm/
│   │   │   │   └── __init__.py
│   │   │   └── flash_attn_interface.py
│   │   ├── tools/
│   │   │   ├── __init__.py
│   │   │   └── utils.py
│   │   ├── transformers/
│   │   │   ├── __init__.py
│   │   │   └── configuration_deepseek_v32.py
│   │   ├── utils.py
│   │   └── weight_loader/
│   │       ├── __init__.py
│   │       └── model_weight_loader.py
│   ├── serve/
│   │   ├── __init__.py
│   │   ├── core/
│   │   │   ├── __init__.py
│   │   │   ├── async_engine.py
│   │   │   ├── exceptions.py
│   │   │   └── vl_async_engine.py
│   │   ├── managers/
│   │   │   ├── __init__.py
│   │   │   └── session_manager.py
│   │   ├── openai/
│   │   │   ├── __init__.py
│   │   │   ├── api_client.py
│   │   │   ├── api_server.py
│   │   │   ├── harmony_utils.py
│   │   │   ├── launch_server.py
│   │   │   ├── protocol.py
│   │   │   ├── reasoning_parser/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── deepseek_r1_reasoning_parser.py
│   │   │   │   ├── qwen_qwq_reasoning_parser.py
│   │   │   │   └── reasoning_parser.py
│   │   │   ├── serving_chat_completion.py
│   │   │   ├── serving_completion.py
│   │   │   ├── serving_generate.py
│   │   │   └── tool_parser/
│   │   │       ├── __init__.py
│   │   │       ├── internlm2_parser.py
│   │   │       ├── llama3_parser.py
│   │   │       ├── qwen2d5_parser.py
│   │   │       ├── qwen3_parser.py
│   │   │       ├── qwen3coder_parser.py
│   │   │       ├── tool_parser.py
│   │   │       └── utils.py
│   │   ├── processors/
│   │   │   ├── __init__.py
│   │   │   └── multimodal.py
│   │   ├── proxy/
│   │   │   ├── __init__.py
│   │   │   ├── proxy.py
│   │   │   ├── streaming_response.py
│   │   │   └── utils.py
│   │   └── utils/
│   │       ├── __init__.py
│   │       └── server_utils.py
│   ├── tokenizer.py
│   ├── turbomind/
│   │   ├── __init__.py
│   │   ├── deploy/
│   │   │   ├── __init__.py
│   │   │   ├── config.py
│   │   │   ├── converter.py
│   │   │   ├── loader.py
│   │   │   ├── module.py
│   │   │   ├── parameter.py
│   │   │   ├── policy.py
│   │   │   ├── source_model/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── baichuan.py
│   │   │   │   ├── base.py
│   │   │   │   ├── deepseek2.py
│   │   │   │   ├── deepseek_vl.py
│   │   │   │   ├── glm4.py
│   │   │   │   ├── glm4_moe_lite.py
│   │   │   │   ├── gpt_oss.py
│   │   │   │   ├── internlm2.py
│   │   │   │   ├── internvl.py
│   │   │   │   ├── llama.py
│   │   │   │   ├── llava.py
│   │   │   │   ├── minicpmv.py
│   │   │   │   ├── mixtral.py
│   │   │   │   ├── molmo.py
│   │   │   │   ├── qwen.py
│   │   │   │   └── xcomposer2.py
│   │   │   └── target_model/
│   │   │       ├── __init__.py
│   │   │       ├── base.py
│   │   │       └── fp.py
│   │   ├── supported_models.py
│   │   ├── tokenizer_info.py
│   │   └── turbomind.py
│   ├── utils.py
│   ├── version.py
│   └── vl/
│       ├── __init__.py
│       ├── constants.py
│       ├── engine.py
│       ├── media/
│       │   ├── __init__.py
│       │   ├── base.py
│       │   ├── connection.py
│       │   ├── image.py
│       │   ├── time_series.py
│       │   ├── video.py
│       │   └── video_loader.py
│       ├── model/
│       │   ├── __init__.py
│       │   ├── base.py
│       │   ├── builder.py
│       │   ├── cogvlm.py
│       │   ├── deepseek.py
│       │   ├── deepseek_vl2.py
│       │   ├── gemma3_vl.py
│       │   ├── glm4_1v.py
│       │   ├── glm4_v.py
│       │   ├── interns1_pro.py
│       │   ├── internvl.py
│       │   ├── internvl3_hf.py
│       │   ├── internvl_llava.py
│       │   ├── llama4.py
│       │   ├── llava.py
│       │   ├── llava_hf.py
│       │   ├── llava_next.py
│       │   ├── minicpmv.py
│       │   ├── mllama.py
│       │   ├── molmo.py
│       │   ├── phi3_vision.py
│       │   ├── qwen.py
│       │   ├── qwen2.py
│       │   ├── qwen3.py
│       │   ├── qwen3_5.py
│       │   ├── utils.py
│       │   ├── xcomposer2.py
│       │   └── yi.py
│       ├── tools/
│       │   ├── __init__.py
│       │   └── merge_xcomposer2d5_task.py
│       └── utils.py
├── pyproject.toml
├── setup.py
├── src/
│   ├── CMakeLists.txt
│   └── turbomind/
│       ├── CMakeLists.txt
│       ├── comm/
│       │   ├── CMakeLists.txt
│       │   ├── barrier.h
│       │   ├── cuda_ipc/
│       │   │   ├── CMakeLists.txt
│       │   │   ├── allgather.cu
│       │   │   ├── allreduce.cu
│       │   │   ├── bootstrap.h
│       │   │   ├── broadcast.cu
│       │   │   ├── common.h
│       │   │   ├── cuda_ipc_comm.cu
│       │   │   ├── cuda_ipc_comm.h
│       │   │   ├── fused_allreduce.cu
│       │   │   ├── fused_allreduce_ex.cu
│       │   │   ├── group_sum.h
│       │   │   ├── mscclpp.h
│       │   │   ├── multimem.cuh
│       │   │   ├── semaphore.cuh
│       │   │   └── semaphore.h
│       │   ├── device_comm.cc
│       │   ├── device_comm.h
│       │   ├── env.h
│       │   ├── gloo/
│       │   │   ├── CMakeLists.txt
│       │   │   ├── gloo_comm.cc
│       │   │   ├── hybrid_comm.cc
│       │   │   ├── tcp_store.cc
│       │   │   ├── tcp_store.h
│       │   │   └── test_ipc_comm.cc
│       │   ├── host_comm.cc
│       │   ├── host_comm.h
│       │   ├── nccl/
│       │   │   ├── CMakeLists.txt
│       │   │   └── nccl.cu
│       │   ├── test_comm.cu
│       │   ├── test_host_comm.cc
│       │   └── thread_comm.cc
│       ├── core/
│       │   ├── CMakeLists.txt
│       │   ├── allocator.cc
│       │   ├── allocator.h
│       │   ├── buffer.cc
│       │   ├── buffer.h
│       │   ├── check.cc
│       │   ├── check.h
│       │   ├── common.h
│       │   ├── context.cc
│       │   ├── context.h
│       │   ├── copy.cc
│       │   ├── copy.h
│       │   ├── core.h
│       │   ├── cuda_data_type.h
│       │   ├── data_type.h
│       │   ├── interval.h
│       │   ├── layout.cc
│       │   ├── layout.h
│       │   ├── module.cc
│       │   ├── module.h
│       │   ├── ranges.h
│       │   ├── serdes.h
│       │   ├── state.h
│       │   ├── stream.cc
│       │   ├── stream.h
│       │   ├── tensor.cc
│       │   ├── tensor.cu
│       │   ├── tensor.h
│       │   └── test_core.cc
│       ├── engine/
│       │   ├── CMakeLists.txt
│       │   ├── batch.h
│       │   ├── engine.cc
│       │   ├── engine.h
│       │   ├── gateway.cc
│       │   ├── gateway.h
│       │   ├── model_executor.cc
│       │   ├── model_executor.h
│       │   ├── model_request.cc
│       │   ├── model_request.h
│       │   ├── queue.h
│       │   ├── request.cc
│       │   ├── request.h
│       │   ├── request_queue.cc
│       │   ├── request_queue.h
│       │   └── signal_buffer.h
│       ├── generation/
│       │   ├── CMakeLists.txt
│       │   ├── base_param.h
│       │   ├── generation.cc
│       │   ├── generation.h
│       │   ├── guided_decoding.cc
│       │   ├── guided_decoding.h
│       │   ├── logits_processor.cc
│       │   ├── logits_processor.h
│       │   ├── sampling.cc
│       │   ├── sampling.h
│       │   ├── stop_criteria.cc
│       │   ├── stop_criteria.h
│       │   └── utils.h
│       ├── kernels/
│       │   ├── CMakeLists.txt
│       │   ├── activation.cu
│       │   ├── activation.h
│       │   ├── activation_kernels.cu
│       │   ├── activation_kernels.h
│       │   ├── apply_token_bitmask_inplace_cuda.cu
│       │   ├── apply_token_bitmask_inplace_cuda.h
│       │   ├── attention/
│       │   │   ├── CMakeLists.txt
│       │   │   ├── arch.h
│       │   │   ├── attention.cu
│       │   │   ├── attention.h
│       │   │   ├── attention_params.h
│       │   │   ├── attention_template.h
│       │   │   ├── attention_universal.h
│       │   │   ├── block.h
│       │   │   ├── block_iterator.h
│       │   │   ├── cp_utils.cu
│       │   │   ├── cp_utils.h
│       │   │   ├── cta_map.h
│       │   │   ├── decoding.cu
│       │   │   ├── decoding.h
│       │   │   ├── decoding_template.h
│       │   │   ├── desc.h
│       │   │   ├── impl.h
│       │   │   ├── impl_16816.h
│       │   │   ├── impl_1688.h
│       │   │   ├── impl_81616.h
│       │   │   ├── impl_884.h
│       │   │   ├── impl_m16n8.h
│       │   │   ├── impl_simt.h
│       │   │   ├── iterator.h
│       │   │   ├── iterator_sm70.h
│       │   │   ├── iterator_sm80.h
│       │   │   ├── kernel/
│       │   │   │   ├── CMakeLists.txt
│       │   │   │   ├── attention_sm70_128.cu
│       │   │   │   ├── attention_sm70_256.cu
│       │   │   │   ├── attention_sm70_576.cu
│       │   │   │   ├── attention_sm70_64.cu
│       │   │   │   ├── attention_sm75_128.cu
│       │   │   │   ├── attention_sm75_256.cu
│       │   │   │   ├── attention_sm75_576.cu
│       │   │   │   ├── attention_sm75_64.cu
│       │   │   │   ├── attention_sm80_128.cu
│       │   │   │   ├── attention_sm80_192.cu
│       │   │   │   ├── attention_sm80_256.cu
│       │   │   │   ├── attention_sm80_576.cu
│       │   │   │   ├── attention_sm80_64.cu
│       │   │   │   ├── decoding_sm70_128.cu
│       │   │   │   ├── decoding_sm70_256.cu
│       │   │   │   ├── decoding_sm70_576.cu
│       │   │   │   ├── decoding_sm70_64.cu
│       │   │   │   ├── decoding_sm75_128.cu
│       │   │   │   ├── decoding_sm75_256.cu
│       │   │   │   ├── decoding_sm75_576.cu
│       │   │   │   ├── decoding_sm75_64.cu
│       │   │   │   ├── decoding_sm80_128.cu
│       │   │   │   ├── decoding_sm80_192.cu
│       │   │   │   ├── decoding_sm80_256.cu
│       │   │   │   ├── decoding_sm80_576.cu
│       │   │   │   └── decoding_sm80_64.cu
│       │   │   ├── kernel.h
│       │   │   ├── kernel_impl.h
│       │   │   ├── kv_cache_utils_v2.cu
│       │   │   ├── kv_cache_utils_v2.h
│       │   │   ├── linear_iterator.h
│       │   │   ├── mainloop.h
│       │   │   ├── mainloop_sm70.h
│       │   │   ├── mainloop_sm80.h
│       │   │   ├── quantization.h
│       │   │   ├── reduce.cu
│       │   │   ├── reduce.h
│       │   │   ├── reference.cu
│       │   │   ├── reference.h
│       │   │   ├── registrar.h
│       │   │   ├── registry.cu
│       │   │   ├── registry.h
│       │   │   ├── rotary_embedding.h
│       │   │   ├── test_attention.cu
│       │   │   ├── test_quant.cu
│       │   │   ├── test_utils.cu
│       │   │   ├── test_utils.h
│       │   │   ├── utils.cc
│       │   │   └── utils.h
│       │   ├── ban_bad_words.cu
│       │   ├── ban_bad_words.h
│       │   ├── core/
│       │   │   ├── array.h
│       │   │   ├── array_ops.h
│       │   │   ├── common.h
│       │   │   ├── data_type.h
│       │   │   ├── floating_point.h
│       │   │   ├── layout.h
│       │   │   ├── math.h
│       │   │   ├── meta.h
│       │   │   ├── mma.h
│       │   │   ├── pipe_iter.h
│       │   │   ├── smem.h
│       │   │   ├── sub_byte_ptr.h
│       │   │   ├── sync.h
│       │   │   └── thread_map.h
│       │   ├── decoding_kernels.cu
│       │   ├── decoding_kernels.h
│       │   ├── gemm/
│       │   │   ├── CMakeLists.txt
│       │   │   ├── arch/
│       │   │   │   ├── config_simt.h
│       │   │   │   ├── config_sm70_s884.h
│       │   │   │   ├── config_sm75_s16816.h
│       │   │   │   ├── config_sm80_s16816.h
│       │   │   │   ├── mma_simt.h
│       │   │   │   ├── mma_sm70.h
│       │   │   │   ├── mma_sm80.h
│       │   │   │   ├── operand_simt.h
│       │   │   │   ├── operand_sm70_s884.h
│       │   │   │   ├── operand_sm80_s16816.h
│       │   │   │   ├── smem_copy_simt.h
│       │   │   │   ├── smem_copy_sm70.h
│       │   │   │   └── smem_copy_sm80.h
│       │   │   ├── arch.h
│       │   │   ├── cast.cu
│       │   │   ├── cast.h
│       │   │   ├── context.cu
│       │   │   ├── context.h
│       │   │   ├── convert.cuh
│       │   │   ├── convert.h
│       │   │   ├── convert_v3.cu
│       │   │   ├── cp_async.h
│       │   │   ├── cta_map.h
│       │   │   ├── cublas.cu
│       │   │   ├── desc.h
│       │   │   ├── dispatch_cache.cu
│       │   │   ├── dispatch_cache.h
│       │   │   ├── epilogue.h
│       │   │   ├── format.h
│       │   │   ├── gemm.cu
│       │   │   ├── gemm.h
│       │   │   ├── gemm_universal.h
│       │   │   ├── gemm_universal_sm90.h
│       │   │   ├── gemm_universal_sm90_v2.h
│       │   │   ├── gemm_universal_sm90_v3.h
│       │   │   ├── gemm_universal_sm90_v4.h
│       │   │   ├── gemm_universal_sm90_v5.h
│       │   │   ├── gpu_metric.cu
│       │   │   ├── gpu_metric.h
│       │   │   ├── iterator.h
│       │   │   ├── iterator_sm70.h
│       │   │   ├── iterator_sm80.h
│       │   │   ├── iterator_sm90.h
│       │   │   ├── kernel/
│       │   │   │   ├── sm70_884_16.cu
│       │   │   │   ├── sm70_884_4.cu
│       │   │   │   ├── sm70_884_8.cu
│       │   │   │   ├── sm75_16816_16.cu
│       │   │   │   ├── sm75_16816_4.cu
│       │   │   │   ├── sm75_16816_8.cu
│       │   │   │   ├── sm80_16816_16.cu
│       │   │   │   ├── sm80_16816_4.cu
│       │   │   │   ├── sm80_16816_8.cu
│       │   │   │   ├── sm90_16816_16.cu
│       │   │   │   ├── sm90_16816_4.cu
│       │   │   │   ├── sm90_16816_8.cu
│       │   │   │   └── sm90_64n32_8.cu
│       │   │   ├── kernel.cu
│       │   │   ├── kernel.h
│       │   │   ├── kernel_impl.h
│       │   │   ├── kernel_impl_sm90.h
│       │   │   ├── mainloop_sm70.h
│       │   │   ├── mainloop_sm80_v2.h
│       │   │   ├── matrix_ptr.h
│       │   │   ├── moe_utils_v2.cu
│       │   │   ├── moe_utils_v2.h
│       │   │   ├── operand.h
│       │   │   ├── predicate.h
│       │   │   ├── registry.cu
│       │   │   ├── registry.h
│       │   │   ├── scaled_gmma_fp8_sm90.h
│       │   │   ├── scheduler.cuh
│       │   │   ├── scheduler_sm70.cuh
│       │   │   ├── simt.h
│       │   │   ├── sm90_utils.h
│       │   │   ├── smem_copy.h
│       │   │   ├── test/
│       │   │   │   ├── gemm_bench.cu
│       │   │   │   ├── models.h
│       │   │   │   ├── quantization.cu
│       │   │   │   ├── quantization.h
│       │   │   │   ├── quantization_impl.h
│       │   │   │   ├── reference.cu
│       │   │   │   ├── reference.h
│       │   │   │   ├── test_gemm_v2.cc
│       │   │   │   ├── test_moe_utils.cu
│       │   │   │   ├── test_utils.cu
│       │   │   │   ├── test_utils.h
│       │   │   │   └── testbed_v3.h
│       │   │   ├── thread_group_map.h
│       │   │   ├── thread_map.h
│       │   │   ├── tiled_mma.h
│       │   │   ├── tma.cu
│       │   │   ├── tma.h
│       │   │   ├── transform.h
│       │   │   ├── tuner/
│       │   │   │   ├── cache_utils.cu
│       │   │   │   ├── cache_utils.h
│       │   │   │   ├── measurer.cu
│       │   │   │   ├── measurer.h
│       │   │   │   ├── params.cc
│       │   │   │   ├── params.h
│       │   │   │   ├── sampler.cu
│       │   │   │   ├── sampler.h
│       │   │   │   ├── stats.h
│       │   │   │   ├── stopping_criterion.cc
│       │   │   │   └── stopping_criterion.h
│       │   │   ├── types.h
│       │   │   ├── unpack.cu
│       │   │   └── utils.h
│       │   ├── gpt_kernels.cu
│       │   ├── gpt_kernels.h
│       │   ├── logprob_kernels.cu
│       │   ├── logprob_kernels.h
│       │   ├── norm/
│       │   │   ├── CMakeLists.txt
│       │   │   ├── rms_norm.cu
│       │   │   └── rms_norm.h
│       │   ├── penalty_types.h
│       │   ├── quantization.cu
│       │   ├── quantization.cuh
│       │   ├── quantization.h
│       │   ├── reduce_kernel_utils.cuh
│       │   ├── sampling_kernels.cu
│       │   ├── sampling_kernels.h
│       │   ├── sampling_penalty_kernels.cu
│       │   ├── sampling_penalty_kernels.h
│       │   ├── sampling_topk_kernels.cu
│       │   ├── sampling_topk_kernels.h
│       │   ├── sampling_topp_kernels.cu
│       │   ├── sampling_topp_kernels.h
│       │   ├── stop_criteria_kernels.cu
│       │   ├── stop_criteria_kernels.h
│       │   ├── test_quantization.cc
│       │   ├── unfused_attention_kernels.cu
│       │   └── unfused_attention_kernels.h
│       ├── macro.h
│       ├── models/
│       │   ├── CMakeLists.txt
│       │   ├── input_processor.cc
│       │   ├── input_processor.h
│       │   ├── language_model.cc
│       │   ├── language_model.h
│       │   ├── llama/
│       │   │   ├── Barrier.h
│       │   │   ├── BlockManager.cc
│       │   │   ├── BlockManager.h
│       │   │   ├── BlockTrie.cc
│       │   │   ├── BlockTrie.h
│       │   │   ├── CMakeLists.txt
│       │   │   ├── GatedDeltaNetLayer.cc
│       │   │   ├── GatedDeltaNetLayer.h
│       │   │   ├── GatedDeltaNetWeight.cc
│       │   │   ├── GatedDeltaNetWeight.h
│       │   │   ├── LlamaDecoderLayerWeight.cc
│       │   │   ├── LlamaDecoderLayerWeight.h
│       │   │   ├── LlamaDenseWeight.cc
│       │   │   ├── LlamaDenseWeight.h
│       │   │   ├── LlamaFfnLayer.cc
│       │   │   ├── LlamaFfnLayer.h
│       │   │   ├── LlamaLinear.cu
│       │   │   ├── LlamaLinear.h
│       │   │   ├── LlamaWeight.cc
│       │   │   ├── LlamaWeight.h
│       │   │   ├── SequenceManager.cc
│       │   │   ├── SequenceManager.h
│       │   │   ├── bench_conv1d_silu.cc
│       │   │   ├── bench_gated_delta_net.cc
│       │   │   ├── context.h
│       │   │   ├── gated_delta_net_kernels.cu
│       │   │   ├── gated_delta_net_kernels.h
│       │   │   ├── llama_kernels.cu
│       │   │   ├── llama_kernels.h
│       │   │   ├── llama_params.h
│       │   │   ├── llama_rope.h
│       │   │   ├── llama_utils.cu
│       │   │   ├── llama_utils.h
│       │   │   ├── mla_utils.cu
│       │   │   ├── mla_utils.h
│       │   │   ├── moe_ffn_layer.cc
│       │   │   ├── moe_ffn_layer.h
│       │   │   ├── test_cache_manager.cc
│       │   │   ├── unified_attention_layer.cc
│       │   │   ├── unified_attention_layer.h
│       │   │   ├── unified_decoder.cc
│       │   │   └── unified_decoder.h
│       │   ├── output_processor.cc
│       │   └── output_processor.h
│       ├── python/
│       │   ├── CMakeLists.txt
│       │   ├── bind.cpp
│       │   ├── dlpack.h
│       │   └── xgrammar_bind.cpp
│       ├── turbomind.cc
│       ├── turbomind.h
│       └── utils/
│           ├── CMakeLists.txt
│           ├── anomaly_handler.cu
│           ├── anomaly_handler.h
│           ├── constant.h
│           ├── cuda_bf16_fallbacks.cuh
│           ├── cuda_bf16_wrapper.h
│           ├── cuda_type_utils.cuh
│           ├── cuda_utils.cc
│           ├── cuda_utils.h
│           ├── debug_utils.h
│           ├── dispatch.h
│           ├── logger.cc
│           ├── logger.h
│           ├── memory_utils.cu
│           ├── memory_utils.h
│           ├── metrics.h
│           ├── monotonic.h
│           ├── nvtx_utils.cc
│           ├── nvtx_utils.h
│           ├── parser.cc
│           ├── parser.h
│           ├── string_utils.h
│           └── test_utils.h
└── tests/
    ├── csrc/
    │   ├── CMakeLists.txt
    │   └── unittests/
    │       ├── CMakeLists.txt
    │       ├── gtest_utils.h
    │       ├── test_logprob_kernels.cu
    │       ├── test_penalty_kernels.cu
    │       ├── test_sampling_kernels.cu
    │       ├── test_sampling_layer.cu
    │       └── unittest_utils.h
    ├── pytorch/
    │   ├── config/
    │   │   └── test_hf_overrides.py
    │   ├── engine/
    │   │   ├── test_logits_process.py
    │   │   ├── test_request.py
    │   │   └── test_zmq_rpc.py
    │   ├── kernel/
    │   │   ├── test_activation.py
    │   │   ├── test_apply_rotary.py
    │   │   ├── test_bitonic_topk.py
    │   │   ├── test_causal_conv1d.py
    │   │   ├── test_ds_index.py
    │   │   ├── test_fill_kv_cache.py
    │   │   ├── test_flash_attention.py
    │   │   ├── test_flatten_kv_cache.py
    │   │   ├── test_fuse_moe_blocked_fp8.py
    │   │   ├── test_fused_lora.py
    │   │   ├── test_fused_moe.py
    │   │   ├── test_gated_delta_rule.py
    │   │   ├── test_gemm_fp8.py
    │   │   ├── test_moe_route.py
    │   │   ├── test_multinomial_sampling.py
    │   │   ├── test_paged_attention.py
    │   │   └── test_rms_norm.py
    │   ├── nn/
    │   │   └── test_embedding.py
    │   └── paging/
    │       ├── test_block_manager.py
    │       ├── test_block_trie.py
    │       └── test_scheduler.py
    └── test_lmdeploy/
        ├── test_auto_backend.py
        ├── test_content_merge.py
        ├── test_grammar.py
        ├── test_harmony_gpt_oss_parser.py
        ├── test_lite/
        │   └── test_quantization/
        │       └── test_utils/
        │           └── test_cal_qparams.py
        ├── test_messages.py
        ├── test_model.py
        ├── test_pipeline.py
        ├── test_qwen3_parser.py
        ├── test_qwen3coder_parser.py
        ├── test_tokenizer.py
        ├── test_turbomind/
        │   └── test_converter.py
        ├── test_utils.py
        └── test_vl/
            ├── test_hf_chat_template.py
            ├── test_nonhf_chat_template.py
            ├── test_qwen3vl_processor.py
            └── test_vl_encode.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .clang-format
================================================
Language: Cpp
AccessModifierOffset: -4
AlignAfterOpenBracket: Align
AllowShortEnumsOnASingleLine: false
AlignConsecutiveAssignments: true
AlignConsecutiveDeclarations: true
AlignEscapedNewlines: Right
AlignOperands: true
AlignTrailingComments: true
AllowAllParametersOfDeclarationOnNextLine: true
AllowAllArgumentsOnNextLine: true
AllowShortBlocksOnASingleLine: Empty
AllowShortCaseLabelsOnASingleLine: false
AllowShortFunctionsOnASingleLine: Empty
AllowShortIfStatementsOnASingleLine: Never
AllowShortLoopsOnASingleLine: false
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: false
AlwaysBreakTemplateDeclarations: true
BinPackArguments: false
BinPackParameters: false
BreakBeforeBinaryOperators: NonAssignment
BreakBeforeBraces: Stroustrup
BreakBeforeTernaryOperators: false
BreakConstructorInitializers: AfterColon
BreakInheritanceList: AfterColon
BreakStringLiterals: false
ColumnLimit: 120
CompactNamespaces: false
ConstructorInitializerAllOnOneLineOrOnePerLine: true
ConstructorInitializerIndentWidth: 4
ContinuationIndentWidth: 4
Cpp11BracedListStyle: true
DerivePointerAlignment: false
FixNamespaceComments: true
IndentCaseLabels: true
IndentPPDirectives: None
IndentWidth: 4
IndentWrappedFunctionNames: false
KeepEmptyLinesAtTheStartOfBlocks: true
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
PointerAlignment: Left
ReflowComments: true
SortIncludes: true
SortUsingDeclarations: false
SpaceAfterCStyleCast: false
SpaceAfterTemplateKeyword: false
SpaceBeforeAssignmentOperators: true
SpaceBeforeCtorInitializerColon: false
SpaceBeforeInheritanceColon: false
SpaceBeforeParens: ControlStatements
SpaceInEmptyParentheses: false
SpacesBeforeTrailingComments: 2
SpacesInAngles: false
SpacesInCStyleCastParentheses: false
SpacesInContainerLiterals: false
SpacesInParentheses: false
SpacesInSquareBrackets: false
Standard: c++17
TabWidth: 4
UseTab: Never


================================================
FILE: .claude/skills/check-env/SKILL.md
================================================
---
name: check-env
description: Check if the LMDeploy dev environment is properly set up.
---

# Check LMDeploy Dev Environment

## 1. Find and activate the conda env

```bash
conda env list                        # starred = currently active
conda activate <env-name>             # pick the right env for this project
```

## 2. Verify editable install

```bash
python -c "import lmdeploy; print(lmdeploy.__file__)"
# Must point into the repo dir, e.g. /path/to/lmdeploy_vl/lmdeploy/__init__.py
```

If it doesn't:

```bash
pip install -e .                      # run from repo root
```
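
The path check in step 2 can also be scripted. The following is a minimal sketch, not part of the repository; the helper name `is_editable_install` and the example paths are assumptions made for illustration.

```python
from pathlib import Path


def is_editable_install(module_file, repo_root):
    """Return True if module_file lies inside repo_root (hypothetical helper)."""
    module_path = Path(module_file).resolve()
    root_path = Path(repo_root).resolve()
    try:
        # relative_to raises ValueError when module_path is not under root_path
        module_path.relative_to(root_path)
    except ValueError:
        return False
    return True
```

In practice one would pass `lmdeploy.__file__` as the first argument and the repo checkout directory as the second.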

## 3. Confirm python and CUDA

```bash
which python                          # must show conda env path, not /usr/bin/python
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.device_count())"
```

## Troubleshooting

| Problem              | Fix                                             |
| -------------------- | ----------------------------------------------- |
| `conda: not found`   | `source ~/miniconda3/etc/profile.d/conda.sh`    |
| Wrong Python         | `conda deactivate && conda activate <env-name>` |
| `lmdeploy` not found | `pip install -e .` from repo root               |


================================================
FILE: .claude/skills/code-navigation/SKILL.md
================================================
---
name: code-navigation
description: LMDeploy codebase directory map for fast orientation.
---

# LMDeploy Project Structure

```text
lmdeploy/
├── cli/                        # Command line interface implementations
├── lib/                        # Shared libraries/binary assets
├── lite/                       # Quantization Toolkit
│   ├── apis/                   # Calibration, AWQ, and SmoothQuant entry points
│   ├── modeling/               # GPTQ/quantized model specific logic
│   ├── quantization/           # Scaling calculation (activations/weights)
│   └── utils/                  # Quantization helper functions (cal_qparams.py)
├── metrics/                    # Statistics and performance monitoring
├── monitoring/                 # Monitoring configs (Docker/Grafana)
├── pytorch/                    # PyTorch inference backend
│   ├── adapter/                # LoRA and adapter logic
│   ├── backends/               # Kernel/Operator Dispatchers (FP8, AWQ, CUDA)
│   ├── check_env/              # Environment/GPU capability sanity checks
│   ├── configurations/         # Per-model engine configurations (Llama, etc.)
│   ├── devices/                # Device management (CUDA)
│   ├── disagg/                 # Disaggregated prefill/decode logic
│   ├── engine/                 # Main Scheduler and Execution Loop
│   ├── kernels/                # Triton/CUDA Kernels (w8a8_triton_kernels.py)
│   ├── models/                 # Model Patches: Replacing HF layers with kernels
│   ├── multimodal/             # Multi-modal input types for PyTorch engine
│   ├── nn/                     # Reusable PyTorch modules
│   ├── paging/                 # PagedAttention: KV cache block management
│   ├── spec_decode/            # Speculative decoding logic
│   ├── strategies/             # Execution and dispatch strategies
│   ├── third_party/            # External dependencies/repos
│   ├── tools/                  # Internal engine debugging tools
│   ├── transformers/           # HF Transformers integration
│   └── weight_loader/          # Sharded/quantized weight loading engine
├── serve/                      # Serving: OpenAI-compatible API and gRPC
├── turbomind/                  # C++ TurboMind inference backend
├── vl/                         # Vision-Language (VL) Support and Image Processing
│   ├── media/                  # Image/Video/... loaders and base classes
│   └── model/                  # VL Archs (InternVL, Qwen-VL, LLaVA, etc.) and preprocessing
├── api.py                      # High-level entry for model interaction
├── archs.py                    # Registry: Maps architectures to runtime patches
├── messages.py                 # Core Types: GenerationConfig, EngineConfig
├── model.py                    # Chat Templates: CRITICAL for conversation logic
├── pipeline.py                 # Main Orchestrator: Engine + Tokenizer
└── tokenizer.py                # Wrapper for HF/SentencePiece tokenizers
```


================================================
FILE: .claude/skills/resolve-review/SKILL.md
================================================
---
name: resolve-review
description: Fetch and resolve PR review comments, then push fixes.
---

# Resolve PR Review Comments

## 1. Fetch comments

```bash
gh api repos/InternLM/lmdeploy/pulls/<PR>/comments \
  | python3 -c "
import json, sys
for c in json.load(sys.stdin):
    print(f'[{c[\"path\"]}:{c.get(\"line\",\"?\")}]')
    print(c['body'])
    print()
"
```
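
The inline formatter above can also be kept as a standalone function, which is easier to test. This is a sketch only: the function name `format_review_comments` is hypothetical, and it assumes the same `path`, `line`, and `body` fields that the one-liner reads. Unlike the one-liner, it also prints `?` when `line` is present but null.

```python
import json


def format_review_comments(raw_json):
    """Format review comments as '[path:line]' headers followed by the body."""
    blocks = []
    for comment in json.loads(raw_json):
        # Fall back to '?' when the line is missing or null
        line = comment.get("line") or "?"
        blocks.append(f"[{comment['path']}:{line}]\n{comment['body']}\n")
    return "\n".join(blocks)
```

The JSON text produced by the `gh api` call would be passed in as `raw_json`, for example by reading `sys.stdin`.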

## 2. Fix each issue

Read the flagged file, understand the comment, edit the file.

## 3. Lint

```bash
pre-commit run --all-files
```

## 4. Stage & commit

```bash
git add <fixed files>
git commit -m "fix: address PR review comments"
```

## 5. Push

```bash
git push
```


================================================
FILE: .claude/skills/submit-pr/SKILL.md
================================================
---
name: submit-pr
description: Submit a GitHub pull request for LMDeploy.
---

# Submit a PR for LMDeploy

## 1. Create branch (off main)

Skip this step if already on a feature branch.

```bash
git checkout main && git pull
git checkout -b <type>/<short-description>   # e.g. feat/qwen3-omni
```

## 2. Lint

```bash
pre-commit run --all-files
```

## 3. Stage

```bash
git add lmdeploy/path/to/changed_file.py     # specific files only, never git add .
git status                                   # verify staged set
```

## 4. Commit

```bash
git commit -m "feat: add Qwen3-Omni support"
# Conventional prefixes: feat | fix | refactor | docs | test | chore
```
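
A commit subject can be checked against the prefixes listed above before committing. The sketch below is illustrative and not part of the repository's tooling; the helper name and the exact pattern (`<prefix>: <text>`) are assumptions.

```python
import re

# Prefixes taken from the list above
CONVENTIONAL_PREFIXES = ("feat", "fix", "refactor", "docs", "test", "chore")
_SUBJECT_RE = re.compile(r"^(%s): \S.*$" % "|".join(CONVENTIONAL_PREFIXES))


def has_conventional_prefix(subject):
    """Return True if the subject looks like '<prefix>: <description>'."""
    return _SUBJECT_RE.match(subject) is not None
```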

## 5. Push

```bash
git push -u origin <branch>
```

## 6. Create PR

```bash
gh pr create --title "<type>: <short description>" --body "$(cat <<'EOF'
## Summary
- <bullet 1>
- <bullet 2>

## Test plan
- [ ] `pre-commit run --all-files` passes
- [ ] unit tests pass: `pytest tests/test_lmdeploy/`
- [ ] manual smoke test with pipeline

🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"
```


================================================
FILE: .claude/skills/support-new-model/SKILL.md
================================================
---
name: support-new-model
description: Add a new LLM or VLM to LMDeploy's PyTorch backend.
---

# Tutorial: Adding a New Model to LMDeploy (PyTorch Backend)

This guide walks through adding a new LLM or VLM to LMDeploy's PyTorch backend.

______________________________________________________________________

## Before Writing Any Code

**Study the reference implementations before touching any files.**

1. Read the HF model's `config.json` to understand: `model_type`, `architectures`, layer counts, hidden dims, number of attention heads, MoE parameters (if applicable).
2. Identify which category the model falls into:
   - **LLM only** — pure text model
   - **VLM** — text + vision (needs an additional preprocessor in `vl/model/`)
3. Find the closest existing model in LMDeploy and read it thoroughly:

| Reference model        | File(s)                                                                                                                                   |
| ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| LLM (dense)            | `lmdeploy/pytorch/models/qwen3.py`                                                                                                        |
| LLM (MoE)              | `lmdeploy/pytorch/models/qwen3_moe.py`                                                                                                    |
| VLM preprocessor       | `lmdeploy/vl/model/qwen3.py`                                                                                                              |
| VLM (composite config) | `lmdeploy/pytorch/models/qwen3_omni_moe_thinker.py` + `lmdeploy/pytorch/configurations/qwen3_omni.py` + `lmdeploy/vl/model/qwen3_omni.py` |

______________________________________________________________________

## Key Files Quick Reference

| File                                         | Purpose                                                         |
| -------------------------------------------- | --------------------------------------------------------------- |
| `lmdeploy/pytorch/models/<model>.py`         | Attention, MLP, DecoderLayer, Model, ForCausalLM                |
| `lmdeploy/pytorch/models/module_map.py`      | HF class name → LMDeploy class path mapping                     |
| `lmdeploy/pytorch/configurations/<model>.py` | Config builder — only needed for non-standard/nested HF configs |
| `lmdeploy/vl/model/<model>.py`               | VLM: image/video preprocessing *(VLM only)*                     |
| `lmdeploy/vl/model/base.py`                  | `VisionModel` base class + `VISION_MODELS` registry             |
| `lmdeploy/archs.py`                          | VLM: arch name → task mapping *(VLM only)*                      |
| `lmdeploy/lite/apis/calibrate.py`            | Quantization: layer/norm/head mappings *(optional)*             |
| `lmdeploy/lite/quantization/awq.py`          | Quantization: AWQ scale mappings *(optional)*                   |

______________________________________________________________________

## Step-by-Step: LLM (PyTorch Backend)

### Step 1 — Create the PyTorch model file

**File:** `lmdeploy/pytorch/models/<model_name>.py`

Implement the following class hierarchy (innermost → outermost):

1. **`<Model>Attention`** — QKV projection, rotary embedding, attention forward
2. **`<Model>MLP`** — gate-up linear, activation, down projection
3. **`<Model>DecoderLayer`** — wraps Attention + MLP with layer norms and residual connections
4. **`<Model>Model`** — embedding table, all decoder layers, final norm, rotary embedding
5. **`<Model>ForCausalLM`** — top-level class; inherits `nn.Module`, `DeployModelMixinV1`, `CudaGraphMixin`

**Required imports:**

```python
import torch
import torch.nn as nn
from lmdeploy.pytorch.model_inputs import StepContext, StepContextManager
from lmdeploy.pytorch.nn import (ApplyRotaryEmb, Attention, RMSNorm, SiluAndMul,
                                  build_rotary_embedding_from_config)
from lmdeploy.pytorch.nn.linear import (build_down_linear, build_gateup_linear,
                                         build_o_proj, build_qkv_proj)
from lmdeploy.pytorch.weight_loader.model_weight_loader import load_weight
from .patch import add_prefix
from .utils.cudagraph import CudaGraphMixin
from .utils.model import DeployModelMixinV1, build_embedding
```

**Attention skeleton:**

```python
class MyModelAttention(nn.Module):
    def __init__(self, config, dtype=None, device=None, prefix=''):
        super().__init__()
        self.qkv_proj = build_qkv_proj(
            config.hidden_size,
            num_q_heads=config.num_attention_heads,
            num_kv_heads=config.num_key_value_heads,
            head_size=config.hidden_size // config.num_attention_heads,
            bias=False,
            dtype=dtype, device=device, prefix=add_prefix('qkv_proj', prefix))
        self.apply_rotary_pos_emb = ApplyRotaryEmb()
        self.attn_fwd = Attention(
            config.num_attention_heads,
            config.hidden_size // config.num_attention_heads,
            num_kv_heads=config.num_key_value_heads)
        self.o_proj = build_o_proj(
            config.num_attention_heads,
            config.hidden_size // config.num_attention_heads,
            config.hidden_size,
            bias=False,
            dtype=dtype, device=device, prefix=add_prefix('o_proj', prefix))

    def forward(self, hidden_states, rotary_pos_emb, past_key_value, attn_metadata):
        qkv_states = self.qkv_proj(hidden_states)
        # split q, k, v; apply rotary; call attn_fwd; project output
        ...
```
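
The "split q, k, v" step depends on how wide each slice of the packed projection is. The sketch below works through that arithmetic in plain Python. It assumes the packed output is laid out as q, then k, then v along the last dimension; this layout should be checked against a reference model such as `qwen3.py`. The function names are hypothetical.

```python
def qkv_split_sizes(hidden_size, num_q_heads, num_kv_heads):
    """Return (q_size, k_size, v_size) for a packed QKV projection output."""
    head_size = hidden_size // num_q_heads
    q_size = num_q_heads * head_size
    kv_size = num_kv_heads * head_size  # k and v share the KV head count
    return q_size, kv_size, kv_size


def split_qkv(row, sizes):
    """Slice one packed row into q, k, v (assumed q|k|v layout)."""
    q_size, k_size, v_size = sizes
    q = row[:q_size]
    k = row[q_size:q_size + k_size]
    v = row[q_size + k_size:q_size + k_size + v_size]
    return q, k, v
```

With grouped-query attention (`num_kv_heads < num_q_heads`) the k and v slices are narrower than the q slice, which is why both head counts are passed to `build_qkv_proj`.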

**MLP skeleton:**

```python
class MyModelMLP(nn.Module):
    def __init__(self, config, dtype=None, device=None, prefix=''):
        super().__init__()
        self.gate_up_proj = build_gateup_linear(
            config.hidden_size, config.intermediate_size,
            bias=False, dtype=dtype, device=device,
            prefix=add_prefix('gate_up_proj', prefix))
        self.down_proj = build_down_linear(
            config.intermediate_size, config.hidden_size,
            bias=False, dtype=dtype, device=device,
            prefix=add_prefix('down_proj', prefix))
        self.act_fn = SiluAndMul()

    def forward(self, x):
        return self.down_proj(self.act_fn(self.gate_up_proj(x)))
```
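
To make the role of `act_fn` concrete, here is a pure-Python sketch of a fused SiLU-and-multiply over a single row. It assumes the gate half comes first and the up half second in the `gate_up_proj` output; that ordering is an assumption here and not confirmed by this document. The function names are hypothetical.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_and_mul(row):
    """Split row into (gate, up) halves and return silu(gate) * up."""
    if len(row) % 2 != 0:
        raise ValueError("row length must be even")
    half = len(row) // 2
    gate, up = row[:half], row[half:]
    return [silu(g) * u for g, u in zip(gate, up)]
```

The output is half as wide as the input, which matches `down_proj` taking `intermediate_size` inputs.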

**ForCausalLM skeleton (critical fields):**

```python
class MyModelForCausalLM(nn.Module, DeployModelMixinV1, CudaGraphMixin):
    # Maps packed param name → list of original HF param suffixes
    packed_modules_mapping = {
        'qkv_proj': ['q_proj', 'k_proj', 'v_proj'],
        'gate_up_proj': ['gate_proj', 'up_proj'],
    }

    def __init__(self, config, ctx_mgr=None, prefix='', **kwargs):
        super().__init__()
        self.model = MyModelModel(config, ...)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.ctx_mgr = ctx_mgr

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def forward(self, input_ids, inputs_embeds, past_key_values, attn_metadata, **kwargs):
        hidden_states = self.model(input_ids, inputs_embeds, past_key_values, attn_metadata)
        return hidden_states

    def get_logits(self, hidden_states):
        return self.lm_head(hidden_states)

    # prepare_inputs_for_generation and load_weights: copy from qwen3.py,
    # update stacked_params_mapping to match this model's HF weight names.
```
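
The relationship between HF weight names and packed parameter names can be sketched as a simple lookup. The tuple layout `(packed_suffix, hf_suffix, shard_id)` and the helper name below are assumptions for illustration; the actual `stacked_params_mapping` should be copied from `qwen3.py` as noted above.

```python
# Hypothetical mapping: (packed suffix, HF suffix, shard id)
DEMO_STACKED_PARAMS_MAPPING = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def resolve_packed_name(hf_name, mapping=DEMO_STACKED_PARAMS_MAPPING):
    """Map an HF weight name to (packed_name, shard_id); shard_id is None if unpacked."""
    for packed_suffix, hf_suffix, shard_id in mapping:
        if hf_suffix in hf_name:
            return hf_name.replace(hf_suffix, packed_suffix), shard_id
    return hf_name, None
```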

______________________________________________________________________

### Step 2 — Register in `module_map.py`

**File:** `lmdeploy/pytorch/models/module_map.py`

Add an entry to `MODULE_MAP`. The key is the exact HF architecture class name from `config.json`'s `architectures` field:

```python
MODULE_MAP.update({
    'MyModelForCausalLM': f'{LMDEPLOY_PYTORCH_MODEL_PATH}.my_model.MyModelForCausalLM',
})
```
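
Because the map values are dotted class paths, a typo only surfaces when the path is resolved. The sketch below shows one generic way to resolve such a path with `importlib`. It is not LMDeploy's loader; `resolve_class` and the demo map are hypothetical.

```python
import importlib


def resolve_class(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demo map using a standard-library class so the sketch runs anywhere
DEMO_MODULE_MAP = {"OrderedDict": "collections.OrderedDict"}
```

A misspelled module raises `ModuleNotFoundError` and a misspelled class raises `AttributeError`, so resolving the new entry once is a quick sanity check.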

______________________________________________________________________

### Step 3 — Add config builder (if needed)

**File:** `lmdeploy/pytorch/configurations/<model_name>.py`

**Skip this step** for models with a standard flat HF config — `DefaultModelConfigBuilder` handles them automatically.

Only create this file when the HF config is non-standard, e.g.:

- Nested config (e.g., Qwen3-Omni has `hf_config.thinker_config.text_config`)
- Unusual `model_type` that needs special field remapping

```python
from .builder import AutoModelConfigBuilder, DefaultModelConfigBuilder

class MyModelConfigBuilder(AutoModelConfigBuilder):
    @classmethod
    def condition(cls, hf_config):
        # Must match model_type from config.json exactly
        return hf_config.model_type == 'my_model'

    @classmethod
    def build(cls, hf_config, model_path=None, **kwargs):
        # Extract the text config if nested; patch fields if needed
        cfg = DefaultModelConfigBuilder.build(hf_config, model_path, **kwargs)
        cfg.hf_config = hf_config  # keep full config for VLM layers
        return cfg
```

Auto-discovery: subclasses of `AutoModelConfigBuilder` register themselves automatically via `__init_subclass__()` — no import needed elsewhere.
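
The self-registration pattern can be illustrated with a small standalone sketch. The classes below are hypothetical stand-ins, not LMDeploy's `AutoModelConfigBuilder`; they only show how `__init_subclass__` can collect subclasses and how a dispatcher can pick the first one whose `condition()` matches.

```python
from types import SimpleNamespace


class AutoBuilder:
    """Hypothetical self-registering base class."""

    _registry = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once for every subclass, at class-definition time
        AutoBuilder._registry.append(cls)

    @classmethod
    def condition(cls, hf_config):
        raise NotImplementedError

    @classmethod
    def build(cls, hf_config):
        raise NotImplementedError


def dispatch(hf_config):
    """Call build() on the first registered builder whose condition matches."""
    for builder in AutoBuilder._registry:
        if builder.condition(hf_config):
            return builder.build(hf_config)
    raise LookupError(f"no builder for model_type={hf_config.model_type!r}")


class MyModelBuilder(AutoBuilder):
    @classmethod
    def condition(cls, hf_config):
        return hf_config.model_type == "my_model"

    @classmethod
    def build(cls, hf_config):
        return {"builder": cls.__name__, "model_type": hf_config.model_type}
```

In plain Python the hook runs when the `class` statement executes, that is, when the defining module is imported.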

______________________________________________________________________

### Step 4 — Add quantization mappings (optional)

Only needed to support AWQ/SmoothQuant calibration for this model family.

**`lmdeploy/lite/apis/calibrate.py`** — add layer name, norm name, and head name mappings for the new model type.

**`lmdeploy/lite/quantization/awq.py`** — add entries to `NORM_FCS_MAP` (norm → downstream FC layers) and `FC_FCS_MAP` (FC → downstream FC layers) following the existing patterns.
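
As a rough illustration of the "norm → downstream FC layers" and "FC → downstream FC layers" shape, the sketch below uses hypothetical entries for a Llama-style decoder layer. It assumes the maps are nested dicts keyed by decoder-layer class name, which should be confirmed against the existing entries in `awq.py`. The checker `missing_modules` is also hypothetical.

```python
# Hypothetical entries; layer and module names are assumptions
DEMO_NORM_FCS_MAP = {
    "MyModelDecoderLayer": {
        "input_layernorm": ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
        "post_attention_layernorm": ["mlp.gate_proj", "mlp.up_proj"],
    },
}
DEMO_FC_FCS_MAP = {
    "MyModelDecoderLayer": {
        "self_attn.v_proj": ["self_attn.o_proj"],
        "mlp.up_proj": ["mlp.down_proj"],
    },
}


def missing_modules(layer_map, module_names):
    """Return names referenced by layer_map that are absent from module_names."""
    referenced = set()
    for source, targets in layer_map.items():
        referenced.add(source)
        referenced.update(targets)
    return sorted(referenced - set(module_names))
```

Running such a check against the new decoder layer's submodule names catches mappings that refer to modules the model does not have.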

______________________________________________________________________

## Step-by-Step: VLM (additional steps)

### Step 5 — Create the VL preprocessor

**File:** `lmdeploy/vl/model/<model_name>.py`

The preprocessor handles image/video decoding and feature extraction before the LLM backbone sees the input.

```python
from lmdeploy.vl.model.base import VISION_MODELS, VisionModel

@VISION_MODELS.register_module()
class MyModelVLModel(VisionModel):
    # Must match hf_config.architectures exactly (can be a list for variants)
    _arch = ['MyModelForConditionalGeneration']

    def build_preprocessor(self):
        """Load the vision processor from the model checkpoint."""
        from transformers import AutoProcessor
        self.processor = AutoProcessor.from_pretrained(self.model_path)
        # Set image_token_id to the token ID of the image placeholder
        # (used by the engine to know where to inject image features)
        tokenizer = self.processor.tokenizer
        self.image_token = '<image>'  # model-specific placeholder token
        self.image_token_id = tokenizer.convert_tokens_to_ids(self.image_token)

    # preprocess and to_pytorch: copy from vl/model/qwen3.py and adapt
    # image token handling (image_token, image_token_id, image_tokens count).
```

Key points:

- `collect_images()`, `proc_messages()`, `to_pytorch_aux()` are all provided by `VisionModel` — do not re-implement them.
- `image_tokens` tells the engine how many token slots to reserve for each image.
- Auto-registered via `@VISION_MODELS.register_module()` when the module is imported. **Add an explicit import** in `lmdeploy/vl/model/builder.py` alongside the existing imports so the decorator runs at startup:

```python
from .my_model import MyModelVLModel  # noqa F401
```
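
The need for the explicit import follows from how decorator-based registries work: the decorator only runs when the module defining the class is executed. The sketch below is a minimal stand-in, not the real `VISION_MODELS` implementation; `Registry`, `DemoVLModel`, and `find_by_arch` are hypothetical names.

```python
class Registry:
    """Hypothetical minimal registry keyed by class name."""

    def __init__(self):
        self.modules = {}

    def register_module(self):
        def decorator(cls):
            # Executed when the decorated class body is evaluated,
            # i.e. when its module is imported.
            self.modules[cls.__name__] = cls
            return cls
        return decorator


REGISTRY = Registry()


@REGISTRY.register_module()
class DemoVLModel:
    _arch = ['DemoForConditionalGeneration']


def find_by_arch(arch):
    """Return the registered class that declares the given architecture."""
    for cls in REGISTRY.modules.values():
        if arch in cls._arch:
            return cls
    return None


print(find_by_arch('DemoForConditionalGeneration').__name__)
```

If the module containing `DemoVLModel` were never imported, `REGISTRY.modules` would stay empty and the lookup would return `None`, which is the failure mode the explicit import guards against.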

______________________________________________________________________

### Step 6 — Register VLM arch in `archs.py`

**File:** `lmdeploy/archs.py`

Add the architecture name to the `supported_archs` set inside `check_vl_llm()` so the engine routes the model through the VLM code path:

```python
# lmdeploy/archs.py — inside check_vl_llm()
supported_archs = set([
    ...
    'MyModelForConditionalGeneration',  # add this line
])
```

______________________________________________________________________

## Checklist

**LLM (PyTorch backend):**

- [ ] `pytorch/models/<model>.py` — all 5 classes implemented (`Attention`, `MLP`, `DecoderLayer`, `Model`, `ForCausalLM`)
- [ ] `module_map.py` — HF architecture class name registered
- [ ] `packed_modules_mapping` matches HF parameter naming scheme
- [ ] `stacked_params_mapping` in `load_weights()` has correct shard indices
- [ ] `pytorch/configurations/<model>.py` — added only if HF config is non-standard
- [ ] Weights load cleanly from HF checkpoint (no missing/unexpected key errors)

**VLM (additional):**

- [ ] `vl/model/<model>.py` — `build_preprocessor`, `preprocess`, `to_pytorch` implemented
- [ ] `_arch` matches `config.json` `architectures[0]` exactly
- [ ] `image_token_id` correctly resolved from the tokenizer
- [ ] `image_tokens` count is correct for the image resolution/encoding scheme
- [ ] `vl/model/builder.py` — explicit import added for new model
- [ ] `archs.py` entry added

**Quantization (optional):**

- [ ] `calibrate.py` — layer/norm/head name mappings added
- [ ] `awq.py` — `NORM_FCS_MAP` / `FC_FCS_MAP` entries added

______________________________________________________________________

## Common Pitfalls

1. **Weight name mismatches** — `packed_modules_mapping` keys must match HF param name suffixes exactly. Check actual HF weight names with `list(model.state_dict().keys())[:20]` before coding.
2. **Wrong shard index order** — `stacked_params_mapping` for QKV must follow Q→0, K→1, V→2. Wrong order silently produces bad outputs.
3. **Wrong `_arch`** — must match `hf_config.architectures[0]` literally (e.g., `'Qwen3VLForConditionalGeneration'`, not `'Qwen3VL'`).
4. **`image_token_id` is None** — causes the engine to silently skip image feature injection. Always verify with `tokenizer.convert_tokens_to_ids(image_token)` returning a real token ID.
5. **Missing `role='preprocess'` append** — `to_pytorch_aux()` searches messages for exactly `role='preprocess'`; if `preprocess()` does not append it, inference will fail with a confusing error.
6. **Config builder `condition()` mismatch** — `model_type` in `condition()` must match the exact string in `config.json`, not a display name or alias.
7. **MoE routing** — MoE models need `num_experts`, `num_experts_per_tok`, and a TopK gating mechanism in the MLP. Reference `qwen3_moe.py` for the pattern.
8. **CUDA graph + dynamic control flow** — models with data-dependent branching (e.g., conditional expert dispatch) may break CUDA graph capture. Use `_no_cudagraph` guards in `CudaGraphMixin` if needed.
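
For pitfall 1, a quick pre-flight check can catch name mismatches before any weights are loaded. The sketch below assumes `packed_modules_mapping` maps a packed module name to the list of HF sub-module names it fuses; `missing_packed_sources` is a hypothetical helper, and the weight names are made up for illustration.

```python
def missing_packed_sources(weight_names, packed_modules_mapping):
    """Return (packed, source) pairs whose source name appears in no weight name."""
    missing = []
    for packed_name, sources in packed_modules_mapping.items():
        for source in sources:
            # Compare whole dotted components so 'q_proj' cannot match 'kq_proj'.
            if not any(f'.{source}.' in f'.{name}.' for name in weight_names):
                missing.append((packed_name, source))
    return missing


# Made-up checkpoint keys; in practice take them from the HF state dict.
weight_names = [
    'model.layers.0.self_attn.q_proj.weight',
    'model.layers.0.self_attn.k_proj.weight',
    'model.layers.0.self_attn.v_proj.weight',
    'model.layers.0.mlp.gate_proj.weight',
    'model.layers.0.mlp.w3.weight',
]
mapping = {
    'qkv_proj': ['q_proj', 'k_proj', 'v_proj'],
    'gate_up_proj': ['gate_proj', 'up_proj'],
}

print(missing_packed_sources(weight_names, mapping))
```

An empty result means every source name in the mapping was found in the checkpoint; it does not confirm that shard indices are in the right order.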

______________________________________________________________________

## Verification

**LLM basic test:**

```bash
python -m lmdeploy.pytorch.chat <model_path> --backend pytorch
```

**VLM basic test:**

```python
from lmdeploy import pipeline
pipe = pipeline('<model_path>')
result = pipe(('Describe this image.', 'path/to/image.jpg'))
print(result.text)
```

**Unit tests:**

```bash
pytest tests/test_lmdeploy/test_vl/     # VLM tests
pytest tests/test_lmdeploy/             # all unit tests
```

**Debug weight loading:**

```bash
LMDEPLOY_LOG_LEVEL=DEBUG python -m lmdeploy.pytorch.chat <model_path> --backend pytorch 2>&1 | grep -E "load|weight|miss"
```


================================================
FILE: .github/CONTRIBUTING.md
================================================
## Contributing to LMDeploy

Welcome to the LMDeploy community! All kinds of contributions are welcome, including but not limited to the following.

**Fix bug**

You can directly post a Pull Request to fix typos in code or documents.

The steps to fix a bug in the code implementation are as follows.

1. If the modification involves significant changes, you should create an issue first and describe the error information and how to trigger the bug. Other developers will discuss it with you and propose a proper solution.

2. Post a pull request after fixing the bug and adding the corresponding unit test.

**New Feature or Enhancement**

1. If the modification involves significant changes, you should create an issue to discuss it with our developers and agree on a proper design.
2. Post a Pull Request after implementing the new feature or enhancement and adding the corresponding unit test.

**Document**

You can directly post a pull request to fix documents. If you want to add a document, you should first create an issue to check if it is reasonable.

### Pull Request Workflow

If you're not familiar with Pull Requests, don't worry! The following guidance will tell you how to create a Pull Request step by step. If you want to dive into the development mode of Pull Requests, you can refer to the [official documents](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests).

#### 1. Fork and clone

If you are posting a pull request for the first time, you should fork the LMDeploy repository by clicking the **Fork** button in the top right corner of the GitHub page. The forked repository will appear under your GitHub profile.

<img src="https://user-images.githubusercontent.com/57566630/167305749-43c7f4e9-449b-4e98-ade5-0c9276d5c9ce.png" width="1200">

Then, you can clone the repository to your local machine:

```shell
git clone git@github.com:{username}/lmdeploy.git
```

After that, you should add the official repository as the upstream repository:

```bash
git remote add upstream git@github.com:InternLM/lmdeploy.git
```

Check whether the remote repository has been added successfully with `git remote -v`:

```bash
origin	git@github.com:{username}/lmdeploy.git (fetch)
origin	git@github.com:{username}/lmdeploy.git (push)
upstream	git@github.com:InternLM/lmdeploy.git (fetch)
upstream	git@github.com:InternLM/lmdeploy.git (push)
```

> Here's a brief introduction to origin and upstream. When we use "git clone", we create an "origin" remote by default, which points to the repository we cloned from. As for "upstream", we add it ourselves to point to the target repository. Of course, if you don't like the name "upstream", you can name it as you wish. Usually, we push the code to "origin". If the pushed code conflicts with the latest code in the official repository ("upstream"), we should pull the latest code from upstream to resolve the conflicts, and then push to "origin" again. The posted Pull Request will be updated automatically.

#### 2. Configure pre-commit

You should configure [pre-commit](https://pre-commit.com/#intro) in the local development environment to make sure the code style matches that of LMDeploy. **Note**: The following code should be executed under the lmdeploy directory.

```shell
pip install -U pre-commit
pre-commit install
```

Check that pre-commit is configured successfully, and install the hooks defined in `.pre-commit-config.yaml`.

```shell
pre-commit run --all-files
```

<img src="https://user-images.githubusercontent.com/57566630/173660750-3df20a63-cb66-4d33-a986-1f643f1d8aaf.png" width="1200">

<img src="https://user-images.githubusercontent.com/57566630/202368856-0465a90d-8fce-4345-918e-67b8b9c82614.png" width="1200">

If the installation process is interrupted, you can repeatedly run `pre-commit run ... ` to continue the installation.

If the code does not conform to the code style specification, pre-commit will raise a warning and fix some of the errors automatically.

<img src="https://user-images.githubusercontent.com/57566630/202369176-67642454-0025-4023-a095-263529107aa3.png" width="1200">

If we want to commit our code while bypassing the pre-commit hook, we can use the `--no-verify` option (**only for temporary commits**).

```shell
git commit -m "xxx" --no-verify
```

#### 3. Create a development branch

After configuring pre-commit, we should create a branch based on the main branch to develop the new feature or fix the bug. The suggested branch name is `username/pr_name`.

```shell
git checkout -b yhc/refactor_contributing_doc
```

In subsequent development, if the main branch of the local repository falls behind the main branch of "upstream", we need to pull from upstream to synchronize before creating a new branch:

```shell
git pull upstream main
```

#### 4. Commit the code and pass the unit test

- lmdeploy introduces mypy to do static type checking to increase the robustness of the code. Therefore, we need to add Type Hints to our code and pass the mypy check. If you are not familiar with Type Hints, you can refer to [this tutorial](https://docs.python.org/3/library/typing.html).

- The committed code should pass the unit tests

  ```shell
  # Pass all unit tests
  pytest tests

  # Pass the unit test of runner
  pytest tests/test_runner/test_runner.py
  ```

  If the unit test fails for lack of dependencies, you can install the dependencies referring to the [guidance](#unit-test)

- If the documents are modified/added, we should check the rendering result referring to [guidance](#document-rendering)
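
To make the type-hint requirement in the first item concrete, here is a small sketch of a fully annotated function of the kind mypy can check. `mean_latency` is a hypothetical example, not a function from the lmdeploy codebase.

```python
from typing import List, Optional


def mean_latency(latencies_ms: List[float], default: Optional[float] = None) -> Optional[float]:
    """Return the mean latency in milliseconds, or ``default`` for empty input."""
    if not latencies_ms:
        return default
    return sum(latencies_ms) / len(latencies_ms)


print(mean_latency([1.0, 3.0]))
```

Annotating both the parameters and the return value lets mypy flag callers that forget to handle the `None` case.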

#### 5. Push the code to remote

We can push the local commits to the remote after they pass the unit tests and the pre-commit checks. You can associate the local branch with the remote branch by adding the `-u` option.

```shell
git push -u origin {branch_name}
```

This will allow you to use the `git push` command to push code directly next time, without having to specify a branch or the remote repository.

#### 6. Create a Pull Request

(1) Create a pull request in GitHub's Pull request interface

<img src="https://user-images.githubusercontent.com/57566630/201533288-516f7ac4-0b14-4dc8-afbd-912475c368b5.png" width="1200">

(2) Modify the PR description according to the guidelines so that other developers can better understand your changes

<img src="https://user-images.githubusercontent.com/57566630/202242953-c91a18ff-e388-4ff9-8591-5fae0ead6c1e.png" width="1200">

Find more details about Pull Request description in [pull request guidelines](#pr-specs).

**note**

(a) The Pull Request description should contain the reason for the change, the content of the change, and the impact of the change, and be associated with the relevant Issue (see [documentation](https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue))

(b) If it is your first contribution, please sign the CLA

<img src="https://user-images.githubusercontent.com/57566630/167307569-a794b967-6e28-4eac-a942-00deb657815f.png" width="1200">

(c) Check whether the Pull Request passes the CI

<img src="https://user-images.githubusercontent.com/57566630/167307490-f9ebf9fa-63c0-4d83-8ba1-081ea169eb3a.png" width="1200">

LMDeploy will run unit tests for the posted Pull Request on different platforms (Linux, Windows, Mac), based on different versions of Python, PyTorch, and CUDA, to make sure the code is correct. We can see the specific test information by clicking `Details` in the above image so that we can modify the code.

(3) If the Pull Request passes the CI, then you can wait for the review from other developers. You'll modify the code based on the reviewer's comments, and repeat the steps [4](#4-commit-the-code-and-pass-the-unit-test)-[5](#5-push-the-code-to-remote) until all reviewers approve it. Then, we will merge it ASAP.

<img src="https://user-images.githubusercontent.com/57566630/202145400-cc2cd8c4-10b0-472f-ba37-07e6f50acc67.png" width="1200">

#### 7. Resolve conflicts

If your local branch conflicts with the latest main branch of "upstream", you'll need to resolve the conflicts. There are two ways to do this:

```shell
git fetch --all --prune
git rebase upstream/main
```

or

```shell
git fetch --all --prune
git merge upstream/main
```

If you are very good at handling conflicts, then you can use rebase to resolve conflicts, as this will keep your commit logs tidy. If you are not familiar with `rebase`, then you can use `merge` to resolve conflicts.

### Guidance

#### Document rendering

If the documents are modified/added, we should check the rendering result. We could install the dependencies and run the following command to render the documents and check the results:

```shell
pip install -r requirements/docs.txt
cd docs/zh_cn/
# or docs/en
make html
# check file in ./docs/zh_cn/_build/html/index.html
```

### Code style

#### Python

We adopt [PEP8](https://www.python.org/dev/peps/pep-0008/) as the preferred code style.

We use the following tools for linting and formatting:

- [flake8](https://github.com/PyCQA/flake8): A wrapper around some linter tools.
- [isort](https://github.com/timothycrosley/isort): A Python utility to sort imports.
- [yapf](https://github.com/google/yapf): A formatter for Python files.
- [codespell](https://github.com/codespell-project/codespell): A Python utility to fix common misspellings in text files.
- [mdformat](https://github.com/executablebooks/mdformat): Mdformat is an opinionated Markdown formatter that can be used to enforce a consistent style in Markdown files.
- [docformatter](https://github.com/myint/docformatter): A formatter to format docstring.

We use a [pre-commit hook](https://pre-commit.com/) that checks and formats `flake8`, `yapf`, `isort`, `trailing whitespaces`, and `markdown files`,
fixes `end-of-files`, `double-quoted-strings`, `python-encoding-pragma`, and `mixed-line-ending`, and sorts `requirements.txt` automatically on every commit.
The config for the pre-commit hook is stored in [.pre-commit-config](../.pre-commit-config.yaml).

#### C++ and CUDA

The clang-format config is stored in [.clang-format](../.clang-format). It is recommended to use clang-format version **11**. Please do not use older or newer versions, as they produce different formatting results, which can cause the [lint](https://github.com/InternLM/lmdeploy/blob/main/.github/workflows/lint.yml#L25) check to fail.

### PR Specs

1. Use [pre-commit](https://pre-commit.com) hook to avoid issues of code style

2. One short-lived branch should be matched with only one PR

3. Accomplish one focused change per PR. Avoid large PRs

   - Bad: Support Faster R-CNN
   - Acceptable: Add a box head to Faster R-CNN
   - Good: Add a parameter to box head to support custom conv-layer number

4. Provide clear and meaningful commit messages

5. Provide clear and meaningful PR description

   - Task name should be clarified in title. The general format is: \[Prefix\] Short description of the PR (Suffix)
   - Prefix: add new feature \[Feature\], fix bug \[Fix\], related to documents \[Docs\], in developing \[WIP\] (which will not be reviewed temporarily)
   - Introduce main changes, results and influences on other modules in short description
   - Associate related issues and pull requests with a milestone
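
The title format above can be checked mechanically. The sketch below is only an illustration: `check_pr_title` and the regular expression are hypothetical, the prefix list is taken from the bullet above, and the optional suffix is not validated.

```python
import re

# Prefixes listed in the PR specs above.
ALLOWED_PREFIXES = ('Feature', 'Fix', 'Docs', 'WIP')

# "[Prefix] Short description", with at least one space after the bracket.
TITLE_PATTERN = re.compile(r'^\[(?P<prefix>[^\]]+)\]\s+(?P<description>\S.*)$')


def check_pr_title(title: str) -> bool:
    """Return True if the title starts with an allowed [Prefix] and has a description."""
    match = TITLE_PATTERN.match(title)
    return bool(match) and match.group('prefix') in ALLOWED_PREFIXES


print(check_pr_title('[Fix] Handle empty csv in add_summary'))
```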


================================================
FILE: .github/ISSUE_TEMPLATE/1-bug-report.yml
================================================
name: 🐞 Bug report
description: Create a report to help us reproduce and fix the bug
title: "[Bug] "
labels: ['Bug']

body:
- type: checkboxes
  attributes:
    label: Checklist
    options:
    - label: 1. I have searched related issues but cannot get the expected help.
    - label: 2. The bug has not been fixed in the latest version.
    - label: 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- type: textarea
  attributes:
    label: Describe the bug
    description: A clear and concise description of what the bug is.
  validations:
    required: true
- type: textarea
  attributes:
    label: Reproduction
    description: |
      1. What command or script did you run?
    placeholder: |
      A placeholder for the command.
  validations:
    required: true
- type: textarea
  attributes:
    label: Environment
    description: |
      1. Please run `lmdeploy check_env` to collect necessary environment information and paste it here.
      2. You may add additional information that may be helpful for locating the problem, such as
         - Which **model** are you using?
         - How you installed PyTorch \[e.g., pip, conda, source\]
         - Other environment variables that may be related (such as `$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.)
    placeholder: Environment here.
    render: Shell
  validations:
    required: true
- type: textarea
  attributes:
    label: Error traceback
    description: |
      If applicable, paste the error traceback here.
    placeholder: Logs and traceback here.
    render: Shell
- type: markdown
  attributes:
    value: >
     If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

     Thanks for your bug report. We appreciate it a lot.


================================================
FILE: .github/ISSUE_TEMPLATE/2-feature-request.yml
================================================
name: 🚀 Feature request
description: Suggest an idea for this project
title: "[Feature] "

body:
- type: markdown
  attributes:
    value: |
      We strongly appreciate you creating a PR to implement this feature [here](https://github.com/InternLM/lmdeploy/pulls)!
      If you need our help, please fill in as much of the following form as you're able to.

      **The less clear the description, the longer it will take to solve it.**
- type: textarea
  attributes:
    label: Motivation
    description: |
      A clear and concise description of the motivation of the feature.
      Ex1. It is inconvenient when \[....\].
  validations:
    required: true
- type: textarea
  attributes:
    label: Related resources
    description: |
      If there is an official code release or third-party implementations, please also provide the information here, which would be very helpful.
- type: textarea
  attributes:
    label: Additional context
    description: |
      Add any other context or screenshots about the feature request here.
      If you would like to implement the feature and create a PR, please leave a comment here and that would be much appreciated.


================================================
FILE: .github/ISSUE_TEMPLATE/3-documentation.yml
================================================
name: 📚 Documentation
description: Report an issue related to the documentation.
labels: "kind/doc,status/unconfirmed"
title: "[Docs] "

body:
- type: textarea
  attributes:
    label: 📚 The doc issue
    description: >
      A clear and concise description of the issue.
  validations:
    required: true

- type: textarea
  attributes:
    label: Suggest a potential alternative/fix
    description: >
      Tell us how we could improve the documentation in this regard.
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!


================================================
FILE: .github/pull_request_template.md
================================================
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

## Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

## Modification

Please briefly describe what modification is made in this PR.

## BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

## Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

## Checklist

1. Pre-commit or other linting tools are used to fix the potential lint issues.
2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
4. The documentation has been modified accordingly, like docstring or example tutorials.


================================================
FILE: .github/release.yml
================================================
changelog:
  categories:
    - title: 🚀 Features
      labels:
        - feature
        - enhancement
    - title: 💥 Improvements
      labels:
        - improvement
    - title: 🐞 Bug fixes
      labels:
        - bug
        - Bug:P0
        - Bug:P1
        - Bug:P2
        - Bug:P3
    - title: 📚 Documentations
      labels:
        - documentation
    - title: 🌐 Other
      labels:
        - '*'
      exclude:
        labels:
          - feature
          - enhancement
          - improvement
          - bug
          - documentation
          - Bug:P0
          - Bug:P1
          - Bug:P2
          - Bug:P3


================================================
FILE: .github/scripts/action_tools.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import glob
import json
import logging
import os
import shutil
import subprocess
import time
from collections import OrderedDict
from typing import List

import fire
import pandas as pd
from mmengine.config import Config


def run_cmd(cmd_lines: List[str], log_path: str, cwd: str = None):
    """
    Args:
        cmd_lines (list[str]): A command in multiple line style.
        log_path (str): Path to log file.
        cwd (str): Path to the current working directory.

    Returns:
        int: error code.
    """
    import platform

    system = platform.system().lower()

    if system == 'windows':
        sep = r'`'
    else:  # 'Linux', 'Darwin'
        sep = '\\'
    cmd_for_run = ' '.join(cmd_lines)
    cmd_for_log = f' {sep}\n'.join(cmd_lines) + '\n'
    with open(log_path, 'w', encoding='utf-8') as file_handler:
        file_handler.write(f'Command: {cmd_for_log}\n')
        file_handler.flush()
        process_res = subprocess.Popen(cmd_for_run, shell=True, cwd=cwd, stdout=file_handler, stderr=file_handler)
        process_res.wait()
        return_code = process_res.returncode

    if return_code != 0:
        logging.error(f'Got shell abnormal return code={return_code}')
        with open(log_path, 'r') as f:
            content = f.read()
            logging.error(f'Log error message\n{content}')
    return return_code


def _append_summary(content):
    summary_file = os.environ['GITHUB_STEP_SUMMARY']
    with open(summary_file, 'a') as f:
        f.write(content + '\n')


def add_summary(csv_path: str):
    """Add csv file to github step summary.

    Args:
        csv_path (str): Input csv file.
    """
    with open(csv_path, 'r') as fr:
        lines = fr.readlines()
        header = lines[0].strip().split(',')
        n_col = len(header)
        header = '|' + '|'.join(header) + '|'
        aligner = '|' + '|'.join([':-:'] * n_col) + '|'
        _append_summary(header)
        _append_summary(aligner)
        for line in lines[1:]:
            line = '|' + line.strip().replace(',', '|') + '|'
            _append_summary(line)
        _append_summary('\n')


def evaluate(models: List[str],
             datasets: List[str],
             workspace: str,
             evaluate_type: str,
             max_num_workers: int = 8,
             is_smoke: bool = False):
    """Evaluate models from lmdeploy using opencompass.

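    Args:
        models: Input models.
        datasets: Dataset names written into the generated config.
        workspace: Working directory.
        evaluate_type: Config variant; selects
            `.github/scripts/eval_{evaluate_type}_config.py`.
        max_num_workers: Value passed to opencompass `--max-num-workers`.
        is_smoke: If True, restrict each dataset to the first 50 samples.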
    """
    os.makedirs(workspace, exist_ok=True)
    output_csv = os.path.join(workspace, f'results_{evaluate_type}.csv')
    if os.path.exists(output_csv):
        os.remove(output_csv)
    num_model = len(models)
    for idx, ori_model in enumerate(models):
        print()
        print(50 * '==')
        print(f'Start evaluating {idx+1}/{num_model} {ori_model} ...')
        model = ori_model.lower()

        lmdeploy_dir = os.path.abspath(os.environ['LMDEPLOY_DIR'])
        config_path = os.path.join(lmdeploy_dir, f'.github/scripts/eval_{evaluate_type}_config.py')
        config_path_new = os.path.join(lmdeploy_dir, 'eval_lmdeploy.py')
        if os.path.exists(config_path_new):
            os.remove(config_path_new)
        shutil.copy(config_path, config_path_new)

        cfg = Config.fromfile(config_path_new)
        if not hasattr(cfg, model):
            logging.error(f'Model {model} not in configuration file')
            continue

        model_cfg = cfg[model]
        logging.info(f'Start evaluating {model} ...\n\n{model_cfg}\n\n')

        with open(config_path_new, 'a') as f:
            f.write(f'\ndatasets = {datasets}\n')
            if is_smoke:
                f.write('\nfor d in datasets:\n')
                f.write("    if d['reader_cfg'] is not None:\n")
                f.write("        d['reader_cfg']['test_range'] = '[0:50]'\n")
            if model.startswith('hf'):
                f.write(f'\nmodels = [*{model}]\n')
            else:
                f.write(f'\nmodels = [{model}]\n')

        work_dir = os.path.join(workspace, model)
        cmd_eval = [
            f'opencompass {config_path_new} -w {work_dir} --reuse --max-num-workers {max_num_workers} --dump-res-length'  # noqa: E501
        ]
        eval_log = os.path.join(workspace, f'eval.{ori_model}.txt')
        start_time = time.time()
        ret = run_cmd(cmd_eval, log_path=eval_log, cwd=lmdeploy_dir)
        end_time = time.time()
        task_duration_seconds = round(end_time - start_time, 2)
        logging.info(f'task_duration_seconds: {task_duration_seconds}\n')
        if ret != 0:
            continue
        csv_files = glob.glob(f'{work_dir}/*/summary/summary_*.csv')

        if len(csv_files) < 1:
            logging.error(f'Did not find summary csv file {csv_files}')
            continue
        else:
            csv_file = max(csv_files, key=os.path.getctime)
        # print csv_txt to screen
        csv_txt = csv_file.replace('.csv', '.txt')
        if os.path.exists(csv_txt):
            with open(csv_txt, 'r') as f:
                print(f.read())

        # parse evaluation results from csv file
        model_results = OrderedDict()
        with open(csv_file, 'r') as f:
            lines = f.readlines()
            for line in lines[1:]:
                row = line.strip().split(',')
                row = [_.strip() for _ in row]
                if row[-1] != '-':
                    model_results[row[0]] = row[-1]
        crows_pairs_json = glob.glob(os.path.join(work_dir, '*/results/*/crows_pairs.json'), recursive=True)
        if len(crows_pairs_json) == 1:
            with open(crows_pairs_json[0], 'r') as f:
                acc = json.load(f)['accuracy']
                acc = f'{float(acc):.2f}'  # noqa E231
                model_results['crows_pairs'] = acc
        logging.info(f'\n{model}\n{model_results}')
        dataset_names = list(model_results.keys())

        row = ','.join([model, str(task_duration_seconds)] + [model_results[_] for _ in dataset_names])

        if not os.path.exists(output_csv):
            with open(output_csv, 'w') as f:
                header = ','.join(['Model', 'task_duration_secs'] + dataset_names)
                f.write(header + '\n')
                f.write(row + '\n')
        else:
            with open(output_csv, 'a') as f:
                f.write(row + '\n')

    # write to github action summary
    _append_summary('## Evaluation Results')
    if os.path.exists(output_csv):
        add_summary(output_csv)


def create_model_links(src_dir: str, dst_dir: str):
    """Create softlinks for models."""
    paths = glob.glob(os.path.join(src_dir, '*'))
    model_paths = [os.path.abspath(p) for p in paths if os.path.isdir(p)]
    os.makedirs(dst_dir, exist_ok=True)
    for src in model_paths:
        _, model_name = os.path.split(src)
        dst = os.path.join(dst_dir, model_name)
        if not os.path.exists(dst):
            os.symlink(src, dst)
        else:
            logging.warning(f'Model_path exists: {dst}')


def generate_benchmark_report(report_path: str):
    # write to github action summary
    _append_summary('## Benchmark Results Start')
    subfolders = [f.path for f in os.scandir(report_path) if f.is_dir()]
    for dir_path in subfolders:
        second_subfolders = [f.path for f in sorted(os.scandir(dir_path), key=lambda x: x.name) if f.is_dir()]
        for sec_dir_path in second_subfolders:
            model = sec_dir_path.replace(report_path + '/', '')
            print('-' * 25, model, '-' * 25)
            _append_summary('-' * 25 + model + '-' * 25 + '\n')

            benchmark_subfolders = [
                f.path for f in sorted(os.scandir(sec_dir_path), key=lambda x: x.name) if f.is_dir()
            ]
            for backend_subfolder in benchmark_subfolders:
                benchmark_type = backend_subfolder.replace(sec_dir_path + '/', '')
                print('*' * 10, benchmark_type, '*' * 10)
                _append_summary('-' * 10 + benchmark_type + '-' * 10 + '\n')
                merged_csv_path = os.path.join(backend_subfolder, 'summary.csv')
                csv_files = glob.glob(os.path.join(backend_subfolder, '*.csv'))
                average_csv_path = os.path.join(backend_subfolder, 'average.csv')
                if merged_csv_path in csv_files:
                    csv_files.remove(merged_csv_path)
                if average_csv_path in csv_files:
                    csv_files.remove(average_csv_path)
                merged_df = pd.DataFrame()

                if len(csv_files) > 0:
                    for f in csv_files:
                        df = pd.read_csv(f)
                        merged_df = pd.concat([merged_df, df], ignore_index=True)
                    if 'throughput' in backend_subfolder or 'longtext' in backend_subfolder:
                        merged_df = merged_df.sort_values(by=merged_df.columns[1])

                        grouped_df = merged_df.groupby(merged_df.columns[1])
                    else:
                        merged_df = merged_df.sort_values(by=merged_df.columns[0])

                        grouped_df = merged_df.groupby(merged_df.columns[0])
                    if 'generation' not in backend_subfolder:
                        average_values = grouped_df.mean(numeric_only=True).round(decimals=3)
                        average_values.to_csv(average_csv_path, index=True)
                        avg_df = pd.read_csv(average_csv_path)
                        merged_df = pd.concat([merged_df, avg_df], ignore_index=True)
                        add_summary(average_csv_path)
                    merged_df.to_csv(merged_csv_path, index=False)
                    if 'generation' in backend_subfolder:
                        add_summary(merged_csv_path)

    _append_summary('## Benchmark Results End')


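def generate_csv_from_profile_result(file_path: str, out_path: str):
    """Convert a JSON-lines profile result file into a CSV summary."""
    with open(file_path, 'r') as f:
        # each non-empty line of the input file is one JSON record
        data = [json.loads(line) for line in f if line.strip()]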

        data_csv = []
        for item in data:
            row = [
                item.get('request_rate'),
                item.get('completed'),
                round(item.get('completed') / item.get('duration'), 3),
                round(item.get('median_ttft_ms'), 3),
                round(item.get('output_throughput'), 3)
            ]
            data_csv.append(row)
        import csv
        with open(out_path, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['request_rate', 'completed', 'RPM', 'median_ttft_ms', 'output_throughput'])
            writer.writerows(data_csv)
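

# Illustrative sketch, not part of the original script: the per-record mapping used by
# generate_csv_from_profile_result, pulled out as a pure function so it can be checked
# without touching the filesystem. The name `profile_record_to_row` is hypothetical, and
# the record is assumed to carry numeric 'completed', 'duration', 'median_ttft_ms' and
# 'output_throughput' fields with a non-zero duration.
def profile_record_to_row(item: dict) -> list:
    return [
        item.get('request_rate'),
        item.get('completed'),
        # written under the 'RPM' header; computed as completed / duration
        round(item['completed'] / item['duration'], 3),
        round(item['median_ttft_ms'], 3),
        round(item['output_throughput'], 3),
    ]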


def generate_output_for_evaluation(result_dir: str):
    # find latest result
    latest_csv_file = find_csv_files(result_dir)
    df = pd.read_csv(latest_csv_file)
    transposed_df = df.T
    head_part = transposed_df.head(4)
    tail_part = transposed_df[4:]
    sorted_tail_part = tail_part.sort_index()
    transposed_df = pd.concat([head_part, sorted_tail_part])
    transposed_df.to_csv('transposed_output.csv', header=False, index=True)
    # output to github action summary
    add_summary('transposed_output.csv')


def find_csv_files(directory):
    """Return the most recently changed ``summary*.csv`` file under ``directory``."""
    csv_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.csv') and file.startswith('summary'):
                csv_files.append(os.path.join(root, file))

    if not csv_files:
        raise FileNotFoundError(f'No summary*.csv file found under {directory}')
    # getctime is the metadata-change time on Unix and the creation time on Windows
    return sorted(csv_files, key=os.path.getctime)[-1]


if __name__ == '__main__':
    fire.Fire()


================================================
FILE: .github/scripts/check_lmdeploy.py
================================================
# Copyright (c) MegFlow. All rights reserved.
import glob
import os

import fire


def check_module_init(root: str):
    """Check that every package directory under ``root`` has an ``__init__.py`` file."""
    # directories that are not Python packages and need no __init__.py
    skip_prefixes = (
        'lmdeploy/bin',
        'lmdeploy/lib',
        'lmdeploy/monitoring',
        'lmdeploy/serve/turbomind/triton_models',
        'lmdeploy/serve/turbomind/triton_python_backend',
    )
    all_files = glob.glob(os.path.join(root, '**/*'), recursive=True)
    not_exist = []
    for d in all_files:
        if not os.path.isdir(d):
            continue
        if '__pycache__' in d or d.startswith(skip_prefixes):
            continue
        init_file = os.path.join(d, '__init__.py')
        if not os.path.exists(init_file):
            not_exist.append(init_file)

    assert len(not_exist) == 0, f'Missing files: {not_exist}'


if __name__ == '__main__':
    fire.Fire()


================================================
FILE: .github/scripts/doc_link_checker.py
================================================
# Copyright (c) MegFlow. All rights reserved.
# /bin/python3

import argparse
import os
import re


def make_parser():
    parser = argparse.ArgumentParser('Doc link checker')
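    # argparse's type=bool treats any non-empty string (even 'False') as True, so parse the text explicitly
    parser.add_argument('--http',
                        default=False,
                        type=lambda s: s.lower() in ('1', 'true', 'yes'),
                        help='check http or not')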
    parser.add_argument('--target', default='./docs', type=str, help='the directory or file to check')
    return parser


pattern = re.compile(r'\[.*?\]\(.*?\)')


def analyze_doc(home, path):
    print('analyze {}'.format(path))
    problem_list = []
    code_block = 0
    with open(path) as f:
        lines = f.readlines()
        for line in lines:
            line = line.strip()
            if line.startswith('```'):
                code_block = 1 - code_block

            if code_block > 0:
                continue

            if '[' in line and ']' in line and '(' in line and ')' in line:
                matches = pattern.findall(line)
                for item in matches:
                    # skip links with empty text, such as ![]()
                    if item.find('[') == item.find(']') - 1:
                        continue

                    # process the case [text()]()
                    offset = item.find('](')
                    if offset == -1:
                        continue
                    item = item[offset:]
                    start = item.find('(')
                    end = item.find(')')
                    ref = item[start + 1:end]

                    if ref.startswith('http') or ref.startswith('#'):
                        continue
                    if '.md#' in ref:
                        # drop the '#anchor' suffix and keep the file path
                        ref = ref[:ref.find('#')]
                    fullpath = os.path.join(home, ref)
                    if not os.path.exists(fullpath):
                        problem_list.append(ref)
            else:
                continue
    if len(problem_list) > 0:
        print(f'{path}:')
        for item in problem_list:
            print(f'\t {item}')
        print('\n')
        raise Exception('found link error')
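

# Illustrative sketch, not part of the original script: one way to separate the file
# part of a Markdown link target from its '#anchor' fragment before checking that the
# file exists. The helper name `split_link_target` is hypothetical.
def split_link_target(ref: str):
    # str.partition splits on the first '#'; the anchor is '' when there is none
    path, _, anchor = ref.partition('#')
    return path, anchor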


def traverse(target):
    if os.path.isfile(target):
        analyze_doc(os.path.dirname(target), target)
        return
    for home, dirs, files in os.walk(target):
        for filename in files:
            if filename.endswith('.md'):
                path = os.path.join(home, filename)
                if os.path.islink(path) is False:
                    analyze_doc(home, path)


if __name__ == '__main__':
    args = make_parser().parse_args()
    traverse(args.target)


================================================
FILE: .github/scripts/eval_base_config.py
================================================
from copy import deepcopy

from mmengine.config import read_base
from opencompass.models import TurboMindModel

with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.ARC_c.ARC_c_few_shot_ppl import ARC_c_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.bbh.bbh_gen_98fba6 import bbh_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.ceval.ceval_ppl import ceval_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.cmmlu.cmmlu_ppl_041cbf import cmmlu_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.crowspairs.crowspairs_ppl import crowspairs_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.drop.drop_gen_a2697c import drop_datasets  # noqa: F401, E501
    # Corebench v1.7
    from opencompass.configs.datasets.GaokaoBench.GaokaoBench_no_subjective_gen_d21e37 import \
        GaokaoBench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.gpqa.gpqa_few_shot_ppl_4b5a83 import gpqa_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.gsm8k.gsm8k_gen_17d0dc import gsm8k_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.hellaswag.hellaswag_10shot_ppl_59c85e import \
        hellaswag_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.humaneval.internal_humaneval_gen_ce6b06 import \
        humaneval_datasets as humaneval_v2_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.humaneval.internal_humaneval_gen_d2537e import \
        humaneval_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.math.math_4shot_base_gen_43d5b6 import math_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.MathBench.mathbench_2024_few_shot_mixed_4a3fd4 import \
        mathbench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mbpp.sanitized_mbpp_gen_742f0c import sanitized_mbpp_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mmlu.mmlu_ppl_ac766d import mmlu_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_few_shot_gen_bfaf90 import mmlu_pro_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.nq.nq_open_1shot_gen_20a989 import nq_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.race.race_few_shot_ppl import race_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_few_shot_ppl import \
        BoolQ_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.TheoremQA.TheoremQA_5shot_gen_6f0af8 import TheoremQA_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.triviaqa.triviaqa_wiki_1shot_gen_20a989 import \
        triviaqa_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.wikibench.wikibench_few_shot_ppl_c23d79 import \
        wikibench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.winogrande.winogrande_5shot_ll_252f01 import \
        winogrande_datasets  # noqa: F401, E501
    # Summary Groups
    from opencompass.configs.summarizers.groups.cmmlu import cmmlu_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.GaokaoBench import GaokaoBench_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.mathbench_v1_2024 import \
        mathbench_2024_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.mmlu import mmlu_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.mmlu_pro import mmlu_pro_summary_groups  # noqa: F401, E501

# keep a single race dataset and a subset of the mmlu subjects
race_datasets = [race_datasets[1]]
mmlu_datasets = [
    x for x in mmlu_datasets if x['abbr'].replace('lukaemon_mmlu_', '') in [
        'business_ethics', 'clinical_knowledge', 'college_medicine', 'global_facts', 'human_aging', 'management',
        'marketing', 'medical_genetics', 'miscellaneous', 'nutrition', 'professional_accounting',
        'professional_medicine', 'virology'
    ]
]

summarizer = dict(
    dataset_abbrs=[
        ['race-high', 'accuracy'],
        ['ARC-c', 'accuracy'],
        ['BoolQ', 'accuracy'],
        ['mmlu_pro', 'naive_average'],
        ['GPQA_diamond', 'accuracy'],
        ['cmmlu', 'naive_average'],
        ['mmlu', 'naive_average'],
        ['drop', 'accuracy'],
        ['bbh', 'naive_average'],
        ['math', 'accuracy'],
        ['openai_humaneval', 'humaneval_pass@1'],
        ['openai_humaneval_v2', 'humaneval_pass@1'],
        ['sanitized_mbpp', 'score'],
        ['wikibench-wiki-single_choice_cncircular', 'perf_4'],
        ['gsm8k', 'accuracy'],
        ['GaokaoBench', 'weighted_average'],
        ['triviaqa_wiki_1shot', 'score'],
        ['nq_open_1shot', 'score'],
        ['winogrande', 'accuracy'],
        ['hellaswag', 'accuracy'],
        ['TheoremQA', 'score'],
        '###### MathBench-A: Application Part ######',
        'college',
        'high',
        'middle',
        'primary',
        'arithmetic',
        'mathbench-a (average)',
        '###### MathBench-T: Theory Part ######',
        'college_knowledge',
        'high_knowledge',
        'middle_knowledge',
        'primary_knowledge',
        'mathbench-t (average)',
        '###### Overall: Average between MathBench-A and MathBench-T ######',
        'Overall',
        '',
        'mmlu',
        'mmlu-stem',
        'mmlu-social-science',
        'mmlu-humanities',
        'mmlu-other',
        'cmmlu',
        'cmmlu-stem',
        'cmmlu-social-science',
        'cmmlu-humanities',
        'cmmlu-other',
        'cmmlu-china-specific',
        'mmlu_pro',
        'mmlu_pro_biology',
        'mmlu_pro_business',
        'mmlu_pro_chemistry',
        'mmlu_pro_computer_science',
        'mmlu_pro_economics',
        'mmlu_pro_engineering',
        'mmlu_pro_health',
        'mmlu_pro_history',
        'mmlu_pro_law',
        'mmlu_pro_math',
        'mmlu_pro_philosophy',
        'mmlu_pro_physics',
        'mmlu_pro_psychology',
        'mmlu_pro_other',
    ],
    summary_groups=sum([v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)

base_model = dict(
    type=TurboMindModel,
    engine_config=dict(session_len=7168, tp=1),
    gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=1024),
    max_seq_len=7168,
    max_out_len=1024,
    batch_size=32,
    run_cfg=dict(num_gpus=1),
)

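# deepcopy each entry so that edits to nested dicts (run_cfg, engine_config) do not leak into base_model
turbomind_qwen2_5_1_5b = deepcopy(base_model)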
turbomind_qwen2_5_1_5b['path'] = 'Qwen/Qwen2.5-1.5B'
turbomind_qwen2_5_1_5b['abbr'] = 'turbomind_qwen2_5_1_5b'
turbomind_qwen2_5_7b = deepcopy(base_model)
turbomind_qwen2_5_7b['path'] = 'Qwen/Qwen2.5-7B'
turbomind_qwen2_5_7b['abbr'] = 'turbomind_qwen2_5_7b'
turbomind_qwen2_5_32b = deepcopy(base_model)
turbomind_qwen2_5_32b['path'] = 'Qwen/Qwen2.5-32B'
turbomind_qwen2_5_32b['abbr'] = 'turbomind_qwen2_5_32b'
turbomind_qwen2_5_32b['run_cfg']['num_gpus'] = 2
turbomind_qwen2_5_32b['engine_config']['tp'] = 2
turbomind_internlm2_5_7b = deepcopy(base_model)
turbomind_internlm2_5_7b['path'] = 'internlm/internlm2_5-7b-chat'
turbomind_internlm2_5_7b['abbr'] = 'turbomind_internlm2_5_7b'
turbomind_glm_4_9b = deepcopy(base_model)
turbomind_glm_4_9b['path'] = 'THUDM/glm-4-9b'
turbomind_glm_4_9b['abbr'] = 'turbomind_glm_4_9b'
turbomind_llama_3_70b = deepcopy(base_model)
turbomind_llama_3_70b['path'] = 'meta-llama/Meta-Llama-3-70B'
turbomind_llama_3_70b['abbr'] = 'turbomind_llama_3_70b'
turbomind_llama_3_70b['run_cfg']['num_gpus'] = 4
turbomind_llama_3_70b['engine_config']['tp'] = 4
turbomind_llama_3_1_8b = deepcopy(base_model)
turbomind_llama_3_1_8b['path'] = 'meta-llama/Llama-3.1-8B'
turbomind_llama_3_1_8b['abbr'] = 'turbomind_llama_3_1_8b'
turbomind_qwen3_0_6b_base = deepcopy(base_model)
turbomind_qwen3_0_6b_base['path'] = 'Qwen/Qwen3-0.6B-Base'
turbomind_qwen3_0_6b_base['abbr'] = 'turbomind_qwen3_0_6b_base'
turbomind_qwen3_8b_base = deepcopy(base_model)
turbomind_qwen3_8b_base['path'] = 'Qwen/Qwen3-8B-Base'
turbomind_qwen3_8b_base['abbr'] = 'turbomind_qwen3_8b_base'
turbomind_qwen3_30b_A3B_base = deepcopy(base_model)
turbomind_qwen3_30b_A3B_base['path'] = 'Qwen/Qwen3-30B-A3B-Base'
turbomind_qwen3_30b_A3B_base['abbr'] = 'turbomind_qwen3_30b_A3B_base'
turbomind_qwen3_30b_A3B_base['run_cfg']['num_gpus'] = 2
turbomind_qwen3_30b_A3B_base['engine_config']['tp'] = 2

pytorch_qwen2_5_1_5b = deepcopy(base_model)
pytorch_qwen2_5_1_5b['path'] = 'Qwen/Qwen2.5-1.5B'
pytorch_qwen2_5_1_5b['abbr'] = 'pytorch_qwen2_5_1_5b'
pytorch_qwen2_5_7b = deepcopy(base_model)
pytorch_qwen2_5_7b['path'] = 'Qwen/Qwen2.5-7B'
pytorch_qwen2_5_7b['abbr'] = 'pytorch_qwen2_5_7b'
pytorch_qwen2_5_32b = deepcopy(base_model)
pytorch_qwen2_5_32b['path'] = 'Qwen/Qwen2.5-32B'
pytorch_qwen2_5_32b['abbr'] = 'pytorch_qwen2_5_32b'
pytorch_qwen2_5_32b['run_cfg']['num_gpus'] = 2
pytorch_qwen2_5_32b['engine_config']['tp'] = 2
pytorch_internlm2_5_7b = deepcopy(base_model)
pytorch_internlm2_5_7b['path'] = 'internlm/internlm2_5-7b-chat'
pytorch_internlm2_5_7b['abbr'] = 'pytorch_internlm2_5_7b'
pytorch_gemma_2_9b = deepcopy(base_model)
pytorch_gemma_2_9b['path'] = 'google/gemma-2-9b'
pytorch_gemma_2_9b['abbr'] = 'pytorch_gemma_2_9b'
pytorch_llama_3_70b = deepcopy(base_model)
pytorch_llama_3_70b['path'] = 'meta-llama/Meta-Llama-3-70B'
pytorch_llama_3_70b['abbr'] = 'pytorch_llama_3_70b'
pytorch_llama_3_70b['run_cfg']['num_gpus'] = 4
pytorch_llama_3_70b['engine_config']['tp'] = 4
pytorch_llama_3_1_8b = deepcopy(base_model)
pytorch_llama_3_1_8b['path'] = 'meta-llama/Llama-3.1-8B'
pytorch_llama_3_1_8b['abbr'] = 'pytorch_llama_3_1_8b'
pytorch_qwen3_0_6b_base = deepcopy(base_model)
pytorch_qwen3_0_6b_base['path'] = 'Qwen/Qwen3-0.6B-Base'
pytorch_qwen3_0_6b_base['abbr'] = 'pytorch_qwen3_0_6b_base'
pytorch_qwen3_8b_base = deepcopy(base_model)
pytorch_qwen3_8b_base['path'] = 'Qwen/Qwen3-8B-Base'
pytorch_qwen3_8b_base['abbr'] = 'pytorch_qwen3_8b_base'
pytorch_qwen3_30b_A3B_base = deepcopy(base_model)
pytorch_qwen3_30b_A3B_base['path'] = 'Qwen/Qwen3-30B-A3B-Base'
pytorch_qwen3_30b_A3B_base['abbr'] = 'pytorch_qwen3_30b_A3B_base'
pytorch_qwen3_30b_A3B_base['run_cfg']['num_gpus'] = 2
pytorch_qwen3_30b_A3B_base['engine_config']['tp'] = 2

for model in [v for k, v in locals().items() if k.startswith('pytorch_')]:
    model['backend'] = 'pytorch'


================================================
FILE: .github/scripts/eval_chat_config.py
================================================
from copy import deepcopy

from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.utils.text_postprocessors import extract_non_reasoning_content

with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.bbh.bbh_gen_5b92b0 import bbh_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.ceval.ceval_gen_2daf24 import ceval_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.cmmlu.cmmlu_gen_c13365 import cmmlu_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.crowspairs.crowspairs_gen_381af0 import crowspairs_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.GaokaoBench.GaokaoBench_no_subjective_gen_4c31db import \
        GaokaoBench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.gpqa.gpqa_gen_4baadb import gpqa_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.hellaswag.hellaswag_10shot_gen_e42710 import \
        hellaswag_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.IFEval.IFEval_gen_3321a3 import ifeval_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.math.math_0shot_gen_393424 import math_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mbpp.sanitized_mbpp_gen_a0fc46 import sanitized_mbpp_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mmlu.mmlu_gen_4d595a import mmlu_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_cot_gen_08c1de import \
        mmlu_pro_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.nq.nq_open_1shot_gen_01cf41 import nq_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.race.race_gen_69ee4f import race_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.TheoremQA.TheoremQA_5shot_gen_6f0af8 import TheoremQA_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.triviaqa.triviaqa_wiki_1shot_gen_eaf81e import \
        triviaqa_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.winogrande.winogrande_5shot_gen_b36770 import \
        winogrande_datasets  # noqa: F401, E501
    # read models
    from opencompass.configs.models.baichuan.hf_baichuan2_7b_chat import \
        models as hf_baichuan2_chat_7b  # noqa: F401, E501
    from opencompass.configs.models.gemma.hf_gemma2_9b_it import models as hf_gemma2_9b_it  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b_chat import \
        models as hf_internlm2_5_7b_chat  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.hf_internlm2_5_20b_chat import \
        models as hf_internlm2_5_20b_chat  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import \
        models as hf_internlm2_chat_7b  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_20b import \
        models as hf_internlm2_chat_20b  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
        models as lmdeploy_internlm2_5_7b_chat  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_20b_chat import \
        models as lmdeploy_internlm2_5_20b_chat  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b import \
        models as lmdeploy_internlm2_chat_7b  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_20b import \
        models as lmdeploy_internlm2_chat_20b  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm3_8b_instruct import \
        models as lmdeploy_internlm3_8b_instruct  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm_chat_7b import \
        models as lmdeploy_internlm_chat_7b  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.hf_llama2_7b_chat import models as hf_llama2_chat_7b  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.hf_llama3_1_8b_instruct import \
        models as hf_llama3_1_8b_instruct  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.hf_llama3_8b_instruct import \
        models as hf_llama_3_8b_instruct  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.lmdeploy_llama2_7b_chat import \
        models as lmdeploy_llama2_7b_chat  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import \
        models as lmdeploy_llama3_1_8b_instruct  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b_instruct import \
        models as lmdeploy_llama3_8b_instruct  # noqa: F401, E501
    from opencompass.configs.models.mistral.hf_mistral_7b_instruct_v0_1 import \
        models as hf_mistral_chat_7b  # noqa: F401, E501
    from opencompass.configs.models.mistral.hf_mixtral_8x7b_instruct_v0_1 import \
        models as hf_mixtral_chat_8x7b  # noqa: F401, E501
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import \
        models as lmdeploy_qwen2_5_7b_instruct  # noqa: F401, E501
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_32b_instruct import \
        models as lmdeploy_qwen2_5_32b_instruct  # noqa: F401, E501
    from opencompass.configs.models.qwen.hf_qwen1_5_7b_chat import models as hf_qwen1_5_chat_7b  # noqa: F401, E501
    from opencompass.configs.models.qwen.hf_qwen1_5_moe_a2_7b_chat import \
        models as hf_qwen1_5_moe_a2_7b_chat  # noqa: F401, E501
    from opencompass.configs.models.qwen.hf_qwen2_7b_instruct import models as hf_qwen2_7b_instruct  # noqa: F401, E501
    from opencompass.configs.models.qwen.hf_qwen_7b_chat import models as hf_qwen_chat_7b  # noqa: F401, E501
    from opencompass.configs.models.qwen.lmdeploy_qwen1_5_7b_chat import \
        models as lmdeploy_qwen1_5_7b_chat  # noqa: F401, E501
    from opencompass.configs.models.qwen.lmdeploy_qwen2_7b_instruct import \
        models as lmdeploy_qwen2_7b_instruct  # noqa: F401, E501
    from opencompass.configs.models.qwen.lmdeploy_qwen_7b_chat import \
        models as lmdeploy_qwen_7b_chat  # noqa: F401, E501
    # Summary Groups
    from opencompass.configs.summarizers.groups.bbh import bbh_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.cmmlu import cmmlu_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.ds1000 import ds1000_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.GaokaoBench import GaokaoBench_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.humanevalx import humanevalx_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.mathbench_v1_2024 import \
        mathbench_2024_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.mmlu import mmlu_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.mmlu_pro import mmlu_pro_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.scicode import scicode_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.teval import teval_summary_groups  # noqa: F401, E501

llama2_meta_template = dict(
    round=[
        dict(role='HUMAN', begin='[INST] ', end=' [/INST]'),
        dict(role='BOT', begin='', end='', generate=True),
    ],
    eos_token_id=2,
)

MAX_SESSION_LEN = 2048
MAX_NEW_TOKENS = 1024

# ===== Configs for internlm/internlm2-chat-7b =====
turbomind_internlm2_chat_7b = deepcopy(*lmdeploy_internlm2_chat_7b)
turbomind_internlm2_chat_7b_4bits = deepcopy(*lmdeploy_internlm2_chat_7b)
turbomind_internlm2_chat_7b_kvint4 = deepcopy(*lmdeploy_internlm2_chat_7b)
turbomind_internlm2_chat_7b_kvint8 = deepcopy(*lmdeploy_internlm2_chat_7b)
pytorch_internlm2_chat_7b = deepcopy(*lmdeploy_internlm2_chat_7b)

# ===== Configs for internlm/internlm2_5-7b-chat =====
turbomind_internlm2_5_7b_chat = deepcopy(*lmdeploy_internlm2_5_7b_chat)
turbomind_internlm2_5_7b_chat_4bits = deepcopy(*lmdeploy_internlm2_5_7b_chat)
turbomind_internlm2_5_7b_chat_kvint4 = deepcopy(*lmdeploy_internlm2_5_7b_chat)
turbomind_internlm2_5_7b_chat_kvint8 = deepcopy(*lmdeploy_internlm2_5_7b_chat)
pytorch_internlm2_5_7b_chat = deepcopy(*lmdeploy_internlm2_5_7b_chat)
pytorch_internlm2_5_7b_chat_w8a8 = deepcopy(*lmdeploy_internlm2_5_7b_chat)
turbomind_internlm2_5_7b_chat_batch1 = deepcopy(*lmdeploy_internlm2_5_7b_chat)
turbomind_internlm2_5_7b_chat_batch1_4bits = deepcopy(*lmdeploy_internlm2_5_7b_chat)

# ===== Configs for internlm/internlm3-8b-instruct =====
turbomind_internlm3_8b_instruct = deepcopy(*lmdeploy_internlm3_8b_instruct)
turbomind_internlm3_8b_instruct_4bits = deepcopy(*lmdeploy_internlm3_8b_instruct)
turbomind_internlm3_8b_instruct_kvint4 = deepcopy(*lmdeploy_internlm3_8b_instruct)
turbomind_internlm3_8b_instruct_kvint8 = deepcopy(*lmdeploy_internlm3_8b_instruct)
pytorch_internlm3_8b_instruct = deepcopy(*lmdeploy_internlm3_8b_instruct)
pytorch_internlm3_8b_instruct_w8a8 = deepcopy(*lmdeploy_internlm3_8b_instruct)

# ===== Configs for internlm/internlm2_5-20b-chat =====
turbomind_internlm2_5_20b_chat = deepcopy(*lmdeploy_internlm2_5_20b_chat)
turbomind_internlm2_5_20b_chat_4bits = deepcopy(*lmdeploy_internlm2_5_20b_chat)
turbomind_internlm2_5_20b_chat_kvint4 = deepcopy(*lmdeploy_internlm2_5_20b_chat)
turbomind_internlm2_5_20b_chat_kvint8 = deepcopy(*lmdeploy_internlm2_5_20b_chat)
pytorch_internlm2_5_20b_chat = deepcopy(*lmdeploy_internlm2_5_20b_chat)

# ===== Configs for internlm/internlm2-chat-20b =====
turbomind_internlm2_chat_20b = deepcopy(*lmdeploy_internlm2_chat_20b)
turbomind_internlm2_chat_20b_4bits = deepcopy(*lmdeploy_internlm2_chat_20b)
turbomind_internlm2_chat_20b_kvint4 = deepcopy(*lmdeploy_internlm2_chat_20b)
turbomind_internlm2_chat_20b_kvint8 = deepcopy(*lmdeploy_internlm2_chat_20b)
pytorch_internlm2_chat_20b = deepcopy(*lmdeploy_internlm2_chat_20b)

# ===== Configs for Qwen/Qwen1.5-7B-Chat =====
turbomind_qwen1_5_7b_chat = deepcopy(*lmdeploy_qwen1_5_7b_chat)
turbomind_qwen1_5_7b_chat_4bits = deepcopy(*lmdeploy_qwen1_5_7b_chat)
turbomind_qwen1_5_7b_chat_kvint4 = deepcopy(*lmdeploy_qwen1_5_7b_chat)
turbomind_qwen1_5_7b_chat_kvint8 = deepcopy(*lmdeploy_qwen1_5_7b_chat)
pytorch_qwen1_5_7b_chat = deepcopy(*lmdeploy_qwen1_5_7b_chat)

# ===== Configs for Qwen/Qwen-7B-Chat =====
turbomind_qwen_7b_chat = deepcopy(*lmdeploy_qwen_7b_chat)
turbomind_qwen_7b_chat_4bits = deepcopy(*lmdeploy_qwen_7b_chat)
turbomind_qwen_7b_chat_kvint4 = deepcopy(*lmdeploy_qwen_7b_chat)
turbomind_qwen_7b_chat_kvint8 = deepcopy(*lmdeploy_qwen_7b_chat)
pytorch_qwen_7b_chat = deepcopy(*lmdeploy_qwen_7b_chat)

# ===== Configs for meta-llama/Meta-Llama-3-8B-Instruct =====
turbomind_llama3_8b_instruct = deepcopy(*lmdeploy_llama3_8b_instruct)
turbomind_llama3_8b_instruct_4bits = deepcopy(*lmdeploy_llama3_8b_instruct)
turbomind_llama3_8b_instruct_kvint4 = deepcopy(*lmdeploy_llama3_8b_instruct)
turbomind_llama3_8b_instruct_kvint8 = deepcopy(*lmdeploy_llama3_8b_instruct)
pytorch_llama3_8b_instruct = deepcopy(*lmdeploy_llama3_8b_instruct)

# ===== Configs for meta-llama/Meta-Llama-3.1-8B-Instruct =====
turbomind_llama3_1_8b_instruct = deepcopy(*lmdeploy_llama3_1_8b_instruct)
turbomind_llama3_1_8b_instruct['path'] = 'meta-llama/Meta-Llama-3-1-8B-Instruct'
turbomind_llama3_1_8b_instruct_4bits = deepcopy(turbomind_llama3_1_8b_instruct)
turbomind_llama3_1_8b_instruct_kvint4 = deepcopy(turbomind_llama3_1_8b_instruct)
turbomind_llama3_1_8b_instruct_kvint8 = deepcopy(turbomind_llama3_1_8b_instruct)
pytorch_llama3_1_8b_instruct = deepcopy(turbomind_llama3_1_8b_instruct)
pytorch_llama3_1_8b_instruct_w8a8 = deepcopy(turbomind_llama3_1_8b_instruct)

# ===== Configs for Qwen/Qwen2-7B-Instruct =====
turbomind_qwen2_7b_instruct = deepcopy(*lmdeploy_qwen2_7b_instruct)
turbomind_qwen2_7b_instruct_4bits = deepcopy(*lmdeploy_qwen2_7b_instruct)
turbomind_qwen2_7b_instruct_kvint4 = deepcopy(*lmdeploy_qwen2_7b_instruct)
turbomind_qwen2_7b_instruct_kvint8 = deepcopy(*lmdeploy_qwen2_7b_instruct)
pytorch_qwen2_7b_instruct = deepcopy(*lmdeploy_qwen2_7b_instruct)
pytorch_qwen2_7b_instruct_w8a8 = deepcopy(*lmdeploy_qwen2_7b_instruct)

# ===== Configs for Qwen/Qwen2.5-7B-Instruct =====
turbomind_qwen2_5_7b_instruct = deepcopy(*lmdeploy_qwen2_5_7b_instruct)
turbomind_qwen2_5_7b_instruct_4bits = deepcopy(*lmdeploy_qwen2_5_7b_instruct)
turbomind_qwen2_5_7b_instruct_kvint4 = deepcopy(*lmdeploy_qwen2_5_7b_instruct)
turbomind_qwen2_5_7b_instruct_kvint8 = deepcopy(*lmdeploy_qwen2_5_7b_instruct)
pytorch_qwen2_5_7b_instruct = deepcopy(*lmdeploy_qwen2_5_7b_instruct)
pytorch_qwen2_5_7b_instruct_w8a8 = deepcopy(*lmdeploy_qwen2_5_7b_instruct)

# ===== Configs for Qwen/Qwen2.5-32B-Instruct =====
turbomind_qwen2_5_32b_instruct = deepcopy(*lmdeploy_qwen2_5_32b_instruct)
turbomind_qwen2_5_32b_instruct_4bits = deepcopy(*lmdeploy_qwen2_5_32b_instruct)
turbomind_qwen2_5_32b_instruct_kvint4 = deepcopy(*lmdeploy_qwen2_5_32b_instruct)
turbomind_qwen2_5_32b_instruct_kvint8 = deepcopy(*lmdeploy_qwen2_5_32b_instruct)
pytorch_qwen2_5_32b_instruct = deepcopy(*lmdeploy_qwen2_5_32b_instruct)
pytorch_qwen2_5_32b_instruct_w8a8 = deepcopy(*lmdeploy_qwen2_5_32b_instruct)

# ===== Configs for meta-llama/Llama-2-7b-chat-hf =====
turbomind_llama2_7b_chat = deepcopy(*lmdeploy_llama2_7b_chat)
turbomind_llama2_7b_chat_4bits = deepcopy(*lmdeploy_llama2_7b_chat)
turbomind_llama2_7b_chat_kvint4 = deepcopy(*lmdeploy_llama2_7b_chat)
turbomind_llama2_7b_chat_kvint8 = deepcopy(*lmdeploy_llama2_7b_chat)

base_model = dict(type=TurboMindModelwithChatTemplate,
                  engine_config=dict(session_len=32768, max_batch_size=256),
                  gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=32768),
                  max_seq_len=32768,
                  max_out_len=32768,
                  batch_size=500,
                  pred_postprocessor=dict(type=extract_non_reasoning_content),
                  run_cfg=dict(num_gpus=1))

turbomind_qwen3_32b = deepcopy(base_model)
pytorch_qwen3_32b = deepcopy(base_model)
turbomind_qwen3_32b_4bits = deepcopy(base_model)
turbomind_qwen3_32b_kvint8 = deepcopy(base_model)

turbomind_qwen3_30b_a3b = deepcopy(base_model)
pytorch_qwen3_30b_a3b = deepcopy(base_model)
turbomind_qwen3_30b_a3b_4bits = deepcopy(base_model)
turbomind_qwen3_30b_a3b_kvint8 = deepcopy(base_model)
turbomind_qwen3_30b_a3b_fp8 = deepcopy(base_model)
pytorch_qwen3_30b_a3b_fp8 = deepcopy(base_model)
turbomind_qwen3_30b_a3b_fp8['engine_config']['cache_max_entry_count'] = 0.6

turbomind_qwen3_235b_a22b = deepcopy(base_model)
pytorch_qwen3_235b_a22b = deepcopy(base_model)
turbomind_qwen3_235b_a22b_4bits = deepcopy(base_model)
turbomind_qwen3_235b_a22b_kvint8 = deepcopy(base_model)
turbomind_qwen3_235b_a22b_fp8 = deepcopy(base_model)
pytorch_qwen3_235b_a22b_fp8 = deepcopy(base_model)

# update config for Qwen3-32B, Qwen3-30B-A3B, Qwen3-235B-A22B
for model in [
        v for k, v in locals().items() if k.startswith('turbomind_qwen3_32b') or k.startswith('pytorch_qwen3_32b')
]:
    model['abbr'] = 'qwen3_32b_turbomind'
    model['path'] = 'Qwen/Qwen3-32B'

for model in [
        v for k, v in locals().items()
        if k.startswith('turbomind_qwen3_30b_a3b') or k.startswith('pytorch_qwen3_30b_a3b')
]:
    model['abbr'] = 'qwen3_30b_a3b_turbomind'
    model['path'] = 'Qwen/Qwen3-30B-A3B'

for model in [
        v for k, v in locals().items()
        if k.startswith('turbomind_qwen3_30b_a3b_fp8') or k.startswith('pytorch_qwen3_30b_a3b_fp8')
]:
    model['abbr'] = 'qwen3_30b_a3b_fp8_turbomind'
    model['path'] = 'Qwen/Qwen3-30B-A3B-FP8'

for model in [
        v for k, v in locals().items()
        if k.startswith('turbomind_qwen3_235b_a22b') or k.startswith('pytorch_qwen3_235b_a22b')
]:
    model['abbr'] = 'qwen3_235b_a22b_turbomind'
    model['path'] = 'Qwen/Qwen3-235B-A22B'

for model in [
        v for k, v in locals().items()
        if k.startswith('turbomind_qwen3_235b_a22b_fp8') or k.startswith('pytorch_qwen3_235b_a22b_fp8')
]:
    model['abbr'] = 'qwen3_235b_a22b_fp8_turbomind'
    model['path'] = 'Qwen/Qwen3-235B-A22B-FP8'

# update config for turbomind, 4bits (AWQ), w8a8, kvint4, kvint8, pytorch, and batch1 models
for model in [v for k, v in locals().items() if k.startswith('turbomind_')]:
    model['engine_config']['max_batch_size'] = 512
    model['gen_config']['do_sample'] = False
    model['batch_size'] = 1000

for model in [v for k, v in locals().items() if k.endswith('_4bits')]:
    model['engine_config']['model_format'] = 'awq'
    model['abbr'] = model['abbr'] + '_4bits'
    model['path'] = model['path'] + '-inner-4bits'

for model in [v for k, v in locals().items() if k.endswith('_w8a8')]:
    model['abbr'] = model['abbr'] + '_w8a8'
    model['path'] = model['path'] + '-inner-w8a8'

for model in [v for k, v in locals().items() if k.endswith('_kvint4')]:
    model['engine_config']['quant_policy'] = 4
    model['abbr'] = model['abbr'] + '_kvint4'

for model in [v for k, v in locals().items() if k.endswith('_kvint8')]:
    model['engine_config']['quant_policy'] = 8
    model['abbr'] = model['abbr'] + '_kvint8'

for model in [v for k, v in locals().items() if k.startswith('pytorch_')]:
    model['abbr'] = model['abbr'].replace('turbomind', 'pytorch')
    model['backend'] = 'pytorch'
    model['engine_config']['max_batch_size'] = 512
    model['gen_config']['do_sample'] = False
    model['batch_size'] = 1000

for model in [v for k, v in locals().items() if '_batch1' in k]:
    model['abbr'] = model['abbr'] + '_batch1'
    model['engine_config']['max_batch_size'] = 1
    model['batch_size'] = 1

# update GPU count, tensor parallelism, and batch sizes for Qwen3-32B, Qwen3-30B-A3B, Qwen3-235B-A22B
for model in [
        v for k, v in locals().items() if k.startswith('turbomind_qwen3_32b') or k.startswith('pytorch_qwen3_32b')
]:
    model['run_cfg']['num_gpus'] = 2
    model['engine_config']['tp'] = 2
    model['engine_config']['max_batch_size'] = 1024
    model['batch_size'] = 2048

for model in [
        v for k, v in locals().items()
        if k.startswith('turbomind_qwen3_30b_a3b') or k.startswith('pytorch_qwen3_30b_a3b')
]:
    model['run_cfg']['num_gpus'] = 2
    model['engine_config']['tp'] = 2
    model['engine_config']['max_batch_size'] = 1024
    model['batch_size'] = 2048

for model in [
        v for k, v in locals().items()
        if k.startswith('turbomind_qwen3_235b_a22b') or k.startswith('pytorch_qwen3_235b_a22b')
]:
    model['run_cfg']['num_gpus'] = 8
    model['engine_config']['tp'] = 8
    model['engine_config']['max_batch_size'] = 1024
    model['batch_size'] = 2048

turbomind_qwen3_235b_a22b_fp8['engine_config']['cache_max_entry_count'] = 0.6
turbomind_qwen3_235b_a22b_fp8['engine_config']['tp'] = 4
turbomind_qwen3_235b_a22b_fp8['run_cfg']['num_gpus'] = 4
pytorch_qwen3_235b_a22b_fp8['engine_config']['tp'] = 4
pytorch_qwen3_235b_a22b_fp8['run_cfg']['num_gpus'] = 4

basic_pytorch_chat_tp1 = dict(type=TurboMindModelwithChatTemplate,
                              engine_config=dict(session_len=MAX_SESSION_LEN, max_batch_size=512, tp=1),
                              gen_config=dict(do_sample=False, max_new_tokens=MAX_NEW_TOKENS),
                              max_out_len=MAX_NEW_TOKENS,
                              max_seq_len=MAX_SESSION_LEN,
                              batch_size=1000,
                              run_cfg=dict(num_gpus=1))

# ===== Configs for Qwen/Qwen1.5-MoE-A2.7B-Chat =====
pytorch_qwen1_5_moe_2_7b_chat = deepcopy(basic_pytorch_chat_tp1)
pytorch_qwen1_5_moe_2_7b_chat['abbr'] = 'pytorch_qwen1_5_moe_2_7b_chat'
pytorch_qwen1_5_moe_2_7b_chat['path'] = 'Qwen/Qwen1.5-MoE-A2.7B-Chat'

# ===== Configs for google/gemma-2-9b-it =====
pytorch_gemma_2_9b_it = deepcopy(basic_pytorch_chat_tp1)
pytorch_gemma_2_9b_it['abbr'] = 'pytorch_gemma_2_9b_it'
pytorch_gemma_2_9b_it['path'] = 'google/gemma-2-9b-it'

# ===== Configs for google/gemma-2-27b-it =====
pytorch_gemma_2_27b_it = deepcopy(basic_pytorch_chat_tp1)
pytorch_gemma_2_27b_it['abbr'] = 'pytorch_gemma_2_27b_it'
pytorch_gemma_2_27b_it['path'] = 'google/gemma-2-27b-it'
pytorch_gemma_2_27b_it['run_cfg']['num_gpus'] = 2
pytorch_gemma_2_27b_it['engine_config']['tp'] = 2

race_datasets = [race_datasets[1]]

# Summarizer
summarizer = dict(
    dataset_abbrs=[
        ['race-high', 'accuracy'],
        ['ARC-c', 'accuracy'],
        ['BoolQ', 'accuracy'],
        ['mmlu_pro', 'naive_average'],
        ['drop', 'accuracy'],
        ['bbh', 'naive_average'],
        ['GPQA_diamond', 'accuracy'],
        ['math', 'accuracy'],
        ['wikibench-wiki-single_choice_cncircular', 'perf_4'],
        ['openai_humaneval', 'humaneval_pass@1'],
        ['sanitized_mbpp', 'score'],
        ['cmmlu', 'naive_average'],
        ['mmlu', 'naive_average'],
        ['teval', 'naive_average'],
        ['SciCode', 'accuracy'],
        ['SciCode', 'sub_accuracy'],
        ['humanevalx', 'naive_average'],
        ['ds1000', 'naive_average'],
        ['IFEval', 'Prompt-level-strict-accuracy'],
        ['gsm8k', 'accuracy'],
        ['GaokaoBench', 'weighted_average'],
        ['triviaqa_wiki_1shot', 'score'],
        ['nq_open_1shot', 'score'],
        ['hellaswag', 'accuracy'],
        ['TheoremQA', 'score'],
        '###### MathBench-A: Application Part ######',
        'college',
        'high',
        'middle',
        'primary',
        'arithmetic',
        'mathbench-a (average)',
        '###### MathBench-T: Theory Part ######',
        'college_knowledge',
        'high_knowledge',
        'middle_knowledge',
        'primary_knowledge',
        'mathbench-t (average)',
        '###### Overall: Average between MathBench-A and MathBench-T ######',
        'Overall',
        '',
        'mmlu',
        'mmlu-stem',
        'mmlu-social-science',
        'mmlu-humanities',
        'mmlu-other',
        '',
        'cmmlu',
        'cmmlu-stem',
        'cmmlu-social-science',
        'cmmlu-humanities',
        'cmmlu-other',
        'cmmlu-china-specific',
        '',
        'mmlu_pro',
        'mmlu_pro_biology',
        'mmlu_pro_business',
        'mmlu_pro_chemistry',
        'mmlu_pro_computer_science',
        'mmlu_pro_economics',
        'mmlu_pro_engineering',
        'mmlu_pro_health',
        'mmlu_pro_history',
        'mmlu_pro_law',
        'mmlu_pro_math',
        'mmlu_pro_philosophy',
        'mmlu_pro_physics',
        'mmlu_pro_psychology',
        'mmlu_pro_other',
        '',
        'humanevalx-python',
        'humanevalx-cpp',
        'humanevalx-go',
        'humanevalx-java',
        'humanevalx-js',
        '',
        'ds1000_Pandas',
        'ds1000_Numpy',
        'ds1000_Tensorflow',
        'ds1000_Scipy',
        'ds1000_Sklearn',
        'ds1000_Pytorch',
        'ds1000_Matplotlib',
    ],
    summary_groups=sum([v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
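
The config above builds each variant by deep-copying a base dict and then patching it in passes keyed on the variable name's prefix or suffix. Below is a minimal sketch of that pattern, not the config itself: `base`, `configs`, `select`, and the `demo` names are hypothetical, and an explicit dict stands in for `locals()`. It shows why the order of the passes matters: suffixes are appended to `abbr` first, and the `pytorch_` pass rewrites `turbomind` afterwards.

```python
from copy import deepcopy

# Hypothetical base config; the keys mirror the ones used in the config above.
base = dict(abbr='demo_turbomind', path='org/Demo', engine_config=dict(max_batch_size=256))

# Explicit namespace standing in for locals() in the real config.
configs = {
    'turbomind_demo': deepcopy(base),
    'turbomind_demo_4bits': deepcopy(base),
    'turbomind_demo_kvint8': deepcopy(base),
    'pytorch_demo': deepcopy(base),
}


def select(pred):
    """Return the config dicts whose variable name satisfies pred."""
    return [v for k, v in configs.items() if pred(k)]


# Pass 1: weight-quantized variants get a model format and a path suffix.
for m in select(lambda k: k.endswith('_4bits')):
    m['engine_config']['model_format'] = 'awq'
    m['abbr'] += '_4bits'
    m['path'] += '-inner-4bits'

# Pass 2: KV-cache quantization only changes the engine policy and the abbr.
for m in select(lambda k: k.endswith('_kvint8')):
    m['engine_config']['quant_policy'] = 8
    m['abbr'] += '_kvint8'

# Pass 3: pytorch variants swap the backend name inside the abbr.
for m in select(lambda k: k.startswith('pytorch_')):
    m['abbr'] = m['abbr'].replace('turbomind', 'pytorch')
    m['backend'] = 'pytorch'

for name, cfg in configs.items():
    print(name, '->', cfg['abbr'])
```

Because every entry is a `deepcopy`, patching a nested `engine_config` in one variant leaves the base and the other variants untouched.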


================================================
FILE: .github/scripts/eval_regression_base_models.py
================================================
from copy import deepcopy

from mmengine.config import read_base

with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.gpqa.gpqa_openai_simple_evals_gen_5aeece import gpqa_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.gsm8k.gsm8k_gen_17d0dc import gsm8k_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.race.race_ppl import race_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.winogrande.winogrande_5shot_ll_252f01 import \
        winogrande_datasets  # noqa: F401, E501
    # read hf models - base models
    from opencompass.configs.models.chatglm.lmdeploy_glm4_9b import models as lmdeploy_glm4_9b_model  # noqa: F401, E501
    from opencompass.configs.models.deepseek.lmdeploy_deepseek_7b_base import \
        models as lmdeploy_deepseek_7b_base_model  # noqa: F401, E501
    from opencompass.configs.models.deepseek.lmdeploy_deepseek_67b_base import \
        models as lmdeploy_deepseek_67b_base_model  # noqa: F401, E501
    from opencompass.configs.models.deepseek.lmdeploy_deepseek_v2 import lmdeploy_deepseek_v2_model  # noqa: F401, E501
    from opencompass.configs.models.gemma.lmdeploy_gemma_9b import models as pytorch_gemma_9b_model  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_1_8b import \
        models as lmdeploy_internlm2_1_8b_model  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b import \
        models as lmdeploy_internlm2_5_7b_model  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_20b import \
        models as lmdeploy_internlm2_20b_model  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_base_7b import \
        models as lmdeploy_internlm2_base_7b_model  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b import \
        models as lmdeploy_llama3_1_8b_model  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b import \
        models as lmdeploy_llama3_8b_model  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_70b import \
        models as lmdeploy_llama3_70b_model  # noqa: F401, E501
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_1_5b import \
        models as lmdeploy_qwen2_5_1_5b_model  # noqa: F401, E501
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b import \
        models as lmdeploy_qwen2_5_7b_model  # noqa: F401, E501
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_32b import \
        models as lmdeploy_qwen2_5_32b_model  # noqa: F401, E501
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b import \
        models as lmdeploy_qwen2_5_72b_model  # noqa: F401, E501
    from opencompass.configs.models.qwen.lmdeploy_qwen2_1_5b import \
        models as lmdeploy_qwen2_1_5b_model  # noqa: F401, E501
    from opencompass.configs.models.qwen.lmdeploy_qwen2_7b import models as lmdeploy_qwen2_7b_model  # noqa: F401, E501
    from opencompass.configs.models.yi.lmdeploy_yi_1_5_9b import models as lmdeploy_yi_1_5_9b_model  # noqa: F401, E501

    from .volc import infer as volc_infer  # noqa: F401, E501

race_datasets = [race_datasets[1]]
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])

pytorch_glm4_9b_model = deepcopy(lmdeploy_glm4_9b_model)
pytorch_deepseek_7b_base_model = deepcopy(lmdeploy_deepseek_7b_base_model)
pytorch_deepseek_67b_base_model = deepcopy(lmdeploy_deepseek_67b_base_model)
pytorch_deepseek_v2_model = deepcopy(lmdeploy_deepseek_v2_model)
pytorch_internlm2_5_7b_model = deepcopy(lmdeploy_internlm2_5_7b_model)
pytorch_internlm2_20b_model = deepcopy(lmdeploy_internlm2_20b_model)
pytorch_internlm2_base_7b_model = deepcopy(lmdeploy_internlm2_base_7b_model)
pytorch_llama3_1_8b_model = deepcopy(lmdeploy_llama3_1_8b_model)
pytorch_llama3_70b_model = deepcopy(lmdeploy_llama3_70b_model)
pytorch_qwen2_5_1_5b_model = deepcopy(lmdeploy_qwen2_5_1_5b_model)
pytorch_qwen2_5_72b_model = deepcopy(lmdeploy_qwen2_5_72b_model)
pytorch_qwen2_7b_model = deepcopy(lmdeploy_qwen2_7b_model)
pytorch_yi_1_5_9b_model = deepcopy(lmdeploy_yi_1_5_9b_model)
# each *_model is a list of model dicts, so apply the override per entry
for m in pytorch_deepseek_v2_model:
    m['engine_config']['cache_max_entry_count'] = 0.6

lmdeploy_glm4_9b_model_native = deepcopy(lmdeploy_glm4_9b_model)
lmdeploy_deepseek_7b_base_model_native = deepcopy(lmdeploy_deepseek_7b_base_model)
lmdeploy_deepseek_67b_base_model_native = deepcopy(lmdeploy_deepseek_67b_base_model)
lmdeploy_deepseek_v2_model_native = deepcopy(lmdeploy_deepseek_v2_model)
lmdeploy_internlm2_5_7b_model_native = deepcopy(lmdeploy_internlm2_5_7b_model)
lmdeploy_internlm2_20b_model_native = deepcopy(lmdeploy_internlm2_20b_model)
lmdeploy_internlm2_base_7b_model_native = deepcopy(lmdeploy_internlm2_base_7b_model)
lmdeploy_llama3_1_8b_model_native = deepcopy(lmdeploy_llama3_1_8b_model)
lmdeploy_llama3_70b_model_native = deepcopy(lmdeploy_llama3_70b_model)
lmdeploy_qwen2_5_1_5b_model_native = deepcopy(lmdeploy_qwen2_5_1_5b_model)
lmdeploy_qwen2_5_72b_model_native = deepcopy(lmdeploy_qwen2_5_72b_model)
lmdeploy_qwen2_7b_model_native = deepcopy(lmdeploy_qwen2_7b_model)
lmdeploy_yi_1_5_9b_model_native = deepcopy(lmdeploy_yi_1_5_9b_model)

for model in [v for k, v in locals().items() if k.startswith('lmdeploy_') or k.startswith('pytorch_')]:
    for m in model:
        m['engine_config']['max_batch_size'] = 512
        m['gen_config']['do_sample'] = False
        m['batch_size'] = 5000

for model in [v for k, v in locals().items() if k.startswith('lmdeploy_')]:
    for m in model:
        m['backend'] = 'turbomind'

for model in [v for k, v in locals().items() if k.startswith('pytorch_')]:
    for m in model:
        m['abbr'] = m['abbr'].replace('turbomind', 'pytorch').replace('lmdeploy', 'pytorch')
        m['backend'] = 'pytorch'

for model in [v for k, v in locals().items() if k.endswith('_native')]:
    for m in model:
        m['abbr'] = m['abbr'] + '_native'
        m['engine_config']['communicator'] = 'native'

# models = sum([v for k, v in locals().items() if  k.startswith('lmdeploy_') or k.startswith('pytorch_')], [])
# models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])

summarizer = dict(
    dataset_abbrs=[
        ['gsm8k', 'accuracy'],
        ['GPQA_diamond', 'accuracy'],
        ['race-high', 'accuracy'],
        ['winogrande', 'accuracy'],
    ],
    summary_groups=sum([v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
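
The two commented-out `models = ...` lines in this file flatten every matching model list into one list and order it by GPU count. A minimal sketch of that idiom follows, under stated assumptions: the model entries and the `namespace` dict are hypothetical stand-ins for the imported lists and for `locals()`.

```python
# Hypothetical model lists; each config module exposes a list of dicts.
namespace = {
    'lmdeploy_a_model': [dict(abbr='a', run_cfg=dict(num_gpus=4))],
    'pytorch_b_model': [dict(abbr='b', run_cfg=dict(num_gpus=1))],
    'lmdeploy_c_model': [dict(abbr='c', run_cfg=dict(num_gpus=2))],
    'unrelated': [dict(abbr='x', run_cfg=dict(num_gpus=8))],
}

# sum(list_of_lists, []) concatenates the selected lists into one flat list.
models = sum([v for k, v in namespace.items() if k.startswith(('lmdeploy_', 'pytorch_'))], [])

# Sort so that models needing fewer GPUs come first.
models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])

order = [m['abbr'] for m in models]
print(order)
```

`str.startswith` accepts a tuple of prefixes, which is equivalent to the chained `or` used in the config.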


================================================
FILE: .github/scripts/eval_regression_chat_models.py
================================================
from copy import deepcopy

from mmengine.config import read_base

with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.gpqa.gpqa_openai_simple_evals_gen_5aeece import gpqa_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.IFEval.IFEval_gen_353ae7 import ifeval_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.math.math_0shot_gen_11c4b5 import math_datasets  # noqa: F401, E501
    # read hf models - chat models
    from opencompass.configs.models.chatglm.lmdeploy_glm4_9b_chat import \
        models as lmdeploy_glm4_9b_chat_model  # noqa: F401, E501
    from opencompass.configs.models.deepseek.lmdeploy_deepseek_r1_distill_qwen_32b import \
        models as lmdeploy_deepseek_r1_distill_qwen_32b_model  # noqa: F401, E501
    from opencompass.configs.models.deepseek.lmdeploy_deepseek_v2_5_1210 import \
        models as lmdeploy_deepseek_v2_5_1210_model  # noqa: F401, E501
    from opencompass.configs.models.deepseek.lmdeploy_deepseek_v2_lite import \
        models as lmdeploy_deepseek_v2_lite_model  # noqa: F401, E501
    from opencompass.configs.models.gemma.lmdeploy_gemma_9b_it import \
        models as pytorch_gemma_9b_it_model  # noqa: F401, E501
    from opencompass.configs.models.gemma.lmdeploy_gemma_27b_it import \
        models as pytorch_gemma_27b_it_model  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
        models as lmdeploy_internlm2_5_7b_chat_model  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_20b_chat import \
        models as lmdeploy_internlm2_5_20b_chat_model  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_1_8b import \
        models as lmdeploy_internlm2_chat_1_8b_model  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_1_8b_sft import \
        models as lmdeploy_internlm2_chat_1_8b_sft_model  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b import \
        models as lmdeploy_internlm2_chat_7b_model  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b_sft import \
        models as lmdeploy_internlm2_chat_7b_sft_model  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm3_8b_instruct import \
        models as lmdeploy_internlm3_8b_instruct_model  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.lmdeploy_llama2_7b_chat import \
        models as lmdeploy_llama2_7b_chat_model  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import \
        models as lmdeploy_llama3_1_8b_instruct_model  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_2_3b_instruct import \
        models as lmdeploy_llama3_2_3b_instruct_model  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_3_70b_instruct import \
        models as lmdeploy_llama3_3_70b_instruct_model  # noqa: F401, E501
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b_instruct import \
        models as lmdeploy_llama3_8b_instruct_model  # noqa: F401, E501
    from opencompass.configs.models.mistral.lmdeploy_mistral_large_instruct_2411 import \
        models as lmdeploy_mistral_large_instruct_2411_model  # noqa: F401, E501
    from opencompass.configs.models.mistral.lmdeploy_mistral_nemo_instruct_2407 import \
        models as lmdeploy_mistral_nemo_instruct_2407_model  # noqa: F401, E501
    from opencompass.configs.models.mistral.lmdeploy_mistral_small_instruct_2409 import \
        models as lmdeploy_mistral_small_instruct_2409_model  # noqa: F401, E501
    from opencompass.configs.models.nvidia.lmdeploy_nemotron_70b_instruct_hf import \
        models as lmdeploy_nemotron_70b_instruct_hf_model  # noqa: F401, E501
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_0_5b_instruct import \
        models as lmdeploy_qwen2_5_0_5b_instruct_model  # noqa: F401, E501
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_3b_instruct import \
        models as lmdeploy_qwen2_5_3b_instruct_model  # noqa: F401, E501
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import \
        models as lmdeploy_qwen2_5_14b_instruct_model  # noqa: F401, E501
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_32b_instruct import \
        models as lmdeploy_qwen2_5_32b_instruct_model  # noqa: F401, E501
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b_instruct import \
        models as lmdeploy_qwen2_5_72b_instruct_model  # noqa: F401, E501
    from opencompass.configs.models.qwen.lmdeploy_qwen2_1_5b_instruct import \
        models as lmdeploy_qwen2_1_5b_instruct_model  # noqa: F401, E501
    from opencompass.configs.models.qwen.lmdeploy_qwen2_7b_instruct import \
        models as lmdeploy_qwen2_7b_instruct_model  # noqa: F401, E501
    from opencompass.configs.models.yi.lmdeploy_yi_1_5_6b_chat import \
        models as lmdeploy_yi_1_5_6b_chat_model  # noqa: F401, E501
    from opencompass.configs.models.yi.lmdeploy_yi_1_5_9b_chat import \
        models as lmdeploy_yi_1_5_9b_chat_model  # noqa: F401, E501
    from opencompass.configs.models.yi.lmdeploy_yi_1_5_34b_chat import \
        models as lmdeploy_yi_1_5_34b_chat_model  # noqa: F401, E501

    from .volc import infer as volc_infer  # noqa: F401, E501

datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])

pytorch_glm4_9b_chat_model = deepcopy(lmdeploy_glm4_9b_chat_model)
pytorch_deepseek_v2_lite_model = deepcopy(lmdeploy_deepseek_v2_lite_model)
pytorch_deepseek_v2_5_1210_model = deepcopy(lmdeploy_deepseek_v2_5_1210_model)
pytorch_internlm3_8b_instruct_model = deepcopy(lmdeploy_internlm3_8b_instruct_model)
pytorch_internlm2_5_7b_chat_model = deepcopy(lmdeploy_internlm2_5_7b_chat_model)
pytorch_internlm2_5_20b_chat_model = deepcopy(lmdeploy_internlm2_5_20b_chat_model)
pytorch_llama3_2_3b_instruct_model = deepcopy(lmdeploy_llama3_2_3b_instruct_model)
pytorch_llama3_3_70b_instruct_model = deepcopy(lmdeploy_llama3_3_70b_instruct_model)
pytorch_mistral_nemo_instruct_2407_model = deepcopy(lmdeploy_mistral_nemo_instruct_2407_model)
pytorch_mistral_small_instruct_2409_model = deepcopy(lmdeploy_mistral_small_instruct_2409_model)
pytorch_qwen2_5_72b_instruct_model = deepcopy(lmdeploy_qwen2_5_72b_instruct_model)
pytorch_qwen2_5_32b_instruct_model = deepcopy(lmdeploy_qwen2_5_32b_instruct_model)
pytorch_qwen2_7b_instruct_model = deepcopy(lmdeploy_qwen2_7b_instruct_model)
pytorch_yi_1_5_34b_chat_model = deepcopy(lmdeploy_yi_1_5_34b_chat_model)
# each *_model is a list of model dicts, so apply the override per entry
for m in pytorch_deepseek_v2_5_1210_model:
    m['engine_config']['cache_max_entry_count'] = 0.6

lmdeploy_glm4_9b_chat_model_native = deepcopy(lmdeploy_glm4_9b_chat_model)
lmdeploy_deepseek_r1_distill_qwen_32b_model_native = deepcopy(lmdeploy_deepseek_r1_distill_qwen_32b_model)
lmdeploy_deepseek_v2_lite_model_native = deepcopy(lmdeploy_deepseek_v2_lite_model)
lmdeploy_deepseek_v2_5_1210_model_native = deepcopy(lmdeploy_deepseek_v2_5_1210_model)
lmdeploy_internlm3_8b_instruct_model_native = deepcopy(lmdeploy_internlm3_8b_instruct_model)
lmdeploy_internlm2_5_7b_chat_model_native = deepcopy(lmdeploy_internlm2_5_7b_chat_model)
lmdeploy_internlm2_5_20b_chat_model_native = deepcopy(lmdeploy_internlm2_5_20b_chat_model)
lmdeploy_llama3_1_8b_instruct_model_native = deepcopy(lmdeploy_llama3_1_8b_instruct_model)
lmdeploy_llama3_2_3b_instruct_model_native = deepcopy(lmdeploy_llama3_2_3b_instruct_model)
lmdeploy_llama3_8b_instruct_model_native = deepcopy(lmdeploy_llama3_8b_instruct_model)
lmdeploy_llama3_3_70b_instruct_model_native = deepcopy(lmdeploy_llama3_3_70b_instruct_model)
lmdeploy_mistral_large_instruct_2411_model_native = deepcopy(lmdeploy_mistral_large_instruct_2411_model)
lmdeploy_mistral_nemo_instruct_2407_model_native = deepcopy(lmdeploy_mistral_nemo_instruct_2407_model)
lmdeploy_mistral_small_instruct_2409_model_native = deepcopy(lmdeploy_mistral_small_instruct_2409_model)
lmdeploy_nemotron_70b_instruct_hf_model_native = deepcopy(lmdeploy_nemotron_70b_instruct_hf_model)
lmdeploy_qwen2_5_0_5b_instruct_model_native = deepcopy(lmdeploy_qwen2_5_0_5b_instruct_model)
lmdeploy_qwen2_5_14b_instruct_model_native = deepcopy(lmdeploy_qwen2_5_14b_instruct_model)
lmdeploy_qwen2_5_32b_instruct_model_native = deepcopy(lmdeploy_qwen2_5_32b_instruct_model)
lmdeploy_qwen2_5_72b_instruct_model_native = deepcopy(lmdeploy_qwen2_5_72b_instruct_model)
lmdeploy_qwen2_7b_instruct_model_native = deepcopy(lmdeploy_qwen2_7b_instruct_model)
lmdeploy_yi_1_5_6b_chat_model_native = deepcopy(lmdeploy_yi_1_5_6b_chat_model)
lmdeploy_yi_1_5_34b_chat_model_native = deepcopy(lmdeploy_yi_1_5_34b_chat_model)

for model in [v for k, v in locals().items() if k.startswith('lmdeploy_') or k.startswith('pytorch_')]:
    for m in model:
        m['engine_config']['max_batch_size'] = 512
        m['gen_config']['do_sample'] = False
        m['batch_size'] = 5000

for model in [v for k, v in locals().items() if k.startswith('lmdeploy_')]:
    for m in model:
        m['backend'] = 'turbomind'

for model in [v for k, v in locals().items() if k.startswith('pytorch_')]:
    for m in model:
        m['abbr'] = m['abbr'].replace('turbomind', 'pytorch').replace('lmdeploy', 'pytorch')
        m['backend'] = 'pytorch'

for model in [v for k, v in locals().items() if k.endswith('_native')]:
    for m in model:
        m['abbr'] = m['abbr'] + '_native'
        m['engine_config']['communicator'] = 'native'

# models = sum([v for k, v in locals().items() if  k.startswith('lmdeploy_') or k.startswith('pytorch_')], [])
# models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])

summarizer = dict(
    dataset_abbrs=[
        ['GPQA_diamond', 'accuracy'],
        ['math', 'accuracy'],
        ['IFEval', 'Prompt-level-strict-accuracy'],
    ],
    summary_groups=sum([v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)


================================================
FILE: .github/scripts/eval_stable_object_config.py
================================================
from mmengine.config import read_base
from opencompass.models import OpenAISDK

with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.ARC_c.ARC_c_cot_gen_926652 import ARC_c_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.bbh.bbh_gen_5b92b0 import bbh_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.CHARM.charm_reason_cot_only_gen_f7b7d3 import \
        charm_reason_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.cmmlu.cmmlu_0shot_cot_gen_305931 import cmmlu_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.drop.drop_openai_simple_evals_gen_3857b0 import drop_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.ds1000.ds1000_service_eval_gen_cbc84f import ds1000_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.gpqa.gpqa_openai_simple_evals_gen_5aeece import gpqa_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.gsm8k.gsm8k_0shot_v2_gen_a58960 import gsm8k_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.hellaswag.hellaswag_10shot_gen_e42710 import \
        hellaswag_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.humaneval.humaneval_openai_sample_evals_gen_159614 import \
        humaneval_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.humanevalx.humanevalx_gen_620cfa import humanevalx_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.IFEval.IFEval_gen_3321a3 import ifeval_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.LCBench.lcbench_gen_5ff288 import LCBench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.math.math_0shot_gen_393424 import math_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.MathBench.mathbench_2024_gen_50a320 import mathbench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mbpp.sanitized_mbpp_mdblock_gen_a447ff import \
        sanitized_mbpp_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mmlu.mmlu_openai_simple_evals_gen_b618ea import mmlu_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_cot_gen_08c1de import \
        mmlu_pro_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.race.race_cot_gen_d95929 import race_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.scicode.scicode_gen_085b98 import SciCode_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_cot_gen_1d56df import \
        BoolQ_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.teval.teval_en_gen_1ac254 import \
        teval_datasets as teval_en_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.teval.teval_zh_gen_1ac254 import \
        teval_datasets as teval_zh_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.TheoremQA.TheoremQA_5shot_gen_6f0af8 import TheoremQA_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.wikibench.wikibench_gen_0978ad import wikibench_datasets  # noqa: F401, E501

datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets') and 'scicode' not in k.lower() and 'teval' not in k), [])
datasets += teval_en_datasets
datasets += teval_zh_datasets
datasets += SciCode_datasets

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='lmdeploy-api-test',
        type=OpenAISDK,
        key='EMPTY',
        openai_api_base='http://localhost:23344/v1',
        path='/nvme/qa_test_models/internlm/internlm2_5-20b-chat',
        tokenizer_path='/nvme/qa_test_models/internlm/internlm2_5-20b-chat',
        rpm_verbose=True,
        meta_template=api_meta_template,
        query_per_second=100,
        max_out_len=1024,
        max_seq_len=4096,
        temperature=0.01,
        batch_size=128,
        retry=3,
    )
]
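
The `datasets` assignment near the top of this config collects every `*_datasets` list except the teval and SciCode ones, then appends those explicitly. A minimal sketch of that pattern follows; the dataset entries and the `namespace` dict are hypothetical stand-ins for the imported lists and for `locals()`. The apparent effect, assumed here rather than confirmed by the source, is that the excluded datasets end up last in a fixed order.

```python
# Hypothetical dataset lists keyed by the variable names they would have.
namespace = {
    'gsm8k_datasets': [dict(abbr='gsm8k')],
    'teval_en_datasets': [dict(abbr='teval-en')],
    'SciCode_datasets': [dict(abbr='SciCode')],
    'math_datasets': [dict(abbr='math')],
    'summarizer': dict(),  # ignored: the name does not end with '_datasets'
}

# Collect everything except the lists that are appended by hand below.
datasets = sum((v for k, v in namespace.items()
                if k.endswith('_datasets') and 'scicode' not in k.lower() and 'teval' not in k), [])

# Append the excluded lists explicitly so their position is deterministic.
datasets += namespace['teval_en_datasets']
datasets += namespace['SciCode_datasets']

abbrs = [d['abbr'] for d in datasets]
print(abbrs)
```

The case-insensitive check on `scicode` is needed because the variable is spelled `SciCode_datasets`.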


================================================
FILE: .github/scripts/eval_stable_subject_config.py
================================================
from mmengine.config import read_base
from opencompass.models import OpenAISDK
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks.subjective_eval import SubjectiveEvalTask

with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.subjective.alignbench.alignbench_judgeby_critiquellm import \
        alignbench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.alpaca_eval.alpacav2_judgeby_gpt4 import \
        alpacav2_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.arena_hard.arena_hard_compare import \
        arenahard_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.compassarena.compassarena_compare import \
        compassarena_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.fofo.fofo_bilingual_judge import fofo_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.multiround.mtbench101_judge import \
        mtbench101_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.wildbench.wildbench_pair_judge import \
        wildbench_datasets  # noqa: F401, E501

datasets = sum((v for k, v in locals().items() if k.endswith('_datasets') and 'wildbench' not in k), [])
datasets += wildbench_datasets

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='lmdeploy-api-test',
        type=OpenAISDK,
        key='EMPTY',
        openai_api_base='http://localhost:23344/v1',
        path='/nvme/qa_test_models/internlm/internlm2_5-20b-chat',
        tokenizer_path='/nvme/qa_test_models/internlm/internlm2_5-20b-chat',
        rpm_verbose=True,
        meta_template=api_meta_template,
        query_per_second=100,
        max_out_len=1024,
        max_seq_len=4096,
        temperature=0.01,
        batch_size=128,
        retry=3,
    )
]

judge_models = models

eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        models=models,
        judge_models=judge_models,
    ),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)


================================================
FILE: .github/workflows/api_eval.yml
================================================
name: api_eval

on:
  workflow_dispatch:
    inputs:
      repo_org:
        required: false
        description: 'Tested repository organization name. Default is InternLM/lmdeploy'
        type: string
        default: 'InternLM/lmdeploy'
      repo_ref:
        required: false
        description: 'Set branch or tag or commit id. Default is "main"'
        type: string
        default: 'main'
      backend:
        required: true
        description: 'Set backend filter. Default is "["turbomind", "pytorch"]"'
        type: string
        default: "['turbomind', 'pytorch']"
      execution_mode:
        required: false
        description: 'Select execution mode: infer, eval, or both. Default is "both"'
        type: choice
        options:
          - both
          - infer
          - eval
        default: 'both'
      run_id:
        required: false
        description: 'Set custom run ID. If not provided, github.run_id will be used'
        type: string
        default: ''
      offline_mode:
        required: true
        description: 'Whether to run in offline mode. If true, you must prepare the code and whl package yourself'
        type: boolean
        default: false

env:
  HOST_PIP_CACHE_DIR: /nvme/github-actions/pip-cache
  HOST_LOCALTIME: /usr/share/zoneinfo/Asia/Shanghai
  ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
  REPORT_DIR: /nvme/qa_test_models/evaluation_report/allure_report/${{ inputs.repo_ref }}_${{ github.run_id }}
  COV_PARAM: --cov /opt/py3/lib/python3.10/site-packages/lmdeploy
  TEST_CODE_PATH: /nvme/qa_test_models/test_pkg/lmdeploy/${{ inputs.repo_ref }}_${{ github.run_id }}
  OFFLINE_CODE_PATH: /nvme/qa_test_models/offline_pkg/lmdeploy
  COMPASS_DATA_CACHE: /nvme/qa_test_models/compass_data_cache
  HF_DATASETS_OFFLINE: 1
  HF_DATASETS_CACHE: /nvme/qa_test_models/hf_datasets
  HF_HUB_OFFLINE: 1
  HF_EVALUATE_OFFLINE: 1
  RUN_ID: ${{ inputs.repo_ref }}_${{ github.run_id }}

jobs:
  linux-build:
    if: ${{github.event_name == 'schedule' || (!cancelled() && !inputs.offline_mode)}}
    strategy:
      matrix:
        pyver: [py310]
    runs-on: ubuntu-latest
    env:
      PYTHON_VERSION: ${{ matrix.pyver }}
      PLAT_NAME: manylinux2014_x86_64
      DOCKER_TAG: cuda12.8
      OUTPUT_FOLDER: cuda12.8_dist_${{ github.run_id }}
    steps:
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          # Setting this to "true" frees about 6 GB but might remove tools that are actually needed
          tool-cache: false
          docker-images: false
          # All of these default to true, but feel free to set to "false" if necessary for your workflow
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          swap-storage: false
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          repository: ${{ github.event.inputs.repo_org || 'InternLM/lmdeploy' }}
          ref: ${{github.event.inputs.repo_ref || 'main'}}
      - name: Build
        run: |
          echo ${PYTHON_VERSION}
          echo ${PLAT_NAME}
          echo ${DOCKER_TAG}
          echo ${OUTPUT_FOLDER}
          echo ${GITHUB_RUN_ID}
          # remove -it: the CI runner has no TTY, so an interactive docker run would fail
          sed -i 's/docker run --rm -it/docker run --rm/g' builder/manywheel/build_wheel.sh
          bash builder/manywheel/build_wheel.sh ${PYTHON_VERSION} ${PLAT_NAME} ${DOCKER_TAG} ${OUTPUT_FOLDER}
      - name: Upload Artifacts
        uses: actions/upload-artifact@v4
        with:
          if-no-files-found: error
          path: builder/manywheel/${{ env.OUTPUT_FOLDER }}
          retention-days: 1
          name: my-artifact-${{ github.run_id }}-${{ matrix.pyver }}


  download_pkgs:
    needs: linux-build
    if: ${{!cancelled()}}
    runs-on: [self-hosted, linux-a100]
    timeout-minutes: 50
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /mnt/121:/mnt/121
        - /mnt/104:/mnt/104
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Clone repository
        uses: actions/checkout@v2
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        with:
          repository: ${{ github.event.inputs.repo_org || 'InternLM/lmdeploy' }}
          ref: ${{github.event.inputs.repo_ref || 'main'}}
      - name: Copy repository
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        run: rm -rf ${{env.TEST_CODE_PATH}} && mkdir ${{env.TEST_CODE_PATH}} && chmod 777 ${{env.TEST_CODE_PATH}} && cp -r . ${{env.TEST_CODE_PATH}}
      - name: Copy repository - offline
        if: ${{inputs.offline_mode}}
        run: rm -rf ${{env.TEST_CODE_PATH}} && mkdir ${{env.TEST_CODE_PATH}} && chmod 777 ${{env.TEST_CODE_PATH}} && cp -r ${{env.OFFLINE_CODE_PATH}}/. ${{env.TEST_CODE_PATH}}
      - name: Download Artifacts
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        uses: actions/download-artifact@v4
        with:
          name: my-artifact-${{ github.run_id }}-py310
      - name: Copy Artifacts
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        run: rm ${{env.TEST_CODE_PATH}}/lmdeploy-*.whl -f && cp lmdeploy-*.whl ${{env.TEST_CODE_PATH}}
      - name: Copy Artifacts - offline
        if: ${{inputs.offline_mode}}
        run: rm ${{env.TEST_CODE_PATH}}/lmdeploy-*.whl -f && cp ${{env.OFFLINE_CODE_PATH}}/lmdeploy-*.whl ${{env.TEST_CODE_PATH}}
      - name: Mark as start
        run: |
          chmod -R 777 ${{env.TEST_CODE_PATH}}
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt

  test_evaluation:
    needs: download_pkgs
    if: ${{ !cancelled() }}
    runs-on: [self-hosted, linux-a100]
    timeout-minutes: 7200
    strategy:
      fail-fast: false
      matrix:
        backend: ${{ fromJSON(inputs.backend || '["turbomind", "pytorch"]')}}
        gpu_num: ['gpu_num_1', 'gpu_num_2', 'gpu_num_4', 'gpu_num_8']
        transformers: ["", "legacy"]
    env:
      TEST_ENV: ${{ matrix.transformers }}
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/github-actions/packages:/root/packages
        - /nvme/github-actions/resources:/root/resources
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /nvme/huggingface_hub:/nvme/huggingface_hub
        - /mnt/121:/mnt/121
        - /mnt/104:/mnt/104
        - /mnt/bigdisk:/mnt/bigdisk
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Copy repository and Artifacts
        run: |
          cp -r ${{env.TEST_CODE_PATH}}/. .
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Install lmdeploy - dependency
        run: |
          python3 -m pip install -r /nvme/qa_test_models/offline_pkg/requirements.txt
      - name: Install lmdeploy
        run: |
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
      - name: Install opencompass
        run: |
          git clone https://github.com/open-compass/opencompass.git --depth 1
          cd opencompass
          python3 -m pip install .
          python3 -m pip install langdetect
      - name: Downgrade transformers
        if: ${{matrix.transformers == 'legacy'}}
        run: |
          pip install transformers==4.57.6
      - name: Check env
        run: |
          python3 -m pip list
          lmdeploy check_env
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Set up paths and run evaluation
        if: (matrix.backend == 'pytorch' || matrix.backend == 'turbomind')
        run: |
          overall_exit=0
          ln -s /mnt/104/opencompass-data/data ./data
          ln -s /nvme/qa_test_models/resource/nltk_data /usr/share/nltk_data
          execution_mode="${{ github.event.inputs.execution_mode || 'both' }}"
          ulimit -n 65535
          if [ "$execution_mode" = "both" ] || [ "$execution_mode" = "infer" ]; then
            pytest autotest/evaluate/test_api_evaluate.py -m "${{matrix.gpu_num}} and ${{matrix.backend}} and infer" --alluredir=${{env.REPORT_DIR}} || overall_exit=$?
          fi
          if [ "$execution_mode" = "both" ] || [ "$execution_mode" = "eval" ]; then
            pytest autotest/evaluate/test_api_evaluate.py -m "${{matrix.gpu_num}} and ${{matrix.backend}} and eval" -n 4 --alluredir=${{env.REPORT_DIR}} || overall_exit=$?
          fi
          exit $overall_exit
      - name: Clear workspace
        if: always()
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.txt
          chmod -R 777 ${{env.REPORT_DIR}}
          export workdir=$(pwd)
          rm -rf "${workdir:?}"/*


================================================
FILE: .github/workflows/benchmark.yml
================================================
name: benchmark_test

on:
  workflow_dispatch:
    inputs:
      repo_org:
        required: false
        description: 'Tested repository (organization/name). Default is InternLM/lmdeploy'
        type: string
        default: 'InternLM/lmdeploy'
      repo_ref:
        required: false
        description: 'Set branch or tag or commit id. Default is "main"'
        type: string
        default: 'main'
      benchmark_type:
        required: true
        description: 'Set benchmark type. Default is "["apiserver", "mllm_apiserver", "throughput", "longtext", "prefixcache"]"'
        type: string
        default: "['apiserver', 'mllm_apiserver', 'throughput', 'longtext', 'prefixcache']"
      backend:
        required: true
        description: 'Set backend filter. Default is "["turbomind", "pytorch"]"'
        type: string
        default: "['turbomind', 'pytorch']"
      offline_mode:
        required: true
        description: 'Whether to run in offline mode. If true, you must prepare the code and whl package yourself'
        type: boolean
        default: false

env:
  HOST_PIP_CACHE_DIR: /nvme/github-actions/pip-cache
  HOST_LOCALTIME: /usr/share/zoneinfo/Asia/Shanghai
  OUTPUT_FOLDER: cuda12.8_dist_${{ github.run_id }}
  REPORT_DIR: /nvme/qa_test_models/benchmark_report/${{ inputs.repo_ref }}_${{ github.run_id }}
  ALLURE_REPORT_DIR: /nvme/qa_test_models/benchmark_report/allure_report/${{ inputs.repo_ref }}_${{ github.run_id }}
  TEST_CODE_PATH: /nvme/qa_test_models/test_pkg/lmdeploy/${{ inputs.repo_ref }}_${{ github.run_id }}
  OFFLINE_CODE_PATH: /nvme/qa_test_models/offline_pkg/lmdeploy
  ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
  RUN_ID: ${{ inputs.repo_ref }}_${{ github.run_id }}

jobs:
  linux-build:
    if: ${{github.event_name == 'schedule' || (!cancelled() && !inputs.offline_mode)}}
    strategy:
      matrix:
        pyver: [py310]
    runs-on: ubuntu-latest
    env:
      PYTHON_VERSION: ${{ matrix.pyver }}
      PLAT_NAME: manylinux2014_x86_64
      DOCKER_TAG: cuda12.8
    steps:
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          # Setting this to "true" frees about 6 GB but might remove tools that are actually needed
          tool-cache: false
          docker-images: false
          # All of these default to true, but feel free to set to "false" if necessary for your workflow
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          swap-storage: false
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          repository: ${{ github.event.inputs.repo_org || 'InternLM/lmdeploy' }}
          ref: ${{github.event.inputs.repo_ref || 'main'}}
      - name: Build
        run: |
          echo ${PYTHON_VERSION}
          echo ${PLAT_NAME}
          echo ${DOCKER_TAG}
          echo ${OUTPUT_FOLDER}
          echo ${GITHUB_RUN_ID}
          # remove -it: the CI runner has no TTY, so an interactive docker run would fail
          sed -i 's/docker run --rm -it/docker run --rm/g' builder/manywheel/build_wheel.sh
          bash builder/manywheel/build_wheel.sh ${PYTHON_VERSION} ${PLAT_NAME} ${DOCKER_TAG} ${OUTPUT_FOLDER}
      - name: Upload Artifacts
        uses: actions/upload-artifact@v4
        with:
          if-no-files-found: error
          path: builder/manywheel/${{ env.OUTPUT_FOLDER }}
          retention-days: 1
          name: my-artifact-${{ github.run_id }}-${{ matrix.pyver }}

  download_pkgs:
    needs: linux-build
    if: ${{!cancelled()}}
    runs-on: [self-hosted, linux-a100]
    timeout-minutes: 50
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /mnt/121:/mnt/121
        - /mnt/104:/mnt/104
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Clone repository
        uses: actions/checkout@v2
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        with:
          repository: ${{ github.event.inputs.repo_org || 'InternLM/lmdeploy' }}
          ref: ${{github.event.inputs.repo_ref || 'main'}}
      - name: Copy repository
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        run: rm -rf ${{env.TEST_CODE_PATH}} && mkdir ${{env.TEST_CODE_PATH}} && chmod 777 ${{env.TEST_CODE_PATH}} && cp -r . ${{env.TEST_CODE_PATH}}
      - name: Copy repository - offline
        if: ${{inputs.offline_mode}}
        run: rm -rf ${{env.TEST_CODE_PATH}} && mkdir ${{env.TEST_CODE_PATH}} && chmod 777 ${{env.TEST_CODE_PATH}} && cp -r ${{env.OFFLINE_CODE_PATH}}/. ${{env.TEST_CODE_PATH}}
      - name: Download Artifacts
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        uses: actions/download-artifact@v4
        with:
          name: my-artifact-${{ github.run_id }}-py310
      - name: Copy Artifacts
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        run: rm ${{env.TEST_CODE_PATH}}/lmdeploy-*.whl -f && cp lmdeploy-*.whl ${{env.TEST_CODE_PATH}}
      - name: Copy Artifacts - offline
        if: ${{inputs.offline_mode}}
        run: rm ${{env.TEST_CODE_PATH}}/lmdeploy-*.whl -f && cp ${{env.OFFLINE_CODE_PATH}}/lmdeploy-*.whl ${{env.TEST_CODE_PATH}}
      - name: Mark as start
        run: |
          chmod -R 777 ${{env.TEST_CODE_PATH}}
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt

  benchmark:
    needs: download_pkgs
    if: ${{github.event_name == 'schedule' || !cancelled()}}
    runs-on: [self-hosted, linux-a100]
    strategy:
      fail-fast: false
      matrix:
        benchmark_type: ${{fromJSON(github.event.inputs.benchmark_type)}}
        gpu_num: ['gpu_num_1', 'gpu_num_2', 'gpu_num_4', 'gpu_num_8']
        transformers: ["", "legacy"]
        include:
          - n: 8
            gpu_num: gpu_num_1
          - n: 4
            gpu_num: gpu_num_2
          - n: 2
            gpu_num: gpu_num_4
          - n: 1
            gpu_num: gpu_num_8
    env:
      TEST_ENV: ${{ matrix.transformers }}
    timeout-minutes: 480
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/github-actions/packages:/root/packages
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /nvme/huggingface_hub:/nvme/huggingface_hub
        - /mnt/121:/mnt/121
        - /mnt/104:/mnt/104
        - /mnt/bigdisk:/mnt/bigdisk
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Copy repository and Artifacts
        run: |
          cp -r ${{env.TEST_CODE_PATH}}/. .
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Install lmdeploy - dependency
        run: |
          python3 -m pip install -r /nvme/qa_test_models/offline_pkg/requirements.txt
      - name: Install lmdeploy
        run: |
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
      - name: Downgrade transformers
        if: ${{matrix.transformers == 'legacy'}}
        run: |
          pip install transformers==4.57.6
      - name: Check env
        run: |
          python3 -m pip list
          lmdeploy check_env
      - name: Run other benchmark - all
        if: contains(fromJson(github.event.inputs.backend), 'turbomind') && contains(fromJson(github.event.inputs.backend), 'pytorch')
        run: |
            pytest autotest/benchmark/test_${{matrix.benchmark_type}}_performance.py -n ${{matrix.n}} -m '${{matrix.gpu_num}} and not pr_test and not function' --alluredir=${{env.ALLURE_REPORT_DIR}}
      - name: Run other benchmark - turbomind
        if: contains(fromJson(github.event.inputs.backend), 'turbomind') && !contains(fromJson(github.event.inputs.backend), 'pytorch')
        run: |
            pytest autotest/benchmark/test_${{matrix.benchmark_type}}_performance.py -n ${{matrix.n}} -m '${{matrix.gpu_num}} and not pr_test and not function and turbomind' --alluredir=${{env.ALLURE_REPORT_DIR}}
      - name: Run other benchmark - pytorch
        if: contains(fromJson(github.event.inputs.backend), 'pytorch') && !contains(fromJson(github.event.inputs.backend), 'turbomind')
        run: |
            pytest autotest/benchmark/test_${{matrix.benchmark_type}}_performance.py -n ${{matrix.n}} -m '${{matrix.gpu_num}} and not pr_test and not function and pytorch' --alluredir=${{env.ALLURE_REPORT_DIR}}
      - name: Clear workfile
        if: always()
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.txt
          chmod -R 777 $REPORT_DIR
          export workdir=$(pwd)
          cd ..
          rm -rf $workdir
          mkdir $workdir
          chmod -R 777 $workdir


================================================
FILE: .github/workflows/cuda12.8_whl_release.yml
================================================
name: cuda12.8-whl-release

on:
  push:
    tags:
      - '*'
  workflow_dispatch:

permissions:
  contents: write

jobs:
  linux-build:
    strategy:
      matrix:
        pyver: [py310, py311, py312, py313]
    runs-on: ubuntu-latest
    env:
      PYTHON_VERSION: ${{ matrix.pyver }}
      PLAT_NAME: manylinux2014_x86_64
      DOCKER_TAG: cuda12.8
      OUTPUT_FOLDER: cuda12.8_dist
      CUDA_VER: 12.8
    steps:
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          # Setting this to "true" frees about 6 GB but might remove tools that are actually needed
          tool-cache: false
          docker-images: false
          # All of these default to true, but feel free to set to "false" if necessary for your workflow
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          swap-storage: false
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Build
        run: |
          echo ${PYTHON_VERSION}
          echo ${PLAT_NAME}
          echo ${DOCKER_TAG}
          echo ${OUTPUT_FOLDER}
          # remove -it: the CI runner has no TTY, so an interactive docker run would fail
          sed -i 's/docker run --rm -it/docker run --rm/g' builder/manywheel/build_wheel.sh
          bash builder/manywheel/build_wheel.sh ${PYTHON_VERSION} ${PLAT_NAME} ${DOCKER_TAG} ${OUTPUT_FOLDER}
      - name: Upload Artifacts
        uses: actions/upload-artifact@v4
        with:
          if-no-files-found: error
          path: builder/manywheel/${{ env.OUTPUT_FOLDER }}/*
          retention-days: 1
          name: linux-${{ matrix.pyver }}

  windows-build:
    strategy:
      matrix:
        pyver: ['3.10', '3.11', '3.12', '3.13']
    runs-on: windows-latest
    steps:
      - name: Set git for windows
        run: |
          git config --global core.longpaths true
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Set up python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.pyver }}
      - name: Install python packages
        run: |
          pip install build change-wheel-version
      - name: Setup CUDA Toolkit
        id: cuda-toolkit
        shell: pwsh
        run: ./builder/windows/setup_cuda.ps1
        env:
            INPUT_CUDA_VERSION: '12.8.1'
      - name: Build wheel
        run: |
          python -m build --wheel -o build/wheel
          Get-ChildItem -Path "build" -Filter "*.whl" | ForEach-Object { change_wheel_version $_.FullName --local-version cu128 --delete-old-wheel }
      - name: Upload Artifacts
        uses: actions/upload-artifact@v4
        with:
          if-no-files-found: error
          path: build/wheel/*
          retention-days: 1
          name: windows-${{ matrix.pyver }}

  publish:
    runs-on: ubuntu-latest
    environment: 'prod'
    needs:
      - linux-build
      - windows-build
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          path: artifact
          merge-multiple: true
      - name: Add cuda version to package name
        run: |
          ver=$(cat lmdeploy/version.py | grep '__version__ =' | cut -d\' -f2)
          cuver=$ver+cu128
          ls -lh
          cd artifact
          for file in *; do
            mv "$file" "`echo $file | sed "s/$ver/$cuver/g"`";
          done
      - name: Display artifacts
        run: ls artifact/ -lh
      - name: Publish
        uses: softprops/action-gh-release@v1
        if: startsWith(github.ref, 'refs/tags/')
        with:
          files: artifact/*


================================================
FILE: .github/workflows/daily_ete_test.yml
================================================
name: daily_ete_test

on:
  workflow_dispatch:
    inputs:
      repo_org:
        required: false
        description: 'Tested repository (organization/name). Default is InternLM/lmdeploy'
        type: string
        default: 'InternLM/lmdeploy'
      repo_ref:
        required: false
        description: 'Set branch or tag or commit id. Default is "main"'
        type: string
        default: 'main'
      backend:
        required: true
        description: 'Set backend filter. Default is "["turbomind", "pytorch"]"'
        type: string
        default: "['turbomind', 'pytorch']"
      model:
        required: true
        description: 'Set testcase module filter: llm, mllm. Default contains all models'
        type: string
        default: "['llm','mllm']"
      function:
        required: true
        description: 'Set testcase function filter: chat, restful, pipeline. Default contains all functions'
        type: string
        default: '["pipeline", "restful", "chat"]'
      offline_mode:
        required: true
        description: 'Whether to run in offline mode. If true, you must prepare the code and whl package yourself'
        type: boolean
        default: false
      regression_func:
        required: true
        description: 'Set regression function filter: quant, tools, restful, pipeline, benchmark, evaluation. Default contains all functions'
        type: string
        default: "['quant', 'tools','restful','pipeline','benchmark','evaluation']"
  schedule:
    - cron:  '00 14 * * 0-4'

env:
  HOST_PIP_CACHE_DIR: /nvme/github-actions/pip-cache
  HOST_LOCALTIME: /usr/share/zoneinfo/Asia/Shanghai
  OUTPUT_FOLDER: cuda12.8_dist_${{ github.run_id }}
  ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
  ROOT_DIR: /nvme/qa_test_models
  REPORT_DIR: /nvme/qa_test_models/test-reports/${{ inputs.repo_ref || 'main' }}_${{ github.run_id }}
  COV_PARAM: --cov /opt/py3/lib/python3.10/site-packages/lmdeploy
  TEST_CODE_PATH: /nvme/qa_test_models/test_pkg/lmdeploy/${{ inputs.repo_ref || 'main' }}_${{ github.run_id }}
  OFFLINE_CODE_PATH: /nvme/qa_test_models/offline_pkg/lmdeploy
  OFFLINE_REQUIREMENTS: /nvme/qa_test_models/offline_pkg/requirements.txt
  DEEPSEEK_VL: /nvme/qa_test_models/offline_pkg/DeepSeek-VL
  RUN_ID: ${{ inputs.repo_ref || 'main' }}_${{ github.run_id }}

jobs:
  linux-build:
    if: ${{!cancelled() && (github.event_name == 'schedule' || !inputs.offline_mode)}}
    strategy:
      matrix:
        pyver: [py310]
    runs-on: ubuntu-latest
    env:
      PYTHON_VERSION: ${{ matrix.pyver }}
      PLAT_NAME: manylinux2014_x86_64
      DOCKER_TAG: cuda12.8
    steps:
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          # Setting this to "true" frees about 6 GB but might remove tools that are actually needed
          tool-cache: false
          docker-images: false
          # All of these default to true, but feel free to set to "false" if necessary for your workflow
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          swap-storage: false
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          repository: ${{ github.event.inputs.repo_org || 'InternLM/lmdeploy' }}
          ref: ${{github.event.inputs.repo_ref || 'main'}}
      - name: Build
        run: |
          echo ${PYTHON_VERSION}
          echo ${PLAT_NAME}
          echo ${DOCKER_TAG}
          echo ${OUTPUT_FOLDER}
          echo ${GITHUB_RUN_ID}
          # remove -it: the CI runner has no TTY, so an interactive docker run would fail
          sed -i 's/docker run --rm -it/docker run --rm/g' builder/manywheel/build_wheel.sh
          bash builder/manywheel/build_wheel.sh ${PYTHON_VERSION} ${PLAT_NAME} ${DOCKER_TAG} ${OUTPUT_FOLDER}
      - name: Upload Artifacts
        uses: actions/upload-artifact@v4
        with:
          if-no-files-found: error
          path: builder/manywheel/${{ env.OUTPUT_FOLDER }}
          retention-days: 1
          name: my-artifact-${{ github.run_id }}-${{ matrix.pyver }}


  download_pkgs:
    needs: linux-build
    if: ${{!cancelled()}}
    runs-on: [self-hosted, linux-a100]
    timeout-minutes: 50
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Clone repository
        uses: actions/checkout@v2
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        with:
          repository: ${{ github.event.inputs.repo_org || 'InternLM/lmdeploy' }}
          ref: ${{github.event.inputs.repo_ref || 'main'}}
      - name: Copy repository
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        run: rm -rf ${{env.TEST_CODE_PATH}} && mkdir ${{env.TEST_CODE_PATH}} && chmod 777 ${{env.TEST_CODE_PATH}} && cp -r . ${{env.TEST_CODE_PATH}}
      - name: Copy repository - offline
        if: ${{inputs.offline_mode}}
        run: rm -rf ${{env.TEST_CODE_PATH}} && mkdir ${{env.TEST_CODE_PATH}} && chmod 777 ${{env.TEST_CODE_PATH}} && cp -r ${{env.OFFLINE_CODE_PATH}}/. ${{env.TEST_CODE_PATH}}
      - name: Download Artifacts
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        uses: actions/download-artifact@v4
        with:
          name: my-artifact-${{ github.run_id }}-py310
      - name: Copy Artifacts
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        run: rm ${{env.TEST_CODE_PATH}}/lmdeploy-*.whl -f && cp lmdeploy-*.whl ${{env.TEST_CODE_PATH}}
      - name: Copy Artifacts - offline
        if: ${{inputs.offline_mode}}
        run: rm ${{env.TEST_CODE_PATH}}/lmdeploy-*.whl -f && cp ${{env.OFFLINE_CODE_PATH}}/lmdeploy-*.whl ${{env.TEST_CODE_PATH}}
      - name: Mark as start
        run: |
          chmod -R 777 ${{env.TEST_CODE_PATH}}
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt

  test_quantization:
    needs: download_pkgs
    if: ${{!cancelled() && (github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'quant') )}}
    runs-on: [self-hosted, linux-a100]
    timeout-minutes: 150
    strategy:
      matrix:
        transformers: ["", "legacy"]
    env:
      PYTHONPATH: /nvme/qa_test_models/offline_pkg/LLaVA
      MODELSCOPE_CACHE: /nvme/qa_test_models/modelscope_hub
      MODELSCOPE_MODULES_CACHE: /nvme/qa_test_models/modelscope_modules
      TEST_ENV: ${{ matrix.transformers }}
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/github-actions/packages:/root/packages
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /nvme/huggingface_hub:/nvme/huggingface_hub
        - /mnt/121:/mnt/121
        - /mnt/104:/mnt/104
        - /mnt/bigdisk:/mnt/bigdisk
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Copy repository and Artifacts
        run: |
          cp -r ${{env.TEST_CODE_PATH}}/. .
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Install lmdeploy - dependency
        run: |
          python3 -m pip install auto_gptq matplotlib attrdict
          python3 -m pip install -r requirements/lite.txt
      - name: Install lmdeploy
        run: |
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
          pip install ${{env.DEEPSEEK_VL}} --no-deps
          rm -rf ${{env.DEEPSEEK_VL}}/build
      - name: Check env
        run: |
          pip install transformers==4.57.6
          python3 -m pip list
          lmdeploy check_env
          rm -rf allure-results
          # remove tmp log in testcase
          mkdir ${{env.REPORT_DIR}}/.pytest_cache -p && rm autotest/.pytest_cache -f
          ln -s ${{env.REPORT_DIR}}/.pytest_cache autotest
      - name: Test lmdeploy - quantization w4a16
        continue-on-error: true
        if: github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.backend), 'turbomind')
        run: |
          pytest autotest/tools/quantization/test_quantization_awq.py -m 'not pr_test' -n 8 --alluredir=${{env.REPORT_DIR}} --clean-alluredir ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - quantization w8a8
        continue-on-error: true
        if: github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.backend), 'pytorch')
        run: |
          pytest autotest/tools/quantization/test_quantization_w8a8.py -n 8 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Clear workfile
        if: always()
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.txt
          chmod -R 777 ${{env.ROOT_DIR}}
          export workdir=$(pwd)
          cd ..
          rm -rf $workdir
          mkdir $workdir
          chmod -R 777 $workdir

  test_tools:
    if: ${{!cancelled() && (github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'tools'))}}
    runs-on: [self-hosted, linux-a100]
    needs: test_quantization
    timeout-minutes: 300
    strategy:
      fail-fast: false
      matrix:
        backend: ${{ fromJSON(inputs.backend || '["turbomind", "pytorch"]')}}
        model: ${{ fromJSON(inputs.model || '["llm", "mllm"]')}}
        transformers: ["", "legacy"]
        function: ${{ fromJSON(inputs.function || '["pipeline","restful","chat"]')}}
        exclude:
          - backend: turbomind
            model: mllm
            function: chat
          - backend: pytorch
            model: mllm
            function: chat
        include:
          - backend: turbomind
            model: llm
            function: other
    env:
      PYTHONPATH: /nvme/qa_test_models/offline_pkg/LLaVA
      MODELSCOPE_CACHE: /nvme/qa_test_models/modelscope_hub
      MODELSCOPE_MODULES_CACHE: /nvme/qa_test_models/modelscope_modules
      TEST_ENV: ${{ matrix.transformers }}
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/github-actions/packages:/root/packages
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /nvme/huggingface_hub:/nvme/huggingface_hub
        - /mnt/121:/mnt/121
        - /mnt/104:/mnt/104
        - /mnt/bigdisk:/mnt/bigdisk
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Copy repository and Artifacts
        run: |
          cp -r ${{env.TEST_CODE_PATH}}/. .
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Install lmdeploy - dependency
        run: |
          python3 -m pip install -r ${{env.OFFLINE_REQUIREMENTS}}
      - name: Install lmdeploy
        run: |
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
          pip install ${{env.DEEPSEEK_VL}} --no-deps
          rm -rf ${{env.DEEPSEEK_VL}}/build
      - name: Downgrade transformers
        if: ${{matrix.transformers == 'legacy'}}
        run: |
          pip install transformers==4.57.6
      - name: Check env
        run: |
          python3 -m pip list
          lmdeploy check_env
          cp -r /nvme/qa_test_models/offline_pkg/lora .
          rm -rf allure-results
          # Keep the pytest cache under REPORT_DIR and symlink it into autotest
          mkdir ${{env.REPORT_DIR}}/.pytest_cache -p && rm autotest/.pytest_cache -f
          ln -s ${{env.REPORT_DIR}}/.pytest_cache autotest
      - name: Test lmdeploy - chat
        continue-on-error: true
        if: (matrix.backend == 'pytorch' || matrix.backend == 'turbomind') && matrix.model == 'llm' && matrix.function == 'chat'
        run: |
          pytest autotest/tools/chat/test_command_chat_hf_${{matrix.backend}}.py -m 'gpu_num_1 and not pr_test' -n 8 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
          pytest autotest/tools/chat/test_command_chat_hf_${{matrix.backend}}.py -m 'gpu_num_2 and not pr_test' -n 4 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
          pytest autotest/tools/chat/test_command_chat_hf_${{matrix.backend}}.py -m 'gpu_num_4 and not pr_test' -n 2 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
          pytest autotest/tools/chat/test_command_chat_hf_${{matrix.backend}}.py -m 'gpu_num_8 and not pr_test' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - pipeline
        continue-on-error: true
        if: matrix.function == 'pipeline'
        run: |
          pytest autotest/tools/pipeline/test_pipeline_chat_${{matrix.backend}}_${{matrix.model}}.py -m 'gpu_num_1 and not pr_test' -n 8 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
          pytest autotest/tools/pipeline/test_pipeline_chat_${{matrix.backend}}_${{matrix.model}}.py -m 'gpu_num_2 and not pr_test' -n 4 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
          pytest autotest/tools/pipeline/test_pipeline_chat_${{matrix.backend}}_${{matrix.model}}.py -m 'gpu_num_4 and not pr_test' -n 2 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
          pytest autotest/tools/pipeline/test_pipeline_chat_${{matrix.backend}}_${{matrix.model}}.py -m 'gpu_num_8 and not pr_test' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - restful
        continue-on-error: true
        if: matrix.function == 'restful'
        run: |
          pytest autotest/tools/restful/test_restful_chat_hf_${{matrix.backend}}_${{matrix.model}}.py -m 'gpu_num_1 and not pr_test' -n 8 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
          pytest autotest/tools/restful/test_restful_chat_hf_${{matrix.backend}}_${{matrix.model}}.py -m 'gpu_num_2 and not pr_test' -n 4 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
          pytest autotest/tools/restful/test_restful_chat_hf_${{matrix.backend}}_${{matrix.model}}.py -m 'gpu_num_4 and not pr_test' -n 2 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
          pytest autotest/tools/restful/test_restful_chat_hf_${{matrix.backend}}_${{matrix.model}}.py -m 'gpu_num_8 and not pr_test' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - local testcase
        if: matrix.backend == 'turbomind' && matrix.model == 'llm' && matrix.function == 'other'
        run: |
          pytest autotest/toolchain --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Clear workfile
        if: always()
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.txt
          chmod -R 777 ${{env.ROOT_DIR}}
          export workdir=$(pwd)
          cd ..
          rm -rf $workdir
          mkdir $workdir
          chmod -R 777 $workdir

  test_restful:
    if: ${{!cancelled() && (github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'restful'))}}
    runs-on: [self-hosted, linux-a100]
    needs: test_quantization
    strategy:
      fail-fast: false
      matrix:
        backend: ${{ fromJSON(inputs.backend || '["turbomind", "pytorch"]')}}
        model_path: ['Qwen/Qwen3-8B-Base', 'Qwen/Qwen3-30B-A3B', 'Qwen/Qwen3-32B', 'OpenGVLab/InternVL3_5-30B-A3B', 'OpenGVLab/InternVL3-38B', 'Qwen/Qwen3-VL-8B-Instruct', 'Qwen/Qwen3-VL-30B-A3B-Instruct']
        include:
          - tp: 2
            model: Qwen3-8B-Base
            model_path: Qwen/Qwen3-8B-Base
            case_info: ['completions_v1']
            generate_type: base
          - tp: 2
            model: Qwen3-30B-A3B
            model_path: Qwen/Qwen3-30B-A3B
            case_info: ['chat_completions_v1', 'generate']
            generate_type: all
            extra: '--logprobs-mode raw_logprobs --enable-return-routed-experts'
            backend: pytorch
          - tp: 2
            model: Qwen3-30B-A3B
            model_path: Qwen/Qwen3-30B-A3B
            case_info: ['chat_completions_v1', 'generate']
            generate_type: logprob
            extra: '--logprobs-mode raw_logprobs'
            backend: turbomind
          - tp: 2
            model: InternVL3_5-30B-A3B
            model_path: OpenGVLab/InternVL3_5-30B-A3B
            case_info: ['chat_completions_v1', 'generate']
            generate_type: logprob
            extra: '--logprobs-mode raw_logprobs --enable-return-routed-experts'
            backend: pytorch
          - tp: 2
            model: InternVL3_5-30B-A3B
            model_path: OpenGVLab/InternVL3_5-30B-A3B
            case_info: ['chat_completions_v1', 'generate']
            generate_type: logprob
            extra: '--logprobs-mode raw_logprobs'
            backend: turbomind
          - tp: 2
            model: Qwen3-VL-30B-A3B-Instruct
            model_path: Qwen/Qwen3-VL-30B-A3B-Instruct
            case_info: ['chat_completions_v1', 'generate']
            generate_type: logprob
            extra: '--logprobs-mode raw_logprobs --enable-return-routed-experts'
            backend: pytorch
          - tp: 2
            model: Qwen3-VL-30B-A3B-Instruct
            model_path: Qwen/Qwen3-VL-30B-A3B-Instruct
            case_info: ['chat_completions_v1', 'generate']
            generate_type: logprob
            extra: '--logprobs-mode raw_logprobs'
            backend: turbomind
          - tp: 2
            model: Qwen3-32B
            model_path: Qwen/Qwen3-32B
            case_info: ['chat_completions_v1', 'generate']
            generate_type: logprob
            extra: '--logprobs-mode raw_logprobs'
          - tp: 1
            model: Qwen3-VL-8B-Instruct
            model_path: Qwen/Qwen3-VL-8B-Instruct
            case_info: ['chat_completions_v1', 'generate']
            generate_type: logprob
            extra: '--logprobs-mode raw_logprobs --enable-return-routed-experts'
            backend: pytorch
          - tp: 1
            model: Qwen3-VL-8B-Instruct
            model_path: Qwen/Qwen3-VL-8B-Instruct
            case_info: ['chat_completions_v1', 'generate']
            generate_type: logprob
            extra: '--logprobs-mode raw_logprobs'
            backend: turbomind
          - tp: 2
            model: InternVL3-38B
            model_path: OpenGVLab/InternVL3-38B
            case_info: ['chat_completions_v1', 'generate']
            generate_type: logprob
            extra: '--logprobs-mode raw_logprobs'
    timeout-minutes: 60
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/github-actions/packages:/root/packages
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /nvme/huggingface_hub:/nvme/huggingface_hub
        - /mnt/121:/mnt/121
        - /mnt/104:/mnt/104
        - /mnt/bigdisk:/mnt/bigdisk
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Copy repository and Artifacts
        run: |
          cp -r ${{env.TEST_CODE_PATH}}/. .
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Install lmdeploy - dependency
        run: |
          python3 -m pip install -r ${{env.OFFLINE_REQUIREMENTS}}
      - name: Install lmdeploy
        run: |
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
      - name: Check env
        run: |
          python3 -m pip list
          lmdeploy check_env
          rm -rf allure-results
          # Keep the pytest cache under REPORT_DIR and symlink it into autotest
          mkdir ${{env.REPORT_DIR}}/.pytest_cache -p && rm autotest/.pytest_cache -f
          ln -s ${{env.REPORT_DIR}}/.pytest_cache autotest
      - name: Start restful api
        run: |
          lmdeploy serve api_server /nvme/qa_test_models/${{matrix.model_path}} --tp ${{matrix.tp}} --backend ${{matrix.backend}} ${{matrix.extra}} --allow-terminate-by-client > ${{env.REPORT_DIR}}/${{matrix.backend}}_${{matrix.model}}_${{matrix.generate_type}}_start_restful.log 2>&1 &
          echo "restful_pid=$!"
          for i in $(seq 1 240)
          do
            sleep 5
            echo "health check try $i"
            if curl -f -s http://127.0.0.1:23333/health > /dev/null 2>&1; then
              echo "health check success"
              exit 0
            fi
          done

          echo "health check fail"
          curl -f -s http://127.0.0.1:23333/terminate > /dev/null 2>&1
          exit 1
      - name: Test lmdeploy - chat_completions_v1
        if: matrix.model != 'internlm2_5-20b-chat' && matrix.model != 'Intern-S1' && contains(matrix.case_info, 'chat_completions_v1')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_chat_completions_v1.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}} and not internlm2_5 and not interns1' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - chat_completions_v1 - Intern-S1
        if: matrix.model == 'Intern-S1' && contains(matrix.case_info, 'chat_completions_v1')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_chat_completions_v1.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}} and not internlm2_5' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - chat_completions_v1 - internlm2_5-20b-chat
        if: matrix.model == 'internlm2_5-20b-chat' && contains(matrix.case_info, 'chat_completions_v1')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_chat_completions_v1.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}} and not interns1' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - completions_v1 - internlm2_5-20b
        if: matrix.model == 'internlm2_5-20b' && contains(matrix.case_info, 'completions_v1')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_completions_v1.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - completions_v1 - other
        if: matrix.model != 'internlm2_5-20b' && contains(matrix.case_info, 'completions_v1')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_completions_v1.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}} and not internlm2_5' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test generate - base
        if: matrix.generate_type == 'base' && contains(matrix.case_info, 'generate')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_generate.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}} and not logprob and not experts' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test generate - logprob
        if: matrix.generate_type == 'logprob' && contains(matrix.case_info, 'generate')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_generate.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}} and not experts' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test generate - all
        if: matrix.generate_type == 'all' && contains(matrix.case_info, 'generate')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_generate.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}}' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Kill api server
        if: always()
        run: |
          curl -f -s http://127.0.0.1:23333/terminate > /dev/null 2>&1
      - name: Clear workfile
        if: always()
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.txt
          chmod -R 777 ${{env.ROOT_DIR}}
          export workdir=$(pwd)
          cd ..
          rm -rf $workdir
          mkdir $workdir
          chmod -R 777 $workdir

  test_pipeline:
    if: ${{!cancelled() && (github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'pipeline'))}}
    runs-on: [self-hosted, linux-a100]
    needs: test_quantization
    timeout-minutes: 240
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/github-actions/packages:/root/packages
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /nvme/huggingface_hub:/nvme/huggingface_hub
        - /mnt/121:/mnt/121
        - /mnt/104:/mnt/104
        - /mnt/bigdisk:/mnt/bigdisk
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Copy repository and Artifacts
        run: |
          cp -r ${{env.TEST_CODE_PATH}}/. .
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Install lmdeploy - dependency
        run: |
          python3 -m pip install -r ${{env.OFFLINE_REQUIREMENTS}}
      - name: Install lmdeploy
        run: |
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
          pip install ${{env.DEEPSEEK_VL}} --no-deps
          rm -rf ${{env.DEEPSEEK_VL}}/build
      - name: Check env
        run: |
          python3 -m pip list
          lmdeploy check_env
          rm -rf allure-results
          # Keep the pytest cache under REPORT_DIR and symlink it into autotest
          mkdir ${{env.REPORT_DIR}}/.pytest_cache -p && rm autotest/.pytest_cache -f
          ln -s ${{env.REPORT_DIR}}/.pytest_cache autotest
      - name: Test lmdeploy - interface pipeline case
        run: |
          pytest autotest/interface/pipeline/test_pipeline_func.py -m 'not pr_test' -n 4 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
          pytest autotest/interface/pipeline/test_pipeline_longtext_func.py -m 'gpu_num_1 and not pr_test' -n 8 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
          pytest autotest/interface/pipeline/test_pipeline_longtext_func.py -m 'gpu_num_2 and not pr_test' -n 4 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
          pytest autotest/interface/pipeline/test_pipeline_longtext_func.py -m 'gpu_num_4 and not pr_test' -n 2 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
          pytest autotest/interface/pipeline/test_pipeline_longtext_func.py -m 'gpu_num_8 and not pr_test' -n 1 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Clear workfile
        if: always()
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.txt
          chmod -R 777 ${{env.ROOT_DIR}}
          export workdir=$(pwd)
          cd ..
          rm -rf $workdir
          mkdir $workdir
          chmod -R 777 $workdir


  test_benchmark:
    if: ${{!cancelled() && (github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'benchmark'))}}
    runs-on: [self-hosted, linux-a100]
    needs: test_quantization
    timeout-minutes: 120
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/github-actions/packages:/root/packages
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /nvme/huggingface_hub:/nvme/huggingface_hub
        - /mnt/121:/mnt/121
        - /mnt/104:/mnt/104
        - /mnt/bigdisk:/mnt/bigdisk
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Copy repository and Artifacts
        run: |
          cp -r ${{env.TEST_CODE_PATH}}/. .
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Install lmdeploy - dependency
        run: |
          python3 -m pip install -r ${{env.OFFLINE_REQUIREMENTS}}
      - name: Install lmdeploy
        run: |
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
          pip install ${{env.DEEPSEEK_VL}} --no-deps
          rm -rf ${{env.DEEPSEEK_VL}}/build
      - name: Check env
        run: |
          python3 -m pip list
          lmdeploy check_env
          rm -rf allure-results
          # Keep the pytest cache under REPORT_DIR and symlink it into autotest
          mkdir ${{env.REPORT_DIR}}/.pytest_cache -p && rm autotest/.pytest_cache -f
          ln -s ${{env.REPORT_DIR}}/.pytest_cache autotest
      - name: Test benchmark script
        run: |
          pytest autotest/benchmark -n 4 -m function --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Clear workfile
        if: always()
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.txt
          chmod -R 777 ${{env.ROOT_DIR}}
          export workdir=$(pwd)
          cd ..
          rm -rf $workdir
          mkdir $workdir
          chmod -R 777 $workdir


  test_restful_legacy:
    if: ${{!cancelled() && (github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'restful'))}}
    runs-on: [self-hosted, linux-a100]
    needs: test_quantization
    strategy:
      fail-fast: false
      matrix:
        backend: ${{ fromJSON(inputs.backend || '["turbomind", "pytorch"]')}}
        model_path: ['internlm/Intern-S1']
        include:
          - tp: 8
            model: Intern-S1
            model_path: internlm/Intern-S1
            case_info: ['chat_completions_v1', 'generate']
            generate_type: base
    timeout-minutes: 60
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/github-actions/packages:/root/packages
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /nvme/huggingface_hub:/nvme/huggingface_hub
        - /mnt/121:/mnt/121
        - /mnt/104:/mnt/104
        - /mnt/bigdisk:/mnt/bigdisk
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Copy repository and Artifacts
        run: |
          cp -r ${{env.TEST_CODE_PATH}}/. .
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Install lmdeploy - dependency
        run: |
          python3 -m pip install -r ${{env.OFFLINE_REQUIREMENTS}}
      - name: Install lmdeploy
        run: |
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
      - name: Check env
        run: |
          pip install transformers==4.57.6
          python3 -m pip list
          lmdeploy check_env
          rm -rf allure-results
          # Keep the pytest cache under REPORT_DIR and symlink it into autotest
          mkdir ${{env.REPORT_DIR}}/.pytest_cache -p && rm autotest/.pytest_cache -f
          ln -s ${{env.REPORT_DIR}}/.pytest_cache autotest
      - name: Start restful api
        run: |
          lmdeploy serve api_server /nvme/qa_test_models/${{matrix.model_path}} --tp ${{matrix.tp}} --backend ${{matrix.backend}} ${{matrix.extra}} --allow-terminate-by-client > ${{env.REPORT_DIR}}/${{matrix.backend}}_${{matrix.model}}_${{matrix.generate_type}}_start_restful.log 2>&1 &
          echo "restful_pid=$!"
          for i in $(seq 1 240)
          do
            sleep 5
            echo "health check try $i"
            if curl -f -s http://127.0.0.1:23333/health > /dev/null 2>&1; then
              echo "health check success"
              exit 0
            fi
          done

          echo "health check fail"
          curl -f -s http://127.0.0.1:23333/terminate > /dev/null 2>&1
          exit 1
      - name: Test lmdeploy - chat_completions_v1
        if: matrix.model != 'internlm2_5-20b-chat' && matrix.model != 'Intern-S1' && contains(matrix.case_info, 'chat_completions_v1')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_chat_completions_v1.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}} and not internlm2_5 and not interns1' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - chat_completions_v1 - Intern-S1
        if: matrix.model == 'Intern-S1' && contains(matrix.case_info, 'chat_completions_v1')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_chat_completions_v1.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}} and not internlm2_5' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - chat_completions_v1 - internlm2_5-20b-chat
        if: matrix.model == 'internlm2_5-20b-chat' && contains(matrix.case_info, 'chat_completions_v1')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_chat_completions_v1.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}} and not interns1' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - completions_v1 - internlm2_5-20b
        if: matrix.model == 'internlm2_5-20b' && contains(matrix.case_info, 'completions_v1')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_completions_v1.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - completions_v1 - other
        if: matrix.model != 'internlm2_5-20b' && contains(matrix.case_info, 'completions_v1')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_completions_v1.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}} and not internlm2_5' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test generate - base
        if: matrix.generate_type == 'base' && contains(matrix.case_info, 'generate')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_generate.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}} and not logprob and not experts' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test generate - logprob
        if: matrix.generate_type == 'logprob' && contains(matrix.case_info, 'generate')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_generate.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}} and not experts' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test generate - all
        if: matrix.generate_type == 'all' && contains(matrix.case_info, 'generate')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_generate.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}}' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Kill api server
        if: always()
        run: |
          curl -f -s http://127.0.0.1:23333/terminate > /dev/null 2>&1
      - name: Clear workfile
        if: always()
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.txt
          chmod -R 777 ${{env.ROOT_DIR}}
          export workdir=$(pwd)
          cd ..
          rm -rf $workdir
          mkdir $workdir
          chmod -R 777 $workdir

  test_pipeline_legacy:
    if: ${{!cancelled() && (github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'pipeline'))}}
    runs-on: [self-hosted, linux-a100]
    needs: test_quantization
    timeout-minutes: 240
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/github-actions/packages:/root/packages
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /nvme/huggingface_hub:/nvme/huggingface_hub
        - /mnt/121:/mnt/121
        - /mnt/104:/mnt/104
        - /mnt/bigdisk:/mnt/bigdisk
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Copy repository and Artifacts
        run: |
          cp -r ${{env.TEST_CODE_PATH}}/. .
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Install lmdeploy - dependency
        run: |
          python3 -m pip install -r ${{env.OFFLINE_REQUIREMENTS}}
      - name: Install lmdeploy
        run: |
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
          pip install ${{env.DEEPSEEK_VL}} --no-deps
          rm -rf ${{env.DEEPSEEK_VL}}/build
      - name: Check env
        run: |
          pip install transformers==4.57.6
          python3 -m pip list
          lmdeploy check_env
          rm -rf allure-results
          # Keep the pytest cache under REPORT_DIR and symlink it into autotest
          mkdir ${{env.REPORT_DIR}}/.pytest_cache -p && rm autotest/.pytest_cache -f
          ln -s ${{env.REPORT_DIR}}/.pytest_cache autotest
      - name: Test lmdeploy - interface pipeline case
        run: |
          pytest autotest/interface/pipeline/test_pipeline_func.py -m 'not pr_test' -n 4 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
          pytest autotest/interface/pipeline/test_pipeline_longtext_func.py -m 'gpu_num_1 and not pr_test' -n 8 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
          pytest autotest/interface/pipeline/test_pipeline_longtext_func.py -m 'gpu_num_2 and not pr_test' -n 4 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
          pytest autotest/interface/pipeline/test_pipeline_longtext_func.py -m 'gpu_num_4 and not pr_test' -n 2 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
          pytest autotest/interface/pipeline/test_pipeline_longtext_func.py -m 'gpu_num_8 and not pr_test' -n 1 --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Clear workfile
        if: always()
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.txt
          chmod -R 777 ${{env.ROOT_DIR}}
          export workdir=$(pwd)
          cd ..
          rm -rf $workdir
          mkdir $workdir
          chmod -R 777 $workdir

  get_coverage_report:
    if: ${{!cancelled()}}
    runs-on: [self-hosted, linux-a100]
    needs: [test_tools, test_restful, test_pipeline, test_benchmark]
    timeout-minutes: 5
    container:
      image: openmmlab/lmdeploy:latest-cu12.8
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/github-actions/packages:/root/packages
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Copy repository and Artifacts
        run: cp -r ${{env.TEST_CODE_PATH}}/. .
      - name: Install lmdeploy
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.txt
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
      - name: Get coverage report
        run: |
          pip install coverage
          coverage combine ${{env.REPORT_DIR}}
          coverage xml -o ${{env.REPORT_DIR}}/coverage.xml
          coverage report -m
          mv .coverage ${{env.REPORT_DIR}}/.coverage
      - name: Clear workfile
        if: always()
        run: |
          chmod -R 777 ${{env.ROOT_DIR}}
          export workdir=$(pwd)
          cd ..
          rm -rf $workdir
          mkdir $workdir
          chmod -R 777 $workdir


================================================
FILE: .github/workflows/daily_ete_test_3090.yml
================================================
name: daily_ete_test_3090

on:
  workflow_dispatch:
    inputs:
      repo_org:
        required: false
        description: 'Tested repository (organization/name). Default is InternLM/lmdeploy'
        type: string
        default: 'InternLM/lmdeploy'
      repo_ref:
        required: false
        description: 'Set branch or tag or commit id. Default is "main"'
        type: string
        default: 'main'
      backend:
        required: true
        description: 'Set backend filter. Default is "["turbomind", "pytorch"]"'
        type: string
        default: '["turbomind", "pytorch"]'
      model:
        required: true
        description: 'Set testcase module filter: llm, mllm. Default contains all models'
        type: string
        default: '["llm", "mllm"]'
      function:
        required: true
        description: 'Set testcase function filter: chat, restful, pipeline. Default contains all functions'
        type: string
        default: '["pipeline", "restful", "chat"]'
      offline_mode:
        required: true
        description: 'Whether to run in offline mode. If true, you must prepare the code and whl package yourself'
        type: boolean
        default: false
      regression_func:
        required: true
        description: 'regression functions'
        type: string
        default: '["quant", "tools", "restful"]'
  schedule:
    - cron:  '00 14 * * 0-4'

env:
  HOST_PIP_CACHE_DIR: /nvme/github-actions/pip-cache
  HOST_LOCALTIME: /usr/share/zoneinfo/Asia/Shanghai
  OUTPUT_FOLDER: cuda12.4_dist_${{ github.run_id }}
  ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
  REPORT_DIR: /nvme/qa_test_models/test-reports/${{ inputs.repo_ref || 'main' }}_${{ github.run_id }}
  COV_PARAM: --cov /opt/py3/lib/python3.10/site-packages/lmdeploy
  FAIL_CONFIG: ${{ github.event_name == 'schedule' && github.run_attempt != 1 && '--lf --lfnf none' || '--lf'}}
  TEST_CODE_PATH: /nvme/qa_test_models/test_pkg/lmdeploy/${{ inputs.repo_ref || 'main' }}_${{ github.run_id }}
  OFFLINE_CODE_PATH: /nvme/qa_test_models/offline_pkg/lmdeploy
  OFFLINE_REQUIREMENTS: /nvme/qa_test_models/offline_pkg/requirements.txt
  RUN_ID: ${{ inputs.repo_ref || 'main' }}_${{ github.run_id }}

jobs:
  linux-build:
    if: ${{!cancelled() && (github.event_name == 'schedule' || !inputs.offline_mode)}}
    strategy:
      matrix:
        pyver: [py310]
    runs-on: ubuntu-latest
    env:
      PYTHON_VERSION: ${{ matrix.pyver }}
      PLAT_NAME: manylinux2014_x86_64
      DOCKER_TAG: cuda12.4
    steps:
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          # This might remove tools that are actually needed, if set to "true" but frees about 6 GB
          tool-cache: false
          docker-images: false
          # All of these default to true, but feel free to set to "false" if necessary for your workflow
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          swap-storage: false
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          repository: ${{ github.event.inputs.repo_org || 'InternLM/lmdeploy' }}
          ref: ${{github.event.inputs.repo_ref || 'main'}}
      - name: Build
        run: |
          echo ${PYTHON_VERSION}
          echo ${PLAT_NAME}
          echo ${DOCKER_TAG}
          echo ${OUTPUT_FOLDER}
          echo ${GITHUB_RUN_ID}
          # remove -it
          sed -i 's/docker run --rm -it/docker run --rm/g' builder/manywheel/build_wheel.sh
          bash builder/manywheel/build_wheel.sh ${PYTHON_VERSION} ${PLAT_NAME} ${DOCKER_TAG} ${OUTPUT_FOLDER}
      - name: Upload Artifacts
        uses: actions/upload-artifact@v4
        with:
          if-no-files-found: error
          path: builder/manywheel/${{ env.OUTPUT_FOLDER }}
          retention-days: 1
          name: my-artifact-${{ github.run_id }}-${{ matrix.pyver }}


  download_pkgs:
    needs: linux-build
    if: ${{!cancelled()}}
    runs-on: [self-hosted, 3090-r1]
    timeout-minutes: 50
    container:
      image: openmmlab/lmdeploy:latest-cu12
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /data1:/data1
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Clone repository
        uses: actions/checkout@v2
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        with:
          repository: ${{ github.event.inputs.repo_org || 'InternLM/lmdeploy' }}
          ref: ${{github.event.inputs.repo_ref || 'main'}}
      - name: Copy repository
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        run: rm -rf ${{env.TEST_CODE_PATH}} && mkdir ${{env.TEST_CODE_PATH}} && cp -r . ${{env.TEST_CODE_PATH}}
      - name: Copy repository - offline
        if: ${{inputs.offline_mode}}
        run: rm -rf ${{env.TEST_CODE_PATH}} && mkdir ${{env.TEST_CODE_PATH}} && cp -r ${{env.OFFLINE_CODE_PATH}}/. ${{env.TEST_CODE_PATH}}
      - name: Download Artifacts
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        uses: actions/download-artifact@v4
        with:
          name: my-artifact-${{ github.run_id }}-py310
      - name: Copy Artifacts
        if: ${{github.event_name == 'schedule' || !inputs.offline_mode}}
        run: rm ${{env.TEST_CODE_PATH}}/lmdeploy-*.whl -f && cp lmdeploy-*.whl ${{env.TEST_CODE_PATH}}
      - name: Copy Artifacts - offline
        if: ${{inputs.offline_mode}}
        run: rm ${{env.TEST_CODE_PATH}}/lmdeploy-*.whl -f && cp ${{env.OFFLINE_CODE_PATH}}/lmdeploy-*.whl ${{env.TEST_CODE_PATH}}
      - name: Mark as start
        run: |
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt

  test_quantization:
    needs: download_pkgs
    if: ${{!cancelled() && contains(needs.download_pkgs.result, 'success') && (github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'quant') )}}
    runs-on: [self-hosted, 3090-r1]
    timeout-minutes: 150
    env:
      PYTHONPATH: /nvme/qa_test_models/offline_pkg/LLaVA
      MODELSCOPE_CACHE: /nvme/qa_test_models/modelscope_hub
      MODELSCOPE_MODULES_CACHE: /nvme/qa_test_models/modelscope_modules
      TEST_ENV: 3090_legacy
    container:
      image: openmmlab/lmdeploy:latest-cu12
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /data1:/data1
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Copy repository and Artifacts
        run: |
          cp -r ${{env.TEST_CODE_PATH}}/. .
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Install lmdeploy - dependency
        run: |
          python3 -m pip install auto_gptq matplotlib
          python3 -m pip install -r requirements/lite.txt
      - name: Install lmdeploy
        run: |
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
      - name: Check env
        run: |
          python3 -m pip list
          pip install transformers==4.57.6
          lmdeploy check_env
          rm -rf allure-results
          # remove tmp log in testcase
          mkdir ${{env.REPORT_DIR}}/.pytest_cache -p && rm autotest/.pytest_cache -f
          ln -s ${{env.REPORT_DIR}}/.pytest_cache autotest
      - name: Test lmdeploy - quantization w4a16
        continue-on-error: true
        if: github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.backend), 'turbomind')
        run: |
          pytest autotest/tools/quantization/test_quantization_awq.py -m 'not pr_test and test_3090' --alluredir=${{env.REPORT_DIR}} --clean-alluredir ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - quantization w8a8
        continue-on-error: true
        if: github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.backend), 'pytorch')
        run: |
          pytest autotest/tools/quantization/test_quantization_w8a8.py --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Clear workfile
        if: always()
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.txt
          chmod -R 777 $REPORT_DIR
          export workdir=$(pwd)
          cd ..
          rm -rf $workdir
          mkdir $workdir
          chmod -R 777 $workdir

  test_tools:
    if: ${{!cancelled() && !contains(needs.test_quantization.result, 'fail') && (github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'tools'))}}
    runs-on: [self-hosted, 3090-r1]
    needs: test_quantization
    timeout-minutes: 300
    strategy:
      fail-fast: false
      matrix:
        backend: ${{ fromJSON(inputs.backend || '["turbomind", "pytorch"]')}}
        transformers: ["3090", "3090_legacy"]
        model: ${{ fromJSON(inputs.model || '["llm", "mllm"]')}}
        function: ${{ fromJSON(inputs.function || '["pipeline","restful","chat"]')}}
        exclude:
          - backend: turbomind
            model: mllm
            function: chat
          - backend: pytorch
            model: mllm
            function: chat
    env:
      PYTHONPATH: /nvme/qa_test_models/offline_pkg/LLaVA
      MODELSCOPE_CACHE: /nvme/qa_test_models/modelscope_hub
      MODELSCOPE_MODULES_CACHE: /nvme/qa_test_models/modelscope_modules
      TEST_ENV: ${{matrix.transformers}}
    container:
      image: openmmlab/lmdeploy:latest-cu12
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /data1:/data1
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    steps:
      - name: Copy repository and Artifacts
        run: |
          cp -r ${{env.TEST_CODE_PATH}}/. .
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Install lmdeploy - dependency
        run: |
          python3 -m pip install -r ${{env.OFFLINE_REQUIREMENTS}}
      - name: Install lmdeploy
        run: |
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
      - name: Downgrade transformers
        if: ${{matrix.transformers == '3090_legacy'}}
        run: |
          pip install transformers==4.57.6
      - name: Check env
        run: |
          python3 -m pip list
          lmdeploy check_env
          rm -rf allure-results
          # remove tmp log in testcase
          mkdir ${{env.REPORT_DIR}}/.pytest_cache -p && rm autotest/.pytest_cache -f
          ln -s ${{env.REPORT_DIR}}/.pytest_cache autotest
      - name: Test lmdeploy - chat
        continue-on-error: true
        if: (matrix.backend == 'pytorch' || matrix.backend == 'turbomind') && matrix.model == 'llm' && matrix.function == 'chat'
        run: |
          pytest autotest/tools/chat/test_command_chat_hf_${{matrix.backend}}.py -m 'gpu_num_1 and not pr_test and test_3090' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
      - name: Test lmdeploy - pipeline
        continue-on-error: true
        if: matrix.function == 'pipeline'
        run: |
          pytest autotest/tools/pipeline/test_pipeline_chat_${{matrix.backend}}_${{matrix.model}}.py -m 'gpu_num_1 and not pr_test and test_3090' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
      - name: Test lmdeploy - restful
        continue-on-error: true
        if: matrix.function == 'restful'
        run: |
          pytest autotest/tools/restful/test_restful_chat_hf_${{matrix.backend}}_${{matrix.model}}.py -m 'gpu_num_1 and not pr_test and test_3090' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S') || true
      - name: Clear workfile
        if: always()
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.txt
          chmod -R 777 $REPORT_DIR
          export workdir=$(pwd)
          cd ..
          rm -rf $workdir
          mkdir $workdir
          chmod -R 777 $workdir

  test_restful:
    if: ${{!cancelled() && !contains(needs.test_quantization.result, 'fail') && (github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'restful'))}}
    runs-on: [self-hosted, 3090-r1]
    needs: test_quantization
    strategy:
      fail-fast: false
      matrix:
        backend: ${{ fromJSON(inputs.backend || '["turbomind", "pytorch"]')}}
        transformers: ["3090", "3090_legacy"]
        model_path: ['internlm/internlm3-8b-instruct', 'Qwen/Qwen3-8B']
        include:
          - tp: 1
            model: internlm3-8b-instruct
            model_path: internlm/internlm3-8b-instruct
            case_info: ['chat_completions_v1', 'generate']
            generate_type: logprob
            extra: '--logprobs-mode raw_logprobs'
          - tp: 1
            model: Qwen3-8B
            model_path: Qwen/Qwen3-8B
            case_info: ['completions_v1']
            generate_type: base
    timeout-minutes: 60
    container:
      image: openmmlab/lmdeploy:latest-cu12
      options: "--gpus=all --ipc=host --user root -e PIP_CACHE_DIR=/root/.cache/pip -e NVIDIA_DISABLE_REQUIRE=1 --pull never"
      volumes:
        - /nvme/github-actions/pip-cache:/root/.cache/pip
        - /nvme/qa_test_models:/nvme/qa_test_models
        - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
    env:
      TEST_ENV: ${{matrix.transformers}}
    steps:
      - name: Copy repository and Artifacts
        run: |
          cp -r ${{env.TEST_CODE_PATH}}/. .
          mkdir ${{env.REPORT_DIR}} -p
          echo "starttime=$(date +%s)" > ${{env.REPORT_DIR}}/status.txt
      - name: Install lmdeploy - dependency
        run: |
          python3 -m pip install -r ${{env.OFFLINE_REQUIREMENTS}}
      - name: Install lmdeploy
        run: |
          python3 -m pip uninstall lmdeploy -y && python3 -m pip install lmdeploy-*.whl --no-deps
          python3 -m pip install -r requirements/test.txt
      - name: Downgrade transformers
        if: ${{matrix.transformers == '3090_legacy'}}
        run: |
          pip install transformers==4.57.6
      - name: Check env
        run: |
          python3 -m pip list
          lmdeploy check_env
          rm -rf allure-results
          # remove tmp log in testcase
          mkdir ${{env.REPORT_DIR}}/.pytest_cache -p && rm autotest/.pytest_cache -f
          ln -s ${{env.REPORT_DIR}}/.pytest_cache autotest
      - name: Start restful api
        run: |
          lmdeploy serve api_server /nvme/qa_test_models/${{matrix.model_path}} --tp ${{matrix.tp}} --backend ${{matrix.backend}} ${{matrix.extra}} > ${{env.REPORT_DIR}}/${{matrix.backend}}_${{matrix.model}}_${{matrix.generate_type}}_start_restful.log 2>&1 &
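          # Values written to $GITHUB_ENV are only visible in later steps,
          # so keep the PID in a shell variable for the cleanup in this step too.
          restful_pid=$!
          echo "restful_pid=$restful_pid" >> "$GITHUB_ENV"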
          for i in $(seq 1 180)
          do
            sleep 5
            echo "health check try $i"
            if curl -f -s http://127.0.0.1:23333/health > /dev/null 2>&1; then
              echo "health check success"
              exit 0
            fi
          done

          echo "health check fail"
          kill -15 $restful_pid 2>/dev/null || true
          exit 1
      - name: Test lmdeploy - chat_completions_v1
        if: contains(matrix.case_info, 'chat_completions_v1')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_chat_completions_v1.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}} and not internlm2_5 and not interns1' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test lmdeploy - completions_v1 - other
        if: contains(matrix.case_info, 'completions_v1')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_completions_v1.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}} and not internlm2_5' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Test generate - logprob
        if: matrix.generate_type == 'logprob' && contains(matrix.case_info, 'generate')
        timeout-minutes: 60
        run: |
          pytest autotest/interface/restful/test_restful_generate.py -n 20 -k '${{matrix.model_path}} and ${{matrix.backend}}' -m 'not not_${{matrix.backend}} and not experts' --alluredir=${{env.REPORT_DIR}} ${{env.COV_PARAM}} || true
          mv .coverage ${{env.REPORT_DIR}}/.coverage.$(date +'%Y%m%d%H%M%S')
      - name: Kill api server
        if: always()
        run: |
          kill -15 "$restful_pid" 2>/dev/null || true
      - name: Clear workfile
        if: always()
        run: |
          echo "status=done" >> ${{env.REPORT_DIR}}/status.t
│   │   │   │   ├── norm.py
│   │   │   │   ├── nsa.py
│   │   │   │   ├── op_backend.py
│   │   │   │   ├── qmodules.py
│   │   │   │   ├── token_dispatcher.py
│   │   │   │   ├── utils.py
│   │   │   │   └── warmup_manager.py
│   │   │   ├── deepep_moe_checker.py
│   │   │   ├── default/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── activation.py
│   │   │   │   ├── apply_rotary_emb.py
│   │   │   │   ├── awq_modules.py
│   │   │   │   ├── embedding.py
│   │   │   │   ├── linear.py
│   │   │   │   ├── moe.py
│   │   │   │   ├── moe_router.py
│   │   │   │   ├── multinomial_sampling.py
│   │   │   │   ├── norm.py
│   │   │   │   ├── op_backend.py
│   │   │   │   ├── rotary_embedding.py
│   │   │   │   └── token_dispatcher.py
│   │   │   ├── dlinfer/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── activation.py
│   │   │   │   ├── apply_rotary_emb.py
│   │   │   │   ├── ascend/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── op_backend.py
│   │   │   │   │   └── utils.py
│   │   │   │   ├── attention.py
│   │   │   │   ├── awq_modules.py
│   │   │   │   ├── camb/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   └── op_backend.py
│   │   │   │   ├── flash_attention.py
│   │   │   │   ├── linear.py
│   │   │   │   ├── maca/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   └── op_backend.py
│   │   │   │   ├── moe.py
│   │   │   │   ├── norm.py
│   │   │   │   ├── op_backend.py
│   │   │   │   ├── qmodules.py
│   │   │   │   └── rotary_embedding.py
│   │   │   ├── embedding.py
│   │   │   ├── flash_attention.py
│   │   │   ├── gated_delta_rule.py
│   │   │   ├── graph_runner.py
│   │   │   ├── linear.py
│   │   │   ├── lora.py
│   │   │   ├── moe.py
│   │   │   ├── moe_router.py
│   │   │   ├── multinomial_sampling.py
│   │   │   ├── norm.py
│   │   │   ├── nsa.py
│   │   │   ├── qmodules.py
│   │   │   ├── rotary_embedding.py
│   │   │   ├── selector.py
│   │   │   └── token_dispatcher.py
│   │   ├── block.py
│   │   ├── check_env/
│   │   │   ├── __init__.py
│   │   │   ├── adapter.py
│   │   │   ├── base.py
│   │   │   ├── cuda.py
│   │   │   ├── deeplink.py
│   │   │   ├── dist.py
│   │   │   ├── model.py
│   │   │   ├── torch.py
│   │   │   ├── transformers.py
│   │   │   ├── triton.py
│   │   │   └── triton_custom_add.py
│   │   ├── config.py
│   │   ├── configurations/
│   │   │   ├── __init__.py
│   │   │   ├── builder.py
│   │   │   ├── chatglm.py
│   │   │   ├── cogvlm.py
│   │   │   ├── deepseek_v2.py
│   │   │   ├── deepseek_v32.py
│   │   │   ├── deepseek_vl2.py
│   │   │   ├── default.py
│   │   │   ├── gemma.py
│   │   │   ├── glm4.py
│   │   │   ├── gpt_oss.py
│   │   │   ├── interns1_pro.py
│   │   │   ├── internvl.py
│   │   │   ├── internvl3_hf.py
│   │   │   ├── llama.py
│   │   │   ├── llama4.py
│   │   │   ├── llava_hf.py
│   │   │   ├── minicpm3.py
│   │   │   ├── qwen.py
│   │   │   ├── qwen3_5.py
│   │   │   ├── qwen3_next.py
│   │   │   ├── qwen3_vl.py
│   │   │   ├── sdar.py
│   │   │   └── utils.py
│   │   ├── consts.py
│   │   ├── devices/
│   │   │   ├── __init__.py
│   │   │   └── device_manager.py
│   │   ├── disagg/
│   │   │   ├── README.md
│   │   │   ├── __init__.py
│   │   │   ├── backend/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── backend.py
│   │   │   │   ├── base.py
│   │   │   │   ├── dlslime.py
│   │   │   │   └── mooncake.py
│   │   │   ├── config.py
│   │   │   ├── conn/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── engine_conn.py
│   │   │   │   ├── protocol.py
│   │   │   │   └── proxy_conn.py
│   │   │   └── messages.py
│   │   ├── distributed.py
│   │   ├── engine/
│   │   │   ├── __init__.py
│   │   │   ├── base.py
│   │   │   ├── cache_engine.py
│   │   │   ├── config_builder.py
│   │   │   ├── engine.py
│   │   │   ├── engine_checker.py
│   │   │   ├── engine_instance.py
│   │   │   ├── engine_loop.py
│   │   │   ├── executor/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── base.py
│   │   │   │   ├── base_worker.py
│   │   │   │   ├── dist_utils.py
│   │   │   │   ├── mp_executor.py
│   │   │   │   ├── ray_executor.py
│   │   │   │   └── uni_executor.py
│   │   │   ├── guided_process.py
│   │   │   ├── input_process.py
│   │   │   ├── inputs_maker.py
│   │   │   ├── logits_process.py
│   │   │   ├── model_agent/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── agent.py
│   │   │   │   ├── inputs_maker.py
│   │   │   │   └── profiler.py
│   │   │   ├── mp_engine/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── base.py
│   │   │   │   ├── base_worker.py
│   │   │   │   ├── ray_engine.py
│   │   │   │   ├── zmq_engine.py
│   │   │   │   └── zmq_rpc.py
│   │   │   └── request.py
│   │   ├── envs.py
│   │   ├── kernels/
│   │   │   ├── __init__.py
│   │   │   ├── cuda/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── activation.py
│   │   │   │   ├── apply_rotary_pos_emb.py
│   │   │   │   ├── awq_kernels.py
│   │   │   │   ├── bitonic_topk.py
│   │   │   │   ├── blocked_fp8_fused_moe.py
│   │   │   │   ├── blocked_gemm_fp8.py
│   │   │   │   ├── causal_conv1d.py
│   │   │   │   ├── ds_index.py
│   │   │   │   ├── fill_kv_cache.py
│   │   │   │   ├── flashattention.py
│   │   │   │   ├── flatten_kv_cache.py
│   │   │   │   ├── fused_lora.py
│   │   │   │   ├── fused_moe.py
│   │   │   │   ├── fused_moe_ep.py
│   │   │   │   ├── fused_noaux_tc.py
│   │   │   │   ├── gated_delta_rule.py
│   │   │   │   ├── multinomial_sampling.py
│   │   │   │   ├── pagedattention.py
│   │   │   │   ├── rms_norm.py
│   │   │   │   ├── utils.py
│   │   │   │   ├── w8a8_fused_moe.py
│   │   │   │   └── w8a8_triton_kernels.py
│   │   │   ├── default/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── multinomial_sampling.py
│   │   │   │   └── w8a8_kernels.py
│   │   │   ├── dispatcher.py
│   │   │   ├── dlinfer/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── activation.py
│   │   │   │   ├── apply_rotary_pos_emb.py
│   │   │   │   ├── awq_kernels.py
│   │   │   │   ├── fill_kv_cache.py
│   │   │   │   ├── flash_attention.py
│   │   │   │   ├── fused_moe.py
│   │   │   │   ├── fused_rotary_emb.py
│   │   │   │   ├── linear.py
│   │   │   │   ├── moe_gating_topk_softmax.py
│   │   │   │   ├── pagedattention.py
│   │   │   │   ├── rms_norm.py
│   │   │   │   └── w8a8_kernels.py
│   │   │   └── w8a8_triton_kernels.py
│   │   ├── messages.py
│   │   ├── model_inputs.py
│   │   ├── models/
│   │   │   ├── __init__.py
│   │   │   ├── baichuan.py
│   │   │   ├── chatglm2.py
│   │   │   ├── cogvlm.py
│   │   │   ├── deepseek.py
│   │   │   ├── deepseek_mtp.py
│   │   │   ├── deepseek_v2.py
│   │   │   ├── deepseek_v32.py
│   │   │   ├── deepseek_vl2.py
│   │   │   ├── gemma.py
│   │   │   ├── gemma3_vl.py
│   │   │   ├── glm4.py
│   │   │   ├── glm4_1v.py
│   │   │   ├── glm4_moe.py
│   │   │   ├── glm4moe_mtp.py
│   │   │   ├── gpt_oss.py
│   │   │   ├── internlm.py
│   │   │   ├── internlm2.py
│   │   │   ├── internlm2_reward.py
│   │   │   ├── internlm2_ve.py
│   │   │   ├── internlm3.py
│   │   │   ├── interns1_pro.py
│   │   │   ├── interns1_pro_ts.py
│   │   │   ├── internvl.py
│   │   │   ├── internvl3_hf.py
│   │   │   ├── internvl_patch.py
│   │   │   ├── llama.py
│   │   │   ├── llama4.py
│   │   │   ├── llama_eagle.py
│   │   │   ├── llama_eagle3.py
│   │   │   ├── llava.py
│   │   │   ├── minicpm3.py
│   │   │   ├── minicpmv26.py
│   │   │   ├── mistral.py
│   │   │   ├── mixtral.py
│   │   │   ├── module_map.py
│   │   │   ├── patch.py
│   │   │   ├── phi3.py
│   │   │   ├── phi3_moe.py
│   │   │   ├── phi3_v.py
│   │   │   ├── q_modules.py
│   │   │   ├── qwen.py
│   │   │   ├── qwen2.py
│   │   │   ├── qwen2_5_vl.py
│   │   │   ├── qwen2_moe.py
│   │   │   ├── qwen2_reward.py
│   │   │   ├── qwen2_vl.py
│   │   │   ├── qwen3.py
│   │   │   ├── qwen3_5.py
│   │   │   ├── qwen3_5_moe.py
│   │   │   ├── qwen3_moe.py
│   │   │   ├── qwen3_next.py
│   │   │   ├── qwen3_vl.py
│   │   │   ├── qwen3_vl_moe.py
│   │   │   ├── sdar.py
│   │   │   ├── sdar_moe.py
│   │   │   ├── siglip.py
│   │   │   ├── starcoder2.py
│   │   │   ├── utils/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── cudagraph.py
│   │   │   │   ├── micro_batch.py
│   │   │   │   └── model.py
│   │   │   └── whisper.py
│   │   ├── multimodal/
│   │   │   ├── __init__.py
│   │   │   └── data_type.py
│   │   ├── nn/
│   │   │   ├── __init__.py
│   │   │   ├── activation.py
│   │   │   ├── attention.py
│   │   │   ├── embedding.py
│   │   │   ├── eplb.py
│   │   │   ├── gated_delta.py
│   │   │   ├── linear/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── awq.py
│   │   │   │   ├── base.py
│   │   │   │   ├── blocked_fp8.py
│   │   │   │   ├── default.py
│   │   │   │   ├── lora.py
│   │   │   │   ├── utils.py
│   │   │   │   └── w8a8.py
│   │   │   ├── moe/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── base.py
│   │   │   │   ├── blocked_fp8.py
│   │   │   │   ├── default.py
│   │   │   │   ├── route.py
│   │   │   │   └── w8a8.py
│   │   │   ├── multinomial_sampling.py
│   │   │   ├── norm.py
│   │   │   ├── nsa.py
│   │   │   ├── quant_utils.py
│   │   │   ├── rotary_embedding.py
│   │   │   └── utils.py
│   │   ├── paging/
│   │   │   ├── __init__.py
│   │   │   ├── block_manager/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── base_block_manager.py
│   │   │   │   ├── default_block_manager.py
│   │   │   │   └── window_block_manager.py
│   │   │   ├── block_trie.py
│   │   │   ├── eviction_helper/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── base_eviction_helper.py
│   │   │   │   └── recompute_eviction_helper.py
│   │   │   ├── scheduler.py
│   │   │   ├── seq_states/
│   │   │   │   ├── __init__.py
│   │   │   │   └── states.py
│   │   │   └── state_manager.py
│   │   ├── ray.py
│   │   ├── spec_decode/
│   │   │   ├── __init__.py
│   │   │   ├── base.py
│   │   │   ├── proposers/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── base.py
│   │   │   │   ├── deepseek_mtp.py
│   │   │   │   ├── eagle.py
│   │   │   │   └── eagle3.py
│   │   │   ├── reject_sampler.py
│   │   │   └── spec_agent.py
│   │   ├── strategies/
│   │   │   ├── __init__.py
│   │   │   ├── ar/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── cudagraph.py
│   │   │   │   ├── engine.py
│   │   │   │   ├── model_agent.py
│   │   │   │   ├── model_inputs.py
│   │   │   │   ├── sampling.py
│   │   │   │   └── sequence.py
│   │   │   ├── ar_spec/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── cudagraph.py
│   │   │   │   ├── engine.py
│   │   │   │   ├── model_agent.py
│   │   │   │   ├── model_inputs.py
│   │   │   │   ├── sampling.py
│   │   │   │   └── sequence.py
│   │   │   ├── base/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── cudagraph.py
│   │   │   │   ├── engine.py
│   │   │   │   ├── model_agent.py
│   │   │   │   ├── model_inputs.py
│   │   │   │   ├── sampling.py
│   │   │   │   └── sequence.py
│   │   │   └── dllm/
│   │   │       ├── __init__.py
│   │   │       ├── cudagraph.py
│   │   │       ├── engine.py
│   │   │       ├── model_agent.py
│   │   │       ├── model_inputs.py
│   │   │       ├── sampling.py
│   │   │       ├── sequence.py
│   │   │       └── unmasking.py
│   │   ├── third_party/
│   │   │   ├── __init__.py
│   │   │   ├── deep_gemm/
│   │   │   │   └── __init__.py
│   │   │   └── flash_attn_interface.py
│   │   ├── tools/
│   │   │   ├── __init__.py
│   │   │   └── utils.py
│   │   ├── transformers/
│   │   │   ├── __init__.py
│   │   │   └── configuration_deepseek_v32.py
│   │   ├── utils.py
│   │   └── weight_loader/
│   │       ├── __init__.py
│   │       └── model_weight_loader.py
│   ├── serve/
│   │   ├── __init__.py
│   │   ├── core/
│   │   │   ├── __init__.py
│   │   │   ├── async_engine.py
│   │   │   ├── exceptions.py
│   │   │   └── vl_async_engine.py
│   │   ├── managers/
│   │   │   ├── __init__.py
│   │   │   └── session_manager.py
│   │   ├── openai/
│   │   │   ├── __init__.py
│   │   │   ├── api_client.py
│   │   │   ├── api_server.py
│   │   │   ├── harmony_utils.py
│   │   │   ├── launch_server.py
│   │   │   ├── protocol.py
│   │   │   ├── reasoning_parser/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── deepseek_r1_reasoning_parser.py
│   │   │   │   ├── qwen_qwq_reasoning_parser.py
│   │   │   │   └── reasoning_parser.py
│   │   │   ├── serving_chat_completion.py
│   │   │   ├── serving_completion.py
│   │   │   ├── serving_generate.py
│   │   │   └── tool_parser/
│   │   │       ├── __init__.py
│   │   │       ├── internlm2_parser.py
│   │   │       ├── llama3_parser.py
│   │   │       ├── qwen2d5_parser.py
│   │   │       ├── qwen3_parser.py
│   │   │       ├── qwen3coder_parser.py
│   │   │       ├── tool_parser.py
│   │   │       └── utils.py
│   │   ├── processors/
│   │   │   ├── __init__.py
│   │   │   └── multimodal.py
│   │   ├── proxy/
│   │   │   ├── __init__.py
│   │   │   ├── proxy.py
│   │   │   ├── streaming_response.py
│   │   │   └── utils.py
│   │   └── utils/
│   │       ├── __init__.py
│   │       └── server_utils.py
│   ├── tokenizer.py
│   ├── turbomind/
│   │   ├── __init__.py
│   │   ├── deploy/
│   │   │   ├── __init__.py
│   │   │   ├── config.py
│   │   │   ├── converter.py
│   │   │   ├── loader.py
│   │   │   ├── module.py
│   │   │   ├── parameter.py
│   │   │   ├── policy.py
│   │   │   ├── source_model/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── baichuan.py
│   │   │   │   ├── base.py
│   │   │   │   ├── deepseek2.py
│   │   │   │   ├── deepseek_vl.py
│   │   │   │   ├── glm4.py
│   │   │   │   ├── glm4_moe_lite.py
│   │   │   │   ├── gpt_oss.py
│   │   │   │   ├── internlm2.py
│   │   │   │   ├── internvl.py
│   │   │   │   ├── llama.py
│   │   │   │   ├── llava.py
│   │   │   │   ├── minicpmv.py
│   │   │   │   ├── mixtral.py
│   │   │   │   ├── molmo.py
│   │   │   │   ├── qwen.py
│   │   │   │   └── xcomposer2.py
│   │   │   └── target_model/
│   │   │       ├── __init__.py
│   │   │       ├── base.py
│   │   │       └── fp.py
│   │   ├── supported_models.py
│   │   ├── tokenizer_info.py
│   │   └── turbomind.py
│   ├── utils.py
│   ├── version.py
│   └── vl/
│       ├── __init__.py
│       ├── constants.py
│       ├── engine.py
│       ├── media/
│       │   ├── __init__.py
│       │   ├── base.py
│       │   ├── connection.py
│       │   ├── image.py
│       │   ├── time_series.py
│       │   ├── video.py
│       │   └── video_loader.py
│       ├── model/
│       │   ├── __init__.py
│       │   ├── base.py
│       │   ├── builder.py
│       │   ├── cogvlm.py
│       │   ├── deepseek.py
│       │   ├── deepseek_vl2.py
│       │   ├── gemma3_vl.py
│       │   ├── glm4_1v.py
│       │   ├── glm4_v.py
│       │   ├── interns1_pro.py
│       │   ├── internvl.py
│       │   ├── internvl3_hf.py
│       │   ├── internvl_llava.py
│       │   ├── llama4.py
│       │   ├── llava.py
│       │   ├── llava_hf.py
│       │   ├── llava_next.py
│       │   ├── minicpmv.py
│       │   ├── mllama.py
│       │   ├── molmo.py
│       │   ├── phi3_vision.py
│       │   ├── qwen.py
│       │   ├── qwen2.py
│       │   ├── qwen3.py
│       │   ├── qwen3_5.py
│       │   ├── utils.py
│       │   ├── xcomposer2.py
│       │   └── yi.py
│       ├── tools/
│       │   ├── __init__.py
│       │   └── merge_xcomposer2d5_task.py
│       └── utils.py
├── pyproject.toml
├── setup.py
├── src/
│   ├── CMakeLists.txt
│   └── turbomind/
│       ├── CMakeLists.txt
│       ├── comm/
│       │   ├── CMakeLists.txt
│       │   ├── barrier.h
│       │   ├── cuda_ipc/
│       │   │   ├── CMakeLists.txt
│       │   │   ├── allgather.cu
│       │   │   ├── allreduce.cu
│       │   │   ├── bootstrap.h
│       │   │   ├── broadcast.cu
│       │   │   ├── common.h
│       │   │   ├── cuda_ipc_comm.cu
│       │   │   ├── cuda_ipc_comm.h
│       │   │   ├── fused_allreduce.cu
│       │   │   ├── fused_allreduce_ex.cu
│       │   │   ├── group_sum.h
│       │   │   ├── mscclpp.h
│       │   │   ├── multimem.cuh
│       │   │   ├── semaphore.cuh
│       │   │   └── semaphore.h
│       │   ├── device_comm.cc
│       │   ├── device_comm.h
│       │   ├── env.h
│       │   ├── gloo/
│       │   │   ├── CMakeLists.txt
│       │   │   ├── gloo_comm.cc
│       │   │   ├── hybrid_comm.cc
│       │   │   ├── tcp_store.cc
│       │   │   ├── tcp_store.h
│       │   │   └── test_ipc_comm.cc
│       │   ├── host_comm.cc
│       │   ├── host_comm.h
│       │   ├── nccl/
│       │   │   ├── CMakeLists.txt
│       │   │   └── nccl.cu
│       │   ├── test_comm.cu
│       │   ├── test_host_comm.cc
│       │   └── thread_comm.cc
│       ├── core/
│       │   ├── CMakeLists.txt
│       │   ├── allocator.cc
│       │   ├── allocator.h
│       │   ├── buffer.cc
│       │   ├── buffer.h
│       │   ├── check.cc
│       │   ├── check.h
│       │   ├── common.h
│       │   ├── context.cc
│       │   ├── context.h
│       │   ├── copy.cc
│       │   ├── copy.h
│       │   ├── core.h
│       │   ├── cuda_data_type.h
│       │   ├── data_type.h
│       │   ├── interval.h
│       │   ├── layout.cc
│       │   ├── layout.h
│       │   ├── module.cc
│       │   ├── module.h
│       │   ├── ranges.h
│       │   ├── serdes.h
│       │   ├── state.h
│       │   ├── stream.cc
│       │   ├── stream.h
│       │   ├── tensor.cc
│       │   ├── tensor.cu
│       │   ├── tensor.h
│       │   └── test_core.cc
│       ├── engine/
│       │   ├── CMakeLists.txt
│       │   ├── batch.h
│       │   ├── engine.cc
│       │   ├── engine.h
│       │   ├── gateway.cc
│       │   ├── gateway.h
│       │   ├── model_executor.cc
│       │   ├── model_executor.h
│       │   ├── model_request.cc
│       │   ├── model_request.h
│       │   ├── queue.h
│       │   ├── request.cc
│       │   ├── request.h
│       │   ├── request_queue.cc
│       │   ├── request_queue.h
│       │   └── signal_buffer.h
│       ├── generation/
│       │   ├── CMakeLists.txt
│       │   ├── base_param.h
│       │   ├── generation.cc
│       │   ├── generation.h
│       │   ├── guided_decoding.cc
│       │   ├── guided_decoding.h
│       │   ├── logits_processor.cc
│       │   ├── logits_processor.h
│       │   ├── sampling.cc
│       │   ├── sampling.h
│       │   ├── stop_criteria.cc
│       │   ├── stop_criteria.h
│       │   └── utils.h
│       ├── kernels/
│       │   ├── CMakeLists.txt
│       │   ├── activation.cu
│       │   ├── activation.h
│       │   ├── activation_kernels.cu
│       │   ├── activation_kernels.h
│       │   ├── apply_token_bitmask_inplace_cuda.cu
│       │   ├── apply_token_bitmask_inplace_cuda.h
│       │   ├── attention/
│       │   │   ├── CMakeLists.txt
│       │   │   ├── arch.h
│       │   │   ├── attention.cu
│       │   │   ├── attention.h
│       │   │   ├── attention_params.h
│       │   │   ├── attention_template.h
│       │   │   ├── attention_universal.h
│       │   │   ├── block.h
│       │   │   ├── block_iterator.h
│       │   │   ├── cp_utils.cu
│       │   │   ├── cp_utils.h
│       │   │   ├── cta_map.h
│       │   │   ├── decoding.cu
│       │   │   ├── decoding.h
│       │   │   ├── decoding_template.h
│       │   │   ├── desc.h
│       │   │   ├── impl.h
│       │   │   ├── impl_16816.h
│       │   │   ├── impl_1688.h
│       │   │   ├── impl_81616.h
│       │   │   ├── impl_884.h
│       │   │   ├── impl_m16n8.h
│       │   │   ├── impl_simt.h
│       │   │   ├── iterator.h
│       │   │   ├── iterator_sm70.h
│       │   │   ├── iterator_sm80.h
│       │   │   ├── kernel/
│       │   │   │   ├── CMakeLists.txt
│       │   │   │   ├── attention_sm70_128.cu
│       │   │   │   ├── attention_sm70_256.cu
│       │   │   │   ├── attention_sm70_576.cu
│       │   │   │   ├── attention_sm70_64.cu
│       │   │   │   ├── attention_sm75_128.cu
│       │   │   │   ├── attention_sm75_256.cu
│       │   │   │   ├── attention_sm75_576.cu
│       │   │   │   ├── attention_sm75_64.cu
│       │   │   │   ├── attention_sm80_128.cu
│       │   │   │   ├── attention_sm80_192.cu
│       │   │   │   ├── attention_sm80_256.cu
│       │   │   │   ├── attention_sm80_576.cu
│       │   │   │   ├── attention_sm80_64.cu
│       │   │   │   ├── decoding_sm70_128.cu
│       │   │   │   ├── decoding_sm70_256.cu
│       │   │   │   ├── decoding_sm70_576.cu
│       │   │   │   ├── decoding_sm70_64.cu
│       │   │   │   ├── decoding_sm75_128.cu
│       │   │   │   ├── decoding_sm75_256.cu
│       │   │   │   ├── decoding_sm75_576.cu
│       │   │   │   ├── decoding_sm75_64.cu
│       │   │   │   ├── decoding_sm80_128.cu
│       │   │   │   ├── decoding_sm80_192.cu
│       │   │   │   ├── decoding_sm80_256.cu
│       │   │   │   ├── decoding_sm80_576.cu
│       │   │   │   └── decoding_sm80_64.cu
│       │   │   ├── kernel.h
│       │   │   ├── kernel_impl.h
│       │   │   ├── kv_cache_utils_v2.cu
│       │   │   ├── kv_cache_utils_v2.h
│       │   │   ├── linear_iterator.h
│       │   │   ├── mainloop.h
│       │   │   ├── mainloop_sm70.h
│       │   │   ├── mainloop_sm80.h
│       │   │   ├── quantization.h
│       │   │   ├── reduce.cu
│       │   │   ├── reduce.h
│       │   │   ├── reference.cu
│       │   │   ├── reference.h
│       │   │   ├── registrar.h
│       │   │   ├── registry.cu
│       │   │   ├── registry.h
│       │   │   ├── rotary_embedding.h
│       │   │   ├── test_attention.cu
│       │   │   ├── test_quant.cu
│       │   │   ├── test_utils.cu
│       │   │   ├── test_utils.h
│       │   │   ├── utils.cc
│       │   │   └── utils.h
│       │   ├── ban_bad_words.cu
│       │   ├── ban_bad_words.h
│       │   ├── core/
│       │   │   ├── array.h
│       │   │   ├── array_ops.h
│       │   │   ├── common.h
│       │   │   ├── data_type.h
│       │   │   ├── floating_point.h
│       │   │   ├── layout.h
│       │   │   ├── math.h
│       │   │   ├── meta.h
│       │   │   ├── mma.h
│       │   │   ├── pipe_iter.h
│       │   │   ├── smem.h
│       │   │   ├── sub_byte_ptr.h
│       │   │   ├── sync.h
│       │   │   └── thread_map.h
│       │   ├── decoding_kernels.cu
│       │   ├── decoding_kernels.h
│       │   ├── gemm/
│       │   │   ├── CMakeLists.txt
│       │   │   ├── arch/
│       │   │   │   ├── config_simt.h
│       │   │   │   ├── config_sm70_s884.h
│       │   │   │   ├── config_sm75_s16816.h
│       │   │   │   ├── config_sm80_s16816.h
│       │   │   │   ├── mma_simt.h
│       │   │   │   ├── mma_sm70.h
│       │   │   │   ├── mma_sm80.h
│       │   │   │   ├── operand_simt.h
│       │   │   │   ├── operand_sm70_s884.h
│       │   │   │   ├── operand_sm80_s16816.h
│       │   │   │   ├── smem_copy_simt.h
│       │   │   │   ├── smem_copy_sm70.h
│       │   │   │   └── smem_copy_sm80.h
│       │   │   ├── arch.h
│       │   │   ├── cast.cu
│       │   │   ├── cast.h
│       │   │   ├── context.cu
│       │   │   ├── context.h
│       │   │   ├── convert.cuh
│       │   │   ├── convert.h
│       │   │   ├── convert_v3.cu
│       │   │   ├── cp_async.h
│       │   │   ├── cta_map.h
│       │   │   ├── cublas.cu
│       │   │   ├── desc.h
│       │   │   ├── dispatch_cache.cu
│       │   │   ├── dispatch_cache.h
│       │   │   ├── epilogue.h
│       │   │   ├── format.h
│       │   │   ├── gemm.cu
│       │   │   ├── gemm.h
│       │   │   ├── gemm_universal.h
│       │   │   ├── gemm_universal_sm90.h
│       │   │   ├── gemm_universal_sm90_v2.h
│       │   │   ├── gemm_universal_sm90_v3.h
│       │   │   ├── gemm_universal_sm90_v4.h
│       │   │   ├── gemm_universal_sm90_v5.h
│       │   │   ├── gpu_metric.cu
│       │   │   ├── gpu_metric.h
│       │   │   ├── iterator.h
│       │   │   ├── iterator_sm70.h
│       │   │   ├── iterator_sm80.h
│       │   │   ├── iterator_sm90.h
│       │   │   ├── kernel/
│       │   │   │   ├── sm70_884_16.cu
│       │   │   │   ├── sm70_884_4.cu
│       │   │   │   ├── sm70_884_8.cu
│       │   │   │   ├── sm75_16816_16.cu
│       │   │   │   ├── sm75_16816_4.cu
│       │   │   │   ├── sm75_16816_8.cu
│       │   │   │   ├── sm80_16816_16.cu
│       │   │   │   ├── sm80_16816_4.cu
│       │   │   │   ├── sm80_16816_8.cu
│       │   │   │   ├── sm90_16816_16.cu
│       │   │   │   ├── sm90_16816_4.cu
│       │   │   │   ├── sm90_16816_8.cu
│       │   │   │   └── sm90_64n32_8.cu
│       │   │   ├── kernel.cu
│       │   │   ├── kernel.h
│       │   │   ├── kernel_impl.h
│       │   │   ├── kernel_impl_sm90.h
│       │   │   ├── mainloop_sm70.h
│       │   │   ├── mainloop_sm80_v2.h
│       │   │   ├── matrix_ptr.h
│       │   │   ├── moe_utils_v2.cu
│       │   │   ├── moe_utils_v2.h
│       │   │   ├── operand.h
│       │   │   ├── predicate.h
│       │   │   ├── registry.cu
│       │   │   ├── registry.h
│       │   │   ├── scaled_gmma_fp8_sm90.h
│       │   │   ├── scheduler.cuh
│       │   │   ├── scheduler_sm70.cuh
│       │   │   ├── simt.h
│       │   │   ├── sm90_utils.h
│       │   │   ├── smem_copy.h
│       │   │   ├── test/
│       │   │   │   ├── gemm_bench.cu
│       │   │   │   ├── models.h
│       │   │   │   ├── quantization.cu
│       │   │   │   ├── quantization.h
│       │   │   │   ├── quantization_impl.h
│       │   │   │   ├── reference.cu
│       │   │   │   ├── reference.h
│       │   │   │   ├── test_gemm_v2.cc
│       │   │   │   ├── test_moe_utils.cu
│       │   │   │   ├── test_utils.cu
│       │   │   │   ├── test_utils.h
│       │   │   │   └── testbed_v3.h
│       │   │   ├── thread_group_map.h
│       │   │   ├── thread_map.h
│       │   │   ├── tiled_mma.h
│       │   │   ├── tma.cu
│       │   │   ├── tma.h
│       │   │   ├── transform.h
│       │   │   ├── tuner/
│       │   │   │   ├── cache_utils.cu
│       │   │   │   ├── cache_utils.h
│       │   │   │   ├── measurer.cu
│       │   │   │   ├── measurer.h
│       │   │   │   ├── params.cc
│       │   │   │   ├── params.h
│       │   │   │   ├── sampler.cu
│       │   │   │   ├── sampler.h
│       │   │   │   ├── stats.h
│       │   │   │   ├── stopping_criterion.cc
│       │   │   │   └── stopping_criterion.h
│       │   │   ├── types.h
│       │   │   ├── unpack.cu
│       │   │   └── utils.h
│       │   ├── gpt_kernels.cu
│       │   ├── gpt_kernels.h
│       │   ├── logprob_kernels.cu
│       │   ├── logprob_kernels.h
│       │   ├── norm/
│       │   │   ├── CMakeLists.txt
│       │   │   ├── rms_norm.cu
│       │   │   └── rms_norm.h
│       │   ├── penalty_types.h
│       │   ├── quantization.cu
│       │   ├── quantization.cuh
│       │   ├── quantization.h
│       │   ├── reduce_kernel_utils.cuh
│       │   ├── sampling_kernels.cu
│       │   ├── sampling_kernels.h
│       │   ├── sampling_penalty_kernels.cu
│       │   ├── sampling_penalty_kernels.h
│       │   ├── sampling_topk_kernels.cu
│       │   ├── sampling_topk_kernels.h
│       │   ├── sampling_topp_kernels.cu
│       │   ├── sampling_topp_kernels.h
│       │   ├── stop_criteria_kernels.cu
│       │   ├── stop_criteria_kernels.h
│       │   ├── test_quantization.cc
│       │   ├── unfused_attention_kernels.cu
│       │   └── unfused_attention_kernels.h
│       ├── macro.h
│       ├── models/
│       │   ├── CMakeLists.txt
│       │   ├── input_processor.cc
│       │   ├── input_processor.h
│       │   ├── language_model.cc
│       │   ├── language_model.h
│       │   ├── llama/
│       │   │   ├── Barrier.h
│       │   │   ├── BlockManager.cc
│       │   │   ├── BlockManager.h
│       │   │   ├── BlockTrie.cc
│       │   │   ├── BlockTrie.h
│       │   │   ├── CMakeLists.txt
│       │   │   ├── GatedDeltaNetLayer.cc
│       │   │   ├── GatedDeltaNetLayer.h
│       │   │   ├── GatedDeltaNetWeight.cc
│       │   │   ├── GatedDeltaNetWeight.h
│       │   │   ├── LlamaDecoderLayerWeight.cc
│       │   │   ├── LlamaDecoderLayerWeight.h
│       │   │   ├── LlamaDenseWeight.cc
│       │   │   ├── LlamaDenseWeight.h
│       │   │   ├── LlamaFfnLayer.cc
│       │   │   ├── LlamaFfnLayer.h
│       │   │   ├── LlamaLinear.cu
│       │   │   ├── LlamaLinear.h
│       │   │   ├── LlamaWeight.cc
│       │   │   ├── LlamaWeight.h
│       │   │   ├── SequenceManager.cc
│       │   │   ├── SequenceManager.h
│       │   │   ├── bench_conv1d_silu.cc
│       │   │   ├── bench_gated_delta_net.cc
│       │   │   ├── context.h
│       │   │   ├── gated_delta_net_kernels.cu
│       │   │   ├── gated_delta_net_kernels.h
│       │   │   ├── llama_kernels.cu
│       │   │   ├── llama_kernels.h
│       │   │   ├── llama_params.h
│       │   │   ├── llama_rope.h
│       │   │   ├── llama_utils.cu
│       │   │   ├── llama_utils.h
│       │   │   ├── mla_utils.cu
│       │   │   ├── mla_utils.h
│       │   │   ├── moe_ffn_layer.cc
│       │   │   ├── moe_ffn_layer.h
│       │   │   ├── test_cache_manager.cc
│       │   │   ├── unified_attention_layer.cc
│       │   │   ├── unified_attention_layer.h
│       │   │   ├── unified_decoder.cc
│       │   │   └── unified_decoder.h
│       │   ├── output_processor.cc
│       │   └── output_processor.h
│       ├── python/
│       │   ├── CMakeLists.txt
│       │   ├── bind.cpp
│       │   ├── dlpack.h
│       │   └── xgrammar_bind.cpp
│       ├── turbomind.cc
│       ├── turbomind.h
│       └── utils/
│           ├── CMakeLists.txt
│           ├── anomaly_handler.cu
│           ├── anomaly_handler.h
│           ├── constant.h
│           ├── cuda_bf16_fallbacks.cuh
│           ├── cuda_bf16_wrapper.h
│           ├── cuda_type_utils.cuh
│           ├── cuda_utils.cc
│           ├── cuda_utils.h
│           ├── debug_utils.h
│           ├── dispatch.h
│           ├── logger.cc
│           ├── logger.h
│           ├── memory_utils.cu
│           ├── memory_utils.h
│           ├── metrics.h
│           ├── monotonic.h
│           ├── nvtx_utils.cc
│           ├── nvtx_utils.h
│           ├── parser.cc
│           ├── parser.h
│           ├── string_utils.h
│           └── test_utils.h
└── tests/
    ├── csrc/
    │   ├── CMakeLists.txt
    │   └── unittests/
    │       ├── CMakeLists.txt
    │       ├── gtest_utils.h
    │       ├── test_logprob_kernels.cu
    │       ├── test_penalty_kernels.cu
    │       ├── test_sampling_kernels.cu
    │       ├── test_sampling_layer.cu
    │       └── unittest_utils.h
    ├── pytorch/
    │   ├── config/
    │   │   └── test_hf_overrides.py
    │   ├── engine/
    │   │   ├── test_logits_process.py
    │   │   ├── test_request.py
    │   │   └── test_zmq_rpc.py
    │   ├── kernel/
    │   │   ├── test_activation.py
    │   │   ├── test_apply_rotary.py
    │   │   ├── test_bitonic_topk.py
    │   │   ├── test_causal_conv1d.py
    │   │   ├── test_ds_index.py
    │   │   ├── test_fill_kv_cache.py
    │   │   ├── test_flash_attention.py
    │   │   ├── test_flatten_kv_cache.py
    │   │   ├── test_fuse_moe_blocked_fp8.py
    │   │   ├── test_fused_lora.py
    │   │   ├── test_fused_moe.py
    │   │   ├── test_gated_delta_rule.py
    │   │   ├── test_gemm_fp8.py
    │   │   ├── test_moe_route.py
    │   │   ├── test_multinomial_sampling.py
    │   │   ├── test_paged_attention.py
    │   │   └── test_rms_norm.py
    │   ├── nn/
    │   │   └── test_embedding.py
    │   └── paging/
    │       ├── test_block_manager.py
    │       ├── test_block_trie.py
    │       └── test_scheduler.py
    └── test_lmdeploy/
        ├── test_auto_backend.py
        ├── test_content_merge.py
        ├── test_grammar.py
        ├── test_harmony_gpt_oss_parser.py
        ├── test_lite/
        │   └── test_quantization/
        │       └── test_utils/
        │           └── test_cal_qparams.py
        ├── test_messages.py
        ├── test_model.py
        ├── test_pipeline.py
        ├── test_qwen3_parser.py
        ├── test_qwen3coder_parser.py
        ├── test_tokenizer.py
        ├── test_turbomind/
        │   └── test_converter.py
        ├── test_utils.py
        └── test_vl/
            ├── test_hf_chat_template.py
            ├── test_nonhf_chat_template.py
            ├── test_qwen3vl_processor.py
            └── test_vl_encode.py
SYMBOL INDEX (7894 symbols across 838 files)

FILE: .github/scripts/action_tools.py
  function run_cmd (line 17) | def run_cmd(cmd_lines: List[str], log_path: str, cwd: str = None):
  function _append_summary (line 52) | def _append_summary(content):
  function add_summary (line 58) | def add_summary(csv_path: str):
  function evaluate (line 78) | def evaluate(models: List[str],
  function create_model_links (line 187) | def create_model_links(src_dir: str, dst_dir: str):
  function generate_benchmark_report (line 201) | def generate_benchmark_report(report_path: str):
  function generate_csv_from_profile_result (line 255) | def generate_csv_from_profile_result(file_path: str, out_path: str):
  function generate_output_for_evaluation (line 277) | def generate_output_for_evaluation(result_dir: str):
  function find_csv_files (line 291) | def find_csv_files(directory):

FILE: .github/scripts/check_lmdeploy.py
  function check_module_init (line 8) | def check_module_init(root: str):

FILE: .github/scripts/doc_link_checker.py
  function make_parser (line 9) | def make_parser():
  function analyze_doc (line 19) | def analyze_doc(home, path):
  function traverse (line 66) | def traverse(target):

FILE: autotest/benchmark/test_apiserver_performance.py
  function get_models (line 6) | def get_models(backend, parallel_config):
  function test_turbomind_apiserver_tp1 (line 14) | def test_turbomind_apiserver_tp1(config, run_config, worker_id):
  function test_turbomind_apiserver_tp2 (line 23) | def test_turbomind_apiserver_tp2(config, run_config, worker_id):
  function test_turbomind_apiserver_tp4 (line 32) | def test_turbomind_apiserver_tp4(config, run_config, worker_id):
  function test_turbomind_apiserver_tp8 (line 41) | def test_turbomind_apiserver_tp8(config, run_config, worker_id):
  function test_pytorch_apiserver_tp1 (line 50) | def test_pytorch_apiserver_tp1(config, run_config, worker_id):
  function test_pytorch_apiserver_tp2 (line 59) | def test_pytorch_apiserver_tp2(config, run_config, worker_id):
  function test_pytorch_apiserver_tp4 (line 68) | def test_pytorch_apiserver_tp4(config, run_config, worker_id):
  function test_pytorch_apiserver_tp8 (line 77) | def test_pytorch_apiserver_tp8(config, run_config, worker_id):
  function test_pytorch_apiserver_tp16 (line 86) | def test_pytorch_apiserver_tp16(config, run_config, worker_id):
  function test_restful_func_tp2 (line 131) | def test_restful_func_tp2(config, run_config, worker_id):

FILE: autotest/benchmark/test_longtext_performance.py
  function get_models (line 6) | def get_models(backend, parallel_config):
  function test_turbomind_longtext_throughput_tp1 (line 14) | def test_turbomind_longtext_throughput_tp1(config, run_config, worker_id):
  function test_turbomind_longtext_throughput_tp2 (line 23) | def test_turbomind_longtext_throughput_tp2(config, run_config, worker_id):
  function test_turbomind_longtext_throughput_tp4 (line 32) | def test_turbomind_longtext_throughput_tp4(config, run_config, worker_id):
  function test_turbomind_longtext_throughput_tp8 (line 41) | def test_turbomind_longtext_throughput_tp8(config, run_config, worker_id):
  function test_pytorch_longtext_throughput_tp1 (line 50) | def test_pytorch_longtext_throughput_tp1(config, run_config, worker_id):
  function test_pytorch_longtext_throughput_tp2 (line 59) | def test_pytorch_longtext_throughput_tp2(config, run_config, worker_id):
  function test_pytorch_longtext_throughput_tp4 (line 68) | def test_pytorch_longtext_throughput_tp4(config, run_config, worker_id):
  function test_pytorch_longtext_throughput_tp8 (line 77) | def test_pytorch_longtext_throughput_tp8(config, run_config, worker_id):
  function test_pytorch_longtext_throughput_tp16 (line 86) | def test_pytorch_longtext_throughput_tp16(config, run_config, worker_id):

FILE: autotest/benchmark/test_mllm_apiserver_performance.py
  function get_models (line 6) | def get_models(backend, parallel_config):
  function test_turbomind_mllm_apiserver_tp1 (line 14) | def test_turbomind_mllm_apiserver_tp1(config, run_config, worker_id):
  function test_turbomind_mllm_apiserver_tp2 (line 23) | def test_turbomind_mllm_apiserver_tp2(config, run_config, worker_id):
  function test_turbomind_mllm_apiserver_tp4 (line 32) | def test_turbomind_mllm_apiserver_tp4(config, run_config, worker_id):
  function test_turbomind_mllm_apiserver_tp8 (line 41) | def test_turbomind_mllm_apiserver_tp8(config, run_config, worker_id):
  function test_pytorch_mllm_apiserver_tp1 (line 50) | def test_pytorch_mllm_apiserver_tp1(config, run_config, worker_id):
  function test_pytorch_mllm_apiserver_tp2 (line 59) | def test_pytorch_mllm_apiserver_tp2(config, run_config, worker_id):
  function test_pytorch_mllm_apiserver_tp4 (line 68) | def test_pytorch_mllm_apiserver_tp4(config, run_config, worker_id):
  function test_pytorch_mllm_apiserver_tp8 (line 77) | def test_pytorch_mllm_apiserver_tp8(config, run_config, worker_id):
  function test_pytorch_mllm_apiserver_tp16 (line 86) | def test_pytorch_mllm_apiserver_tp16(config, run_config, worker_id):

FILE: autotest/benchmark/test_prefixcache_performance.py
  function get_models (line 6) | def get_models(backend, parallel_config):
  function test_turbomind_prefix_tp1 (line 14) | def test_turbomind_prefix_tp1(config, run_config, worker_id):
  function test_turbomind_prefix_tp2 (line 23) | def test_turbomind_prefix_tp2(config, run_config, worker_id):
  function test_turbomind_prefix_tp4 (line 32) | def test_turbomind_prefix_tp4(config, run_config, worker_id):
  function test_turbomind_prefix_tp8 (line 41) | def test_turbomind_prefix_tp8(config, run_config, worker_id):
  function test_pytorch_prefix_tp1 (line 50) | def test_pytorch_prefix_tp1(config, run_config, worker_id):
  function test_pytorch_prefix_tp2 (line 59) | def test_pytorch_prefix_tp2(config, run_config, worker_id):
  function test_pytorch_prefix_tp4 (line 68) | def test_pytorch_prefix_tp4(config, run_config, worker_id):
  function test_pytorch_prefix_tp8 (line 77) | def test_pytorch_prefix_tp8(config, run_config, worker_id):
  function test_pytorch_prefix_tp16 (line 86) | def test_pytorch_prefix_tp16(config, run_config, worker_id):
  function test_pytorch_prefix_pr_test_tp1 (line 113) | def test_pytorch_prefix_pr_test_tp1(config, run_config, worker_id):

FILE: autotest/benchmark/test_throughput_performance.py
  function get_models (line 6) | def get_models(backend, parallel_config):
  function test_turbomind_throughput_tp1 (line 16) | def test_turbomind_throughput_tp1(config, run_config, worker_id):
  function test_turbomind_throughput_tp2 (line 25) | def test_turbomind_throughput_tp2(config, run_config, worker_id):
  function test_turbomind_throughput_tp4 (line 34) | def test_turbomind_throughput_tp4(config, run_config, worker_id):
  function test_turbomind_throughput_tp8 (line 43) | def test_turbomind_throughput_tp8(config, run_config, worker_id):
  function test_pytorch_throughput_tp1 (line 52) | def test_pytorch_throughput_tp1(config, run_config, worker_id):
  function test_pytorch_throughput_tp2 (line 61) | def test_pytorch_throughput_tp2(config, run_config, worker_id):
  function test_pytorch_throughput_tp4 (line 70) | def test_pytorch_throughput_tp4(config, run_config, worker_id):
  function test_pytorch_throughput_tp8 (line 79) | def test_pytorch_throughput_tp8(config, run_config, worker_id):
  function test_pytorch_throughput_tp16 (line 88) | def test_pytorch_throughput_tp16(config, run_config, worker_id):
  function test_throughput_func_tp2 (line 114) | def test_throughput_func_tp2(config, run_config, worker_id):
  function test_throughput_prtest_tp1 (line 141) | def test_throughput_prtest_tp1(config, run_config, worker_id):

FILE: autotest/conftest.py
  function config (line 18) | def config():
  function cli_case_config (line 24) | def cli_case_config():
  function common_case_config (line 32) | def common_case_config():
  function shared_ray_manager (line 40) | def shared_ray_manager():
  function shared_proxy_manager (line 71) | def shared_proxy_manager():

FILE: autotest/evaluate/test_api_evaluate.py
  function _run_ray_distributed_test (line 13) | def _run_ray_distributed_test(
  function _run_proxy_distributed_test (line 59) | def _run_proxy_distributed_test(config,
  function run_eval_test (line 111) | def run_eval_test(config, run_config, worker_id, test_type='infer', eval...
  function get_models (line 192) | def get_models(backend, parallel_config):
  function test_turbomind_infer_tp1 (line 201) | def test_turbomind_infer_tp1(config, run_config, worker_id):
  function test_turbomind_infer_tp2 (line 210) | def test_turbomind_infer_tp2(config, run_config, worker_id):
  function test_turbomind_infer_tp4 (line 219) | def test_turbomind_infer_tp4(config, run_config, worker_id):
  function test_turbomind_infer_tp8 (line 228) | def test_turbomind_infer_tp8(config, run_config, worker_id):
  function test_turbomind_infer_cp2tp8 (line 237) | def test_turbomind_infer_cp2tp8(config, run_config, worker_id):
  function test_pytorch_restful_tp1 (line 247) | def test_pytorch_restful_tp1(config, run_config, worker_id):
  function test_pytorch_restful_tp2 (line 257) | def test_pytorch_restful_tp2(config, run_config, worker_id):
  function test_pytorch_restful_tp4 (line 267) | def test_pytorch_restful_tp4(config, run_config, worker_id):
  function test_pytorch_restful_tp8 (line 277) | def test_pytorch_restful_tp8(config, run_config, worker_id):
  function test_pytorch_restful_tp16 (line 287) | def test_pytorch_restful_tp16(config, run_config, worker_id):
  function test_pytorch_restful_distributed_tp16 (line 296) | def test_pytorch_restful_distributed_tp16(shared_ray_manager, config, ru...
  function test_pytorch_restful_distributed_dpep8 (line 309) | def test_pytorch_restful_distributed_dpep8(shared_proxy_manager, config,...
  function test_pytorch_restful_distributed_dpep16 (line 322) | def test_pytorch_restful_distributed_dpep16(shared_proxy_manager, config...
  function test_turbomind_eval_tp1 (line 335) | def test_turbomind_eval_tp1(config, run_config, worker_id):
  function test_turbomind_eval_tp2 (line 344) | def test_turbomind_eval_tp2(config, run_config, worker_id):
  function test_turbomind_eval_tp4 (line 353) | def test_turbomind_eval_tp4(config, run_config, worker_id):
  function test_turbomind_eval_tp8 (line 362) | def test_turbomind_eval_tp8(config, run_config, worker_id):
  function test_pytorch_eval_tp1 (line 372) | def test_pytorch_eval_tp1(config, run_config, worker_id):
  function test_pytorch_eval_tp2 (line 382) | def test_pytorch_eval_tp2(config, run_config, worker_id):
  function test_pytorch_eval_tp4 (line 392) | def test_pytorch_eval_tp4(config, run_config, worker_id):
  function test_pytorch_eval_tp8 (line 402) | def test_pytorch_eval_tp8(config, run_config, worker_id):
  function test_pytorch_eval_tp16 (line 412) | def test_pytorch_eval_tp16(config, run_config, worker_id):
  function test_pytorch_eval_distributed_tp16 (line 421) | def test_pytorch_eval_distributed_tp16(config, run_config, worker_id):
  function test_pytorch_eval_distributed_dpep8 (line 430) | def test_pytorch_eval_distributed_dpep8(config, run_config, worker_id):
  function test_pytorch_eval_distributed_dpep16 (line 439) | def test_pytorch_eval_distributed_dpep16(config, run_config, worker_id):
  function test_turbomind_eval_cp2tp8 (line 448) | def test_turbomind_eval_cp2tp8(config, run_config, worker_id):

FILE: autotest/evaluate/test_mllm_api_evaluate.py
  function run_eval_test (line 10) | def run_eval_test(config, run_config, worker_id, test_type='infer', eval...
  function get_models (line 69) | def get_models(backend, parallel_config):
  function test_turbomind_vl_eval_tp1 (line 85) | def test_turbomind_vl_eval_tp1(config, run_config, worker_id):
  function test_turbomind_vl_eval_tp2 (line 94) | def test_turbomind_vl_eval_tp2(config, run_config, worker_id):
  function test_turbomind_vl_eval_tp4 (line 103) | def test_turbomind_vl_eval_tp4(config, run_config, worker_id):
  function test_turbomind_vl_eval_tp8 (line 112) | def test_turbomind_vl_eval_tp8(config, run_config, worker_id):
  function test_pytorch_vl_eval_tp1 (line 122) | def test_pytorch_vl_eval_tp1(config, run_config, worker_id):
  function test_pytorch_vl_eval_tp2 (line 132) | def test_pytorch_vl_eval_tp2(config, run_config, worker_id):
  function test_pytorch_vl_eval_tp4 (line 142) | def test_pytorch_vl_eval_tp4(config, run_config, worker_id):
  function test_pytorch_vl_eval_tp8 (line 152) | def test_pytorch_vl_eval_tp8(config, run_config, worker_id):
  function test_pytorch_vl_eval_tp16 (line 162) | def test_pytorch_vl_eval_tp16(config, run_config, worker_id):
  function test_turbomind_eval_tp1 (line 171) | def test_turbomind_eval_tp1(config, run_config, worker_id):
  function test_turbomind_eval_tp2 (line 180) | def test_turbomind_eval_tp2(config, run_config, worker_id):
  function test_turbomind_eval_tp4 (line 189) | def test_turbomind_eval_tp4(config, run_config, worker_id):
  function test_turbomind_eval_tp8 (line 198) | def test_turbomind_eval_tp8(config, run_config, worker_id):
  function test_pytorch_eval_tp1 (line 208) | def test_pytorch_eval_tp1(config, run_config, worker_id):
  function test_pytorch_eval_tp2 (line 218) | def test_pytorch_eval_tp2(config, run_config, worker_id):
  function test_pytorch_eval_tp4 (line 228) | def test_pytorch_eval_tp4(config, run_config, worker_id):
  function test_pytorch_eval_tp8 (line 238) | def test_pytorch_eval_tp8(config, run_config, worker_id):
  function test_pytorch_eval_tp16 (line 248) | def test_pytorch_eval_tp16(config, run_config, worker_id):

FILE: autotest/interface/pipeline/test_pipeline_func.py
  function init_pipeline (line 15) | def init_pipeline(model_path, backend_config):
  function run_case_in_spawn (line 21) | def run_case_in_spawn(worker_id, target, args):
  function run_pipeline_testcase_prompt (line 33) | def run_pipeline_testcase_prompt(config, model, backend, file_name):
  function run_pipeline_testcase_prompt_stream (line 43) | def run_pipeline_testcase_prompt_stream(config, model, backend, file_name):
  function run_pipeline_testcase_multi_prompt (line 55) | def run_pipeline_testcase_multi_prompt(config, model, backend, file_name):
  function run_pipeline_testcase_multi_prompt_stream (line 65) | def run_pipeline_testcase_multi_prompt_stream(config, model, backend, fi...
  function run_pipeline_testcase_message (line 77) | def run_pipeline_testcase_message(config, model, backend, file_name):
  function run_pipeline_testcase_message_stream (line 88) | def run_pipeline_testcase_message_stream(config, model, backend, file_na...
  function run_pipeline_testcase_message_batch (line 101) | def run_pipeline_testcase_message_batch(config, model, backend, file_name):
  function run_pipeline_testcase_message_batch_stream (line 112) | def run_pipeline_testcase_message_batch_stream(config, model, backend, f...
  function run_pipeline_testcase_logprobs (line 125) | def run_pipeline_testcase_logprobs(config, model, backend, file_name):
  function run_pipeline_testcase_logprobs_stream (line 136) | def run_pipeline_testcase_logprobs_stream(config, model, backend, file_n...
  function run_pipeline_testcase_session_len (line 149) | def run_pipeline_testcase_session_len(config, model, backend, file_name):
  function run_pipeline_testcase_min_new_tokens (line 163) | def run_pipeline_testcase_min_new_tokens(config, model, backend, file_na...
  function run_pipeline_testcase_stop_words (line 177) | def run_pipeline_testcase_stop_words(config, model, backend, file_name):
  function run_pipeline_testcase_bad_words (line 192) | def run_pipeline_testcase_bad_words(config, model, backend, file_name):
  function run_pipeline_testcase_special_words_false (line 205) | def run_pipeline_testcase_special_words_false(config, model, backend, fi...
  function run_pipeline_testcase_special_words_true (line 225) | def run_pipeline_testcase_special_words_true(config, model, backend, fil...
  function run_pipeline_testcase_repetition_penalty (line 245) | def run_pipeline_testcase_repetition_penalty(config, model, backend, fil...
  function run_pipeline_testcase_repetition_penalty_bigger (line 256) | def run_pipeline_testcase_repetition_penalty_bigger(config, model, backe...
  function run_pipeline_testcase_min_top_p (line 267) | def run_pipeline_testcase_min_top_p(config, model, backend, file_name):
  function run_pipeline_testcase_min_top_k (line 278) | def run_pipeline_testcase_min_top_k(config, model, backend, file_name):
  function run_pipeline_testcase_diff_random_seed (line 291) | def run_pipeline_testcase_diff_random_seed(config, model, backend, file_...
  function run_pipeline_testcase_same_random_seed (line 304) | def run_pipeline_testcase_same_random_seed(config, model, backend, file_...
  function run_pipeline_testcase_do_sample_batch (line 317) | def run_pipeline_testcase_do_sample_batch(config, model, backend, file_n...
  function run_pipeline_testcase_max_new_tokens (line 328) | def run_pipeline_testcase_max_new_tokens(config, model, backend, file_na...
  function run_pipeline_testcase_ignore_eos (line 342) | def run_pipeline_testcase_ignore_eos(config, model, backend, file_name):
  function test_return_with_prompt (line 358) | def test_return_with_prompt(config, model, backend, worker_id):
  function test_return_with_prompt_stream (line 367) | def test_return_with_prompt_stream(config, model, backend, worker_id):
  function test_return_with_multi_prompt (line 376) | def test_return_with_multi_prompt(config, model, backend, worker_id):
  function test_return_with_multi_prompt_stream (line 385) | def test_return_with_multi_prompt_stream(config, model, backend, worker_...
  function test_return_with_message (line 394) | def test_return_with_message(config, model, backend, worker_id):
  function test_return_with_message_stream (line 402) | def test_return_with_message_stream(config, model, backend, worker_id):
  function test_return_with_message_batch (line 410) | def test_return_with_message_batch(config, model, backend, worker_id):
  function test_return_with_message_batch_stream (line 418) | def test_return_with_message_batch_stream(config, model, backend, worker...
  function test_return_check_logprobs (line 426) | def test_return_check_logprobs(config, model, backend, worker_id):
  function test_return_check_logprobs_stream (line 434) | def test_return_check_logprobs_stream(config, model, backend, worker_id):
  function test_backend_config_session_len (line 442) | def test_backend_config_session_len(config, model, backend, worker_id):
  function test_gen_config_min_new_tokens (line 450) | def test_gen_config_min_new_tokens(config, model, backend, worker_id):
  function test_gen_config_stop_words (line 458) | def test_gen_config_stop_words(config, model, backend, worker_id):
  function test_gen_config_bad_words (line 466) | def test_gen_config_bad_words(config, model, backend, worker_id):
  function test_gen_config_special_words_false (line 474) | def test_gen_config_special_words_false(config, model, backend, worker_id):
  function test_gen_config_special_words_true (line 482) | def test_gen_config_special_words_true(config, model, backend, worker_id):
  function test_gen_config_minimum_repetition_penalty (line 490) | def test_gen_config_minimum_repetition_penalty(config, model, backend, w...
  function test_gen_config_repetition_penalty_bigger_than_1 (line 498) | def test_gen_config_repetition_penalty_bigger_than_1(config, model, back...
  function test_gen_config_minimun_topp (line 506) | def test_gen_config_minimun_topp(config, model, backend, worker_id):
  function test_gen_config_minimun_topk (line 514) | def test_gen_config_minimun_topk(config, model, backend, worker_id):
  function test_gen_config_diff_random_seed (line 522) | def test_gen_config_diff_random_seed(config, model, backend, worker_id):
  function test_gen_config_same_random_seed (line 530) | def test_gen_config_same_random_seed(config, model, backend, worker_id):
  function test_gen_config_do_sample_batch (line 538) | def test_gen_config_do_sample_batch(config, model, backend, worker_id):
  function test_gen_config_max_new_tokens (line 546) | def test_gen_config_max_new_tokens(config, model, backend, worker_id):
  function test_gen_config_ignore_eos (line 554) | def test_gen_config_ignore_eos(config, model, backend, worker_id):
  function test_backend_config_input_validation (line 562) | def test_backend_config_input_validation(config, model, backend, worker_...
  function test_backend_config_validate_turbomind (line 599) | def test_backend_config_validate_turbomind(config, model, backend, worke...
  function test_backend_config_validate_pytorch (line 637) | def test_backend_config_validate_pytorch(config, model, backend, worker_...
  function test_backend_config_tp (line 667) | def test_backend_config_tp(config, model, backend, worker_id):

FILE: autotest/interface/pipeline/test_pipeline_longtext_func.py
  function run_case_in_spawn (line 24) | def run_case_in_spawn(target, args):
  function test_history_issue_tp1 (line 33) | def test_history_issue_tp1(config, model, worker_id):
  function test_history_issue_tp2 (line 43) | def test_history_issue_tp2(config, model, worker_id):
  function stream_infer_worker (line 52) | def stream_infer_worker(config, model, tp_num):
  function test_long_test_passkey_tp1 (line 77) | def test_long_test_passkey_tp1(config, model, backend, worker_id):
  function test_long_test_passkey_tp2 (line 90) | def test_long_test_passkey_tp2(config, model, backend, worker_id):
  function test_long_test_passkey_tp8 (line 104) | def test_long_test_passkey_tp8(config, model, backend, worker_id):
  function passkey_retrival_worker (line 125) | def passkey_retrival_worker(config, model, backend, log_name, tp_num, se...
  function get_passkey_prompt (line 177) | def get_passkey_prompt(pipe, session_len):

FILE: autotest/interface/restful/test_restful_chat_completions_v1.py
  class TestRestfulInterfaceBase (line 22) | class TestRestfulInterfaceBase:
    method test_get_model (line 25) | def test_get_model(self, config, backend, model_case):
    method test_encode_s1 (line 34) | def test_encode_s1(self, backend, model_case):
    method test_encode (line 54) | def test_encode(self, backend, model_case):
  class TestRestfulInterfaceChatCompletions (line 78) | class TestRestfulInterfaceChatCompletions:
    method test_return_info_with_prompt (line 80) | def test_return_info_with_prompt(self, backend, model_case):
    method test_return_info_with_messegae (line 94) | def test_return_info_with_messegae(self, backend, model_case):
    method test_return_info_with_prompt_streaming (line 106) | def test_return_info_with_prompt_streaming(self, backend, model_case):
    method test_return_info_with_messegae_streaming (line 125) | def test_return_info_with_messegae_streaming(self, backend, model_case):
    method test_single_stopword (line 142) | def test_single_stopword(self, backend, model_case):
    method test_single_stopword_streaming (line 159) | def test_single_stopword_streaming(self, backend, model_case):
    method test_array_stopwords (line 181) | def test_array_stopwords(self, backend, model_case):
    method test_array_stopwords_streaming (line 200) | def test_array_stopwords_streaming(self, backend, model_case):
    method test_special_words (line 225) | def test_special_words(self, backend, model_case):
    method test_minimum_repetition_penalty (line 253) | def test_minimum_repetition_penalty(self, backend, model_case):
    method test_minimum_repetition_penalty_streaming (line 272) | def test_minimum_repetition_penalty_streaming(self, backend, model_case):
    method test_repetition_penalty_bigger_than_1 (line 297) | def test_repetition_penalty_bigger_than_1(self, backend, model_case):
    method test_repetition_penalty_bigger_than_1_streaming (line 313) | def test_repetition_penalty_bigger_than_1_streaming(self, backend, mod...
    method test_minimum_topp (line 334) | def test_minimum_topp(self, backend, model_case):
    method test_minimum_topp_streaming (line 355) | def test_minimum_topp_streaming(self, backend, model_case):
    method test_mistake_modelname_return (line 381) | def test_mistake_modelname_return(self, backend, model_case):
    method test_mistake_modelname_return_streaming (line 396) | def test_mistake_modelname_return_streaming(self, backend, model_case):
    method test_mutilple_times_response_should_not_same (line 415) | def test_mutilple_times_response_should_not_same(self, backend, model_...
    method test_mutilple_times_response_should_not_same_streaming (line 434) | def test_mutilple_times_response_should_not_same_streaming(self, backe...
    method test_longtext_input (line 458) | def test_longtext_input(self, backend, model_case):
    method test_longtext_input_streaming (line 473) | def test_longtext_input_streaming(self, backend, model_case):
    method test_ignore_eos (line 492) | def test_ignore_eos(self, backend, model_case):
    method test_ignore_eos_streaming (line 511) | def test_ignore_eos_streaming(self, backend, model_case):
    method __test_max_tokens_or_max_completion_tokens (line 536) | def __test_max_tokens_or_max_completion_tokens(
    method test_max_tokens (line 572) | def test_max_tokens(self, backend, model_case):
    method test_max_completion_tokens (line 575) | def test_max_completion_tokens(self, backend, model_case):
    method __test_max_tokens_streaming_or_max_completion_tokens_streaming (line 578) | def __test_max_tokens_streaming_or_max_completion_tokens_streaming(
    method test_max_tokens_streaming (line 622) | def test_max_tokens_streaming(self, backend, model_case):
    method test_max_completion_tokens_streaming (line 625) | def test_max_completion_tokens_streaming(self, backend, model_case):
    method test_logprobs (line 629) | def test_logprobs(self, backend, model_case):
    method test_logprobs_streaming (line 649) | def test_logprobs_streaming(self, backend, model_case):
  class TestRestfulOpenAI (line 680) | class TestRestfulOpenAI:
    method test_return_info (line 683) | def test_return_info(self, backend, model_case):
    method test_return_info_streaming (line 699) | def test_return_info_streaming(self, backend, model_case):
    method test_single_stopword (line 720) | def test_single_stopword(self, backend, model_case):
    method test_single_stopword_streaming (line 739) | def test_single_stopword_streaming(self, backend, model_case):
    method test_array_stopwords (line 763) | def test_array_stopwords(self, backend, model_case):
    method test_array_stopwords_streaming (line 785) | def test_array_stopwords_streaming(self, backend, model_case):
    method test_minimum_topp (line 812) | def test_minimum_topp(self, backend, model_case):
    method test_minimum_topp_streaming (line 835) | def test_minimum_topp_streaming(self, backend, model_case):
    method test_mistake_modelname_return (line 863) | def test_mistake_modelname_return(self, backend, model_case):
    method test_mistake_modelname_return_streaming (line 878) | def test_mistake_modelname_return_streaming(self, backend, model_case):
    method test_mutilple_times_response_should_not_same (line 894) | def test_mutilple_times_response_should_not_same(self, backend, model_...
    method test_mutilple_times_response_should_not_same_streaming (line 914) | def test_mutilple_times_response_should_not_same_streaming(self, backe...
    method test_longtext_input (line 940) | def test_longtext_input(self, backend, model_case):
    method test_longtext_input_streaming (line 958) | def test_longtext_input_streaming(self, backend, model_case):
    method test_max_tokens (line 983) | def test_max_tokens(self, backend, model_case):
    method test_max_tokens_streaming (line 1000) | def test_max_tokens_streaming(self, backend, model_case):
    method test_logprobs (line 1031) | def test_logprobs(self, backend, model_case):
    method test_logprobs_streaming (line 1052) | def test_logprobs_streaming(self, backend, model_case):
    method test_input_validation (line 1083) | def test_input_validation(self, backend, model_case):
    method test_input_validation_streaming (line 1116) | def test_input_validation_streaming(self, backend, model_case):
    method test_disable_think (line 1150) | def test_disable_think(self, backend, model_case):
    method test_disable_think_with_image (line 1183) | def test_disable_think_with_image(self, backend, model_case):

FILE: autotest/interface/restful/test_restful_completions_v1.py
  class TestRestfulInterfaceBase (line 15) | class TestRestfulInterfaceBase:
    method test_get_model (line 18) | def test_get_model(self, config, backend, model_case):
    method test_encode (line 24) | def test_encode(self, backend, model_case):
    method test_return (line 42) | def test_return(self, backend, model_case):
    method test_return_streaming (line 58) | def test_return_streaming(self, backend, model_case):
    method test_max_tokens (line 72) | def test_max_tokens(self, backend, model_case):
    method test_single_stopword (line 85) | def test_single_stopword(self, backend, model_case):
    method test_array_stopwords (line 96) | def test_array_stopwords(self, backend, model_case):
    method test_completions_stream (line 109) | def test_completions_stream(self, backend, model_case):
    method test_completions_stream_stopword (line 127) | def test_completions_stream_stopword(self, backend, model_case):
    method test_completions_stream_stopwords (line 151) | def test_completions_stream_stopwords(self, backend, model_case):
    method test_batch_prompt_order (line 177) | def test_batch_prompt_order(self, backend, model_case):

FILE: autotest/interface/restful/test_restful_generate.py
  class TestGenerateComprehensive (line 22) | class TestGenerateComprehensive:
    method setup_api (line 25) | def setup_api(self, request, config, model_name, backend):
    method _log_request_response (line 38) | def _log_request_response(self, payload, response_data, stream_raw=None):
    method _post (line 55) | def _post(self, payload, stream=False):
    method _validate_generation_response (line 117) | def _validate_generation_response(self,
    method test_basic_generation (line 235) | def test_basic_generation(self):
    method test_input_ids_mode (line 294) | def test_input_ids_mode(self, config):
    method test_conflict_prompt_and_input_ids (line 349) | def test_conflict_prompt_and_input_ids(self):
    method test_input_ids_with_logprob (line 437) | def test_input_ids_with_logprob(self, config):
    method test_stop_str_with_include_flag (line 497) | def test_stop_str_with_include_flag(self):
    method test_streaming_mode (line 542) | def test_streaming_mode(self):
    method test_streaming_incremental_correctness (line 572) | def test_streaming_incremental_correctness(self):
    method test_return_logprob (line 625) | def test_return_logprob(self):
    method test_same_session_id_allowed (line 635) | def test_same_session_id_allowed(self):
    method test_empty_prompt_rejected (line 658) | def test_empty_prompt_rejected(self):
    method test_input_ids_rejected (line 673) | def test_input_ids_rejected(self):
    method test_stress_concurrent_requests (line 706) | def test_stress_concurrent_requests(self):
    method test_stress_long_prompt_and_generation (line 761) | def test_stress_long_prompt_and_generation(self):
    method test_stress_streaming_under_load (line 771) | def test_stress_streaming_under_load(self):
    method test_temperature_parameter (line 824) | def test_temperature_parameter(self):
    method test_top_p_parameter (line 844) | def test_top_p_parameter(self):
    method test_top_k_parameter (line 857) | def test_top_k_parameter(self):
    method test_min_p_parameter (line 870) | def test_min_p_parameter(self):
    method test_repetition_penalty (line 878) | def test_repetition_penalty(self):
    method test_ignore_eos_parameter (line 900) | def test_ignore_eos_parameter(self):
    method test_skip_special_tokens (line 917) | def test_skip_special_tokens(self, config):
    method test_stop_token_ids (line 941) | def test_stop_token_ids(self):
    method test_combined_parameters (line 968) | def test_combined_parameters(self):
    method test_streaming_with_all_parameters (line 984) | def test_streaming_with_all_parameters(self):
    method test_invalid_temperature_values (line 1008) | def test_invalid_temperature_values(self):
    method test_invalid_top_p_values (line 1019) | def test_invalid_top_p_values(self):
    method test_invalid_top_k_values (line 1027) | def test_invalid_top_k_values(self):
    method test_boundary_max_tokens (line 1035) | def test_boundary_max_tokens(self):
    method test_parameter_interactions (line 1057) | def test_parameter_interactions(self):
    method test_session_id_with_all_parameters (line 1074) | def test_session_id_with_all_parameters(self):
    method test_edge_cases_stop_conditions (line 1105) | def test_edge_cases_stop_conditions(self):
    method test_spaces_between_special_tokens (line 1134) | def test_spaces_between_special_tokens(self, config):
    method test_request_returns_experts (line 1160) | def test_request_returns_experts(self):

FILE: autotest/toolchain/test_lagent.py
  function test_repeat (line 8) | def test_repeat(config, model):

FILE: autotest/tools/chat/test_command_chat_hf_pytorch.py
  function test_hf_pytorch_chat_tp1 (line 15) | def test_hf_pytorch_chat_tp1(config, run_config, cli_case_config, worker...
  function test_hf_pytorch_chat_tp2 (line 23) | def test_hf_pytorch_chat_tp2(config, run_config, cli_case_config, worker...
  function test_hf_pytorch_chat_tp4 (line 31) | def test_hf_pytorch_chat_tp4(config, run_config, cli_case_config, worker...
  function test_hf_pytorch_chat_tp8 (line 39) | def test_hf_pytorch_chat_tp8(config, run_config, cli_case_config, worker...
  function test_hf_pytorch_chat_tp16 (line 47) | def test_hf_pytorch_chat_tp16(config, run_config, cli_case_config, worke...
  function test_hf_pytorch_base_tp1 (line 55) | def test_hf_pytorch_base_tp1(config, run_config, cli_case_config, worker...
  function test_hf_pytorch_base_tp2 (line 63) | def test_hf_pytorch_base_tp2(config, run_config, cli_case_config, worker...
  function test_hf_pytorch_chat_pr_tp2 (line 71) | def test_hf_pytorch_chat_pr_tp2(config, run_config, cli_case_config, wor...
  function test_hf_pytorch_chat_pr_tp1 (line 80) | def test_hf_pytorch_chat_pr_tp1(config, run_config, cli_case_config, wor...
  function test_modelscope_pytorch_chat_tp1 (line 88) | def test_modelscope_pytorch_chat_tp1(config, run_config, cli_case_config...
  function test_pytorch_chat_with_lora_tp1 (line 99) | def test_pytorch_chat_with_lora_tp1(config, run_config, cli_case_config,...
  function test_pytorch_chat_with_lora_tp2 (line 109) | def test_pytorch_chat_with_lora_tp2(config, run_config, cli_case_config,...

FILE: autotest/tools/chat/test_command_chat_hf_turbomind.py
  function test_hf_turbomind_chat_tp1 (line 15) | def test_hf_turbomind_chat_tp1(config, run_config, cli_case_config, work...
  function test_hf_turbomind_chat_tp2 (line 22) | def test_hf_turbomind_chat_tp2(config, run_config, cli_case_config, work...
  function test_hf_turbomind_chat_tp4 (line 29) | def test_hf_turbomind_chat_tp4(config, run_config, cli_case_config, work...
  function test_hf_turbomind_chat_tp8 (line 36) | def test_hf_turbomind_chat_tp8(config, run_config, cli_case_config, work...
  function test_hf_turbomind_chat_fallback_backend_tp1 (line 43) | def test_hf_turbomind_chat_fallback_backend_tp1(config, run_config, cli_...
  function test_hf_turbomind_chat_fallback_backend_tp2 (line 50) | def test_hf_turbomind_chat_fallback_backend_tp2(config, run_config, cli_...
  function test_hf_turbomind_base_tp1 (line 57) | def test_hf_turbomind_base_tp1(config, run_config, cli_case_config, work...
  function test_hf_turbomind_base_tp2 (line 64) | def test_hf_turbomind_base_tp2(config, run_config, cli_case_config, work...
  function test_hf_turbomind_chat_pr_tp2 (line 72) | def test_hf_turbomind_chat_pr_tp2(config, run_config, cli_case_config, w...
  function test_hf_turbomind_chat_pr_tp1 (line 81) | def test_hf_turbomind_chat_pr_tp1(config, run_config, cli_case_config, w...
  function test_modelscope_turbomind_chat_tp1 (line 89) | def test_modelscope_turbomind_chat_tp1(config, run_config, cli_case_conf...

FILE: autotest/tools/pipeline/llm_case.py
  function run_pipeline_chat_test (line 13) | def run_pipeline_chat_test(model_path, run_config, cases_path, is_pr_tes...

FILE: autotest/tools/pipeline/mllm_case.py
  function run_pipeline_mllm_test (line 23) | def run_pipeline_mllm_test(model_path, run_config, resource_path, is_pr_...
  function internvl_vl_testcase (line 125) | def internvl_vl_testcase(pipe, resource_path, lang='en'):
  function MiniCPM_vl_testcase (line 245) | def MiniCPM_vl_testcase(pipe, resource_path):
  function Qwen_vl_testcase (line 343) | def Qwen_vl_testcase(pipe, resource_path):

FILE: autotest/tools/pipeline/test_pipeline_chat_pytorch_llm.py
  function test_pipeline_chat_tp1 (line 16) | def test_pipeline_chat_tp1(config, run_config, common_case_config, worke...
  function test_pipeline_chat_tp2 (line 24) | def test_pipeline_chat_tp2(config, run_config, common_case_config, worke...
  function test_pipeline_chat_tp4 (line 32) | def test_pipeline_chat_tp4(config, run_config, common_case_config, worke...
  function test_pipeline_chat_tp8 (line 40) | def test_pipeline_chat_tp8(config, run_config, common_case_config, worke...
  function test_pipeline_chat_tp16 (line 48) | def test_pipeline_chat_tp16(config, run_config, common_case_config, work...
  function test_pipeline_chat_pytorch_prefix_cache_tp2 (line 56) | def test_pipeline_chat_pytorch_prefix_cache_tp2(config, run_config, comm...
  function test_hf_pytorch_chat_pr_tp2 (line 64) | def test_hf_pytorch_chat_pr_tp2(config, run_config, common_case_config, ...
  function test_hf_pytorch_chat_pr_tp1 (line 73) | def test_hf_pytorch_chat_pr_tp1(config, run_config, common_case_config, ...
  function test_modelscope_pipeline_chat_tp1 (line 81) | def test_modelscope_pipeline_chat_tp1(config, run_config, common_case_co...
  function test_pytorch_chat_with_lora_tp1 (line 89) | def test_pytorch_chat_with_lora_tp1(config, run_config, common_case_conf...
  function test_pytorch_chat_with_lora_tp2 (line 96) | def test_pytorch_chat_with_lora_tp2(config, run_config, common_case_conf...
  function test_pipeline_chat_speculative_decoding_tp1 (line 105) | def test_pipeline_chat_speculative_decoding_tp1(config, run_config, comm...

FILE: autotest/tools/pipeline/test_pipeline_chat_pytorch_mllm.py
  function get_models (line 8) | def get_models(parallel_config):
  function test_restful_chat_tp1 (line 15) | def test_restful_chat_tp1(config, run_config, worker_id):
  function test_restful_chat_tp2 (line 21) | def test_restful_chat_tp2(config, run_config, worker_id):
  function test_restful_chat_tp4 (line 27) | def test_restful_chat_tp4(config, run_config, worker_id):
  function test_restful_chat_tp8 (line 33) | def test_restful_chat_tp8(config, run_config, worker_id):
  function test_restful_chat_tp16 (line 39) | def test_restful_chat_tp16(config, run_config, worker_id):

FILE: autotest/tools/pipeline/test_pipeline_chat_turbomind_llm.py
  function test_pipeline_chat_tp1 (line 15) | def test_pipeline_chat_tp1(config, run_config, common_case_config, worke...
  function test_pipeline_chat_tp2 (line 22) | def test_pipeline_chat_tp2(config, run_config, common_case_config, worke...
  function test_pipeline_chat_tp4 (line 29) | def test_pipeline_chat_tp4(config, run_config, common_case_config, worke...
  function test_pipeline_chat_tp8 (line 36) | def test_pipeline_chat_tp8(config, run_config, common_case_config, worke...
  function test_pipeline_chat_prefix_cache_tp2 (line 43) | def test_pipeline_chat_prefix_cache_tp2(config, run_config, common_case_...
  function test_pipeline_chat_fallback_backend_tp1 (line 50) | def test_pipeline_chat_fallback_backend_tp1(config, run_config, common_c...
  function test_pipeline_chat_fallback_backend_tp2 (line 58) | def test_pipeline_chat_fallback_backend_tp2(config, run_config, common_c...
  function test_pipeline_chat_pr_tp2 (line 68) | def test_pipeline_chat_pr_tp2(config, run_config, common_case_config, wo...
  function test_pipeline_chat_pr_tp1 (line 79) | def test_pipeline_chat_pr_tp1(config, run_config, common_case_config, wo...
  function test_modelscope_restful_chat_tp1 (line 88) | def test_modelscope_restful_chat_tp1(config, run_config, common_case_con...

FILE: autotest/tools/pipeline/test_pipeline_chat_turbomind_mllm.py
  function get_models (line 10) | def get_models(parallel_config):
  function test_restful_chat_tp1 (line 17) | def test_restful_chat_tp1(config, run_config, worker_id):
  function test_restful_chat_tp2 (line 23) | def test_restful_chat_tp2(config, run_config, worker_id):
  function test_restful_chat_tp4 (line 29) | def test_restful_chat_tp4(config, run_config, worker_id):
  function test_restful_chat_tp8 (line 35) | def test_restful_chat_tp8(config, run_config, worker_id):
  function test_restful_chat_tp16 (line 41) | def test_restful_chat_tp16(config, run_config, worker_id):
  function test_restful_chat_fallback_backend_tp1 (line 48) | def test_restful_chat_fallback_backend_tp1(config, run_config, worker_id):
  function test_pipeline_pr_test (line 56) | def test_pipeline_pr_test(config, run_config, worker_id):
  function test_pipeline_pr_tp2_test (line 65) | def test_pipeline_pr_tp2_test(config, run_config, worker_id):

FILE: autotest/tools/quantization/test_quantization_awq.py
  function test_quantization_awq (line 13) | def test_quantization_awq(config, model, worker_id):
  function test_quantization_gptq (line 22) | def test_quantization_gptq(config, model, worker_id):
  function test_quantization_awq_pr (line 34) | def test_quantization_awq_pr(config, model):
  function quantization_all (line 39) | def quantization_all(config, quantization_model_name, origin_model_name,...

FILE: autotest/tools/quantization/test_quantization_w8a8.py
  function test_quantization_w8a8 (line 13) | def test_quantization_w8a8(config, model, worker_id):
  function quantization_w8a8 (line 17) | def quantization_w8a8(config, quantization_model_name, origin_model_name...

FILE: autotest/tools/restful/test_restful_chat_hf_pytorch_llm.py
  function _run_ray_distributed_test (line 16) | def _run_ray_distributed_test(
  function _run_proxy_distributed_test (line 41) | def _run_proxy_distributed_test(
  function test_restful_chat_tp1 (line 74) | def test_restful_chat_tp1(config, run_config, common_case_config, worker...
  function test_restful_chat_tp2 (line 82) | def test_restful_chat_tp2(config, run_config, common_case_config, worker...
  function test_restful_chat_tp4 (line 90) | def test_restful_chat_tp4(config, run_config, common_case_config, worker...
  function test_restful_chat_tp8 (line 98) | def test_restful_chat_tp8(config, run_config, common_case_config, worker...
  function test_restful_chat_tp16 (line 106) | def test_restful_chat_tp16(config, run_config, common_case_config, worke...
  function test_restful_chat_distributed_tp16 (line 115) | def test_restful_chat_distributed_tp16(shared_ray_manager, config, run_c...
  function test_restful_chat_distributed_dpep16 (line 127) | def test_restful_chat_distributed_dpep16(shared_proxy_manager, config, r...
  function test_restful_chat_pytorch_prefix_cache_tp2 (line 138) | def test_restful_chat_pytorch_prefix_cache_tp2(config, run_config, commo...
  function test_hf_pytorch_chat_pr_tp2 (line 146) | def test_hf_pytorch_chat_pr_tp2(config, run_config, common_case_config, ...
  function test_hf_pytorch_chat_pr_tp1 (line 155) | def test_hf_pytorch_chat_pr_tp1(config, run_config, common_case_config, ...
  function test_modelscope_restful_chat_tp1 (line 163) | def test_modelscope_restful_chat_tp1(config, run_config, common_case_con...
  function test_pytorch_chat_with_lora_tp1 (line 171) | def test_pytorch_chat_with_lora_tp1(config, run_config, common_case_conf...
  function test_pytorch_chat_with_lora_tp2 (line 178) | def test_pytorch_chat_with_lora_tp2(config, run_config, common_case_conf...
  function test_restful_chat_reasoning_tp1 (line 188) | def test_restful_chat_reasoning_tp1(config, run_config, worker_id):
  function test_restful_chat_reasoning_tp2 (line 198) | def test_restful_chat_reasoning_tp2(config, run_config, worker_id):
  function test_restful_chat_tools_tp1 (line 208) | def test_restful_chat_tools_tp1(config, run_config, worker_id):
  function test_restful_chat_tools_tp2 (line 218) | def test_restful_chat_tools_tp2(config, run_config, worker_id):
  function test_restful_chat_tools_tp4 (line 228) | def test_restful_chat_tools_tp4(config, run_config, worker_id):
  function test_restful_chat_speculative_decoding_tp1 (line 237) | def test_restful_chat_speculative_decoding_tp1(config, run_config, commo...
  function test_restful_chat_speculative_decoding_tp16 (line 247) | def test_restful_chat_speculative_decoding_tp16(shared_ray_manager, conf...

FILE: autotest/tools/restful/test_restful_chat_hf_pytorch_mllm.py
  function test_restful_chat_tp1 (line 11) | def test_restful_chat_tp1(config, run_config, worker_id):
  function test_restful_chat_tp2 (line 17) | def test_restful_chat_tp2(config, run_config, worker_id):
  function test_restful_chat_tp4 (line 23) | def test_restful_chat_tp4(config, run_config, worker_id):
  function test_restful_chat_tp8 (line 29) | def test_restful_chat_tp8(config, run_config, worker_id):
  function test_restful_chat_tp16 (line 35) | def test_restful_chat_tp16(config, run_config, worker_id):

FILE: autotest/tools/restful/test_restful_chat_hf_turbomind_llm.py
  function test_restful_chat_tp1 (line 16) | def test_restful_chat_tp1(config, run_config, common_case_config, worker...
  function test_restful_chat_tp2 (line 23) | def test_restful_chat_tp2(config, run_config, common_case_config, worker...
  function test_restful_chat_tp4 (line 30) | def test_restful_chat_tp4(config, run_config, common_case_config, worker...
  function test_restful_chat_tp8 (line 37) | def test_restful_chat_tp8(config, run_config, common_case_config, worker...
  function test_restful_chat_prefix_cache_tp2 (line 44) | def test_restful_chat_prefix_cache_tp2(config, run_config, common_case_c...
  function test_restful_chat_fallback_backend_tp1 (line 51) | def test_restful_chat_fallback_backend_tp1(config, run_config, common_ca...
  function test_restful_chat_fallback_backend_tp2 (line 59) | def test_restful_chat_fallback_backend_tp2(config, run_config, common_ca...
  function test_restful_chat_pr_tp2 (line 69) | def test_restful_chat_pr_tp2(config, run_config, common_case_config, wor...
  function test_restful_chat_pr_tp1 (line 80) | def test_restful_chat_pr_tp1(config, run_config, common_case_config, wor...
  function test_restful_logprobs (line 90) | def test_restful_logprobs(config, run_config, worker_id):
  function test_modelscope_restful_chat_tp1 (line 98) | def test_modelscope_restful_chat_tp1(config, run_config, common_case_con...
  function test_restful_chat_reasoning_tp1 (line 109) | def test_restful_chat_reasoning_tp1(config, run_config, worker_id):
  function test_restful_chat_reasoning_tp2 (line 119) | def test_restful_chat_reasoning_tp2(config, run_config, worker_id):
  function test_restful_chat_tools_tp1 (line 129) | def test_restful_chat_tools_tp1(config, run_config, worker_id):
  function test_restful_chat_tools_tp2 (line 139) | def test_restful_chat_tools_tp2(config, run_config, worker_id):
  function test_restful_chat_tools_tp4 (line 149) | def test_restful_chat_tools_tp4(config, run_config, worker_id):

FILE: autotest/tools/restful/test_restful_chat_hf_turbomind_mllm.py
  function test_restful_chat_tp1 (line 12) | def test_restful_chat_tp1(config, run_config, worker_id):
  function test_restful_chat_tp2 (line 18) | def test_restful_chat_tp2(config, run_config, worker_id):
  function test_restful_chat_tp4 (line 24) | def test_restful_chat_tp4(config, run_config, worker_id):
  function test_restful_chat_tp8 (line 30) | def test_restful_chat_tp8(config, run_config, worker_id):
  function test_restful_chat_tp16 (line 36) | def test_restful_chat_tp16(config, run_config, worker_id):
  function test_restful_chat_fallback_backend_tp1 (line 43) | def test_restful_chat_fallback_backend_tp1(config, run_config, worker_id):

FILE: autotest/utils/benchmark_utils.py
  function throughput_test (line 11) | def throughput_test(config, run_config, worker_id: str = '', is_smoke: b...
  function longtext_throughput_test (line 56) | def longtext_throughput_test(config, run_config, worker_id: str = ''):
  function restful_test (line 103) | def restful_test(config, run_config, worker_id: str = '', is_smoke: bool...
  function restful_profile (line 133) | def restful_profile(config, run_config, port, is_smoke: bool = False):
  function mllm_restful_profile (line 165) | def mllm_restful_profile(config, run_config, port, is_smoke: bool = False):
  function prefixcache_throughput_test (line 196) | def prefixcache_throughput_test(config, run_config, worker_id: str = '',...
  function get_max_cache_entry (line 257) | def get_max_cache_entry(model, backend):

FILE: autotest/utils/common_utils.py
  function execute_command_with_logging (line 6) | def execute_command_with_logging(cmd,

FILE: autotest/utils/config_utils.py
  function resolve_extra_params (line 15) | def resolve_extra_params(extra_params: dict[str, Any], model_base_path: ...
  function get_func_config_list (line 39) | def get_func_config_list(backend: str,
  function get_cli_common_param (line 134) | def get_cli_common_param(run_config: dict[str, Any]) -> str:
  function get_cli_str (line 169) | def get_cli_str(config: dict[str, Any]) -> str:
  function get_parallel_config (line 188) | def get_parallel_config(config: dict[str, Any], model_name: str) -> list...
  function _extract_models_from_config (line 208) | def _extract_models_from_config(config_value: Any) -> list[str]:
  function get_model_list (line 220) | def get_model_list(config: dict[str, Any],
  function _filter_by_test_func_type (line 259) | def _filter_by_test_func_type(config: dict[str, Any], model_list: list[s...
  function _extend_turbomind_quant_models (line 273) | def _extend_turbomind_quant_models(quant_config: dict[str, Any], base_mo...
  function _extend_pytorch_quant_models (line 288) | def _extend_pytorch_quant_models(quant_config: dict[str, Any], base_mode...
  function _is_kvint_model (line 300) | def _is_kvint_model(config: dict[str, Any], backend: str, model: str, qu...
  function _base_model_name (line 310) | def _base_model_name(model: str) -> str:
  function get_quantization_model_list (line 316) | def get_quantization_model_list(type: str) -> list[str]:
  function get_config (line 348) | def get_config() -> dict[str, Any]:
  function get_cuda_prefix_by_workerid (line 378) | def get_cuda_prefix_by_workerid(worker_id: str | None, parallel_config: ...
  function get_cuda_id_by_workerid (line 395) | def get_cuda_id_by_workerid(worker_id: str | None, tp_num: int = 1) -> s...
  function get_workerid (line 406) | def get_workerid(worker_id: str | None) -> int:
  function is_quantization_model (line 415) | def is_quantization_model(model: str) -> bool:
  function _get_communicator_list (line 421) | def _get_communicator_list(config: dict[str, Any],
  function set_device_env_variable (line 439) | def set_device_env_variable(worker_id: str | None, parallel_config: dict...
  function unset_device_env_variable (line 460) | def unset_device_env_variable():
  function is_model_in_list (line 470) | def is_model_in_list(config: dict[str, Any], parallel_config: dict[str, ...
  function get_case_str_by_config (line 476) | def get_case_str_by_config(run_config: dict[str, Any], is_simple: bool =...
  function parse_config_by_case (line 501) | def parse_config_by_case(case_str: str) -> dict[str, Any]:
  function test_config (line 531) | def test_config():
  function test_get_case_str_by_config (line 574) | def test_get_case_str_by_config():
  function test_cli_common_param (line 596) | def test_cli_common_param():
  function test_return_info_turbomind (line 637) | def test_return_info_turbomind():
  function test_return_info_pytorch (line 741) | def test_return_info_pytorch():
  function test_run_config (line 845) | def test_run_config():
  function test_get_parallel_config (line 880) | def test_get_parallel_config():

FILE: autotest/utils/evaluate_utils.py
  function write_to_summary (line 16) | def write_to_summary(case_name, result, msg, metrics, result_dir):
  function llm_summary (line 67) | def llm_summary(case_name, result, msg, work_dir, result_dir=None):
  function mllm_summary (line 107) | def mllm_summary(case_name,
  function eval_test (line 146) | def eval_test(model_path, eval_path, case_name, port=DEFAULT_PORT, test_...
  function mllm_eval_test (line 268) | def mllm_eval_test(model_path, eval_path, case_name, port=DEFAULT_PORT, ...

FILE: autotest/utils/get_run_config.py
  function get_model_name (line 5) | def get_model_name(model):
  function _simple_model_name (line 51) | def _simple_model_name(model):

FILE: autotest/utils/mp_log_utils.py
  function write_log (line 7) | def write_log(config, result, msg, is_new: bool = True, case_path_tag: s...
  function assert_log (line 22) | def assert_log(config, case_path_tag: str = 'default'):

FILE: autotest/utils/pipeline_chat.py
  function run_pipeline_llm_test (line 13) | def run_pipeline_llm_test(config, run_config, common_case_config, worker...
  function run_pipeline_mllm_test (line 73) | def run_pipeline_mllm_test(config, run_config, worker_id: str = '', is_s...
  function get_response_from_output (line 165) | def get_response_from_output(output_text, case):
  function get_response_from_output_by_prompt (line 169) | def get_response_from_output_by_prompt(output_text, case, prompt):
  function assert_pipeline_single_return (line 178) | def assert_pipeline_single_return(output, logprobs_num: int = 0):
  function assert_pipeline_batch_return (line 186) | def assert_pipeline_batch_return(output, size: int = 1):
  function assert_pipeline_single_stream_return (line 196) | def assert_pipeline_single_stream_return(output, logprobs_num: int = 0):
  function assert_pipeline_batch_stream_return (line 205) | def assert_pipeline_batch_stream_return(output, size: int = 1):
  function assert_pipeline_single_element (line 214) | def assert_pipeline_single_element(output, is_stream: bool = False, is_l...
  function internvl_vl_testcase (line 246) | def internvl_vl_testcase(output_text, file, lang: str = 'en'):
  function MiniCPM_vl_testcase (line 288) | def MiniCPM_vl_testcase(output_text, file):
  function Qwen_vl_testcase (line 315) | def Qwen_vl_testcase(output_text, file):
  function save_pipeline_common_log (line 342) | def save_pipeline_common_log(config, log_name, result, content, msg: str...
  function assert_pipeline_common_log (line 351) | def assert_pipeline_common_log(config, log_name):

FILE: autotest/utils/proxy_distributed_utils.py
  function is_port_open (line 18) | def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
  function check_nodes_status (line 29) | def check_nodes_status(host: str, proxy_port: int, model_name: str, expe...
  function wait_for_model_service_ready (line 79) | def wait_for_model_service_ready(host: str,
  function proxy_worker_node_wait (line 147) | def proxy_worker_node_wait(manager, timeout_minutes: int = 120):
  class ProxyDistributedManager (line 183) | class ProxyDistributedManager:
    method __init__ (line 185) | def __init__(self):
    method start (line 193) | def start(self):
    method cleanup (line 206) | def cleanup(self):
  class ApiServerPerTest (line 216) | class ApiServerPerTest:
    method __init__ (line 218) | def __init__(self, proxy_manager: ProxyDistributedManager, config: dic...
    method start (line 236) | def start(self):
    method wait_until_ready (line 269) | def wait_until_ready(self):
    method cleanup (line 280) | def cleanup(self):

FILE: autotest/utils/quantization_utils.py
  function quantization (line 6) | def quantization(config,

FILE: autotest/utils/ray_distributed_utils.py
  function wait_for_model_service_ready (line 20) | def wait_for_model_service_ready(
  function verify_service_functionality (line 72) | def verify_service_functionality(host: str, api_port: int, model_name: s...
  class RayLMDeployManager (line 102) | class RayLMDeployManager:
    method __init__ (line 104) | def __init__(
    method start_ray_cluster (line 137) | def start_ray_cluster(self):
    method start_lmdeploy_api_server (line 153) | def start_lmdeploy_api_server(self, config: dict[str, Any], run_config...
    method cleanup (line 219) | def cleanup(self, force: bool = True):
    method get_cluster_info (line 255) | def get_cluster_info(self) -> dict[str, Any]:
    method __enter__ (line 266) | def __enter__(self):
    method __exit__ (line 269) | def __exit__(self, exc_type, exc_val, exc_tb):
  function ray_worker_node_wait (line 273) | def ray_worker_node_wait(manager: RayLMDeployManager, timeout_minutes: i...

FILE: autotest/utils/restful_return_check.py
  function assert_chat_completions_batch_return (line 4) | def assert_chat_completions_batch_return(output, model_name, check_logpr...
  function assert_completions_batch_return (line 22) | def assert_completions_batch_return(output, model_name, check_logprobs: ...
  function assert_usage (line 39) | def assert_usage(usage):
  function assert_logprobs (line 46) | def assert_logprobs(logprobs, logprobs_num):
  function assert_logprob_element (line 55) | def assert_logprob_element(logprob):
  function assert_chat_completions_stream_return (line 61) | def assert_chat_completions_stream_return(output,
  function assert_completions_stream_return (line 89) | def assert_completions_stream_return(output,
  function has_repeated_fragment (line 117) | def has_repeated_fragment(text, repeat_count=5):

FILE: autotest/utils/rule_condition_assert.py
  function assert_result (line 1) | def assert_result(input, rule_condition, model_name: str = None):

FILE: autotest/utils/run_client_chat.py
  function run_tests (line 12) | def run_tests(config, usercase, cli_case_config, run_config, worker_id):
  function hf_command_line_test (line 23) | def hf_command_line_test(config, case, case_info, run_config, cuda_prefi...
  function command_test (line 46) | def command_test(config, cmd, run_config, case_info, need_extract_output):
  function parse_dialogue (line 117) | def parse_dialogue(inputs: str):
  function extract_output (line 126) | def extract_output(output: str, model: str):

FILE: autotest/utils/run_restful_chat.py
  function start_openai_service (line 22) | def start_openai_service(config, run_config, worker_id, timeout: int = 1...
  function stop_restful_api (line 96) | def stop_restful_api(pid, startRes):
  function terminate_restful_api (line 104) | def terminate_restful_api(worker_id):
  function run_all_step (line 119) | def run_all_step(log_path, case_name, cases_info, port: int = DEFAULT_PO...
  function open_chat_test (line 137) | def open_chat_test(log_path, case_name, case_info, url):
  function health_check (line 194) | def health_check(url, model_name):
  function get_model (line 210) | def get_model(url):
  function _run_logprobs_test (line 220) | def _run_logprobs_test(port: int = DEFAULT_PORT):
  function run_vl_testcase (line 244) | def run_vl_testcase(log_path, resource_path, port: int = DEFAULT_PORT):
  function _run_reasoning_case (line 297) | def _run_reasoning_case(log_path, port: int = DEFAULT_PORT):
  function test_internlm_multiple_round_prompt (line 342) | def test_internlm_multiple_round_prompt(client, model):
  function test_qwen_multiple_round_prompt (line 443) | def test_qwen_multiple_round_prompt(client, model):
  function _run_tools_case (line 588) | def _run_tools_case(log_path, port: int = DEFAULT_PORT):
  function proxy_health_check (line 691) | def proxy_health_check(url):
  function start_proxy_server (line 704) | def start_proxy_server(log_path, port, case_name: str = 'default'):
  function run_llm_test (line 770) | def run_llm_test(config, run_config, common_case_config, worker_id):
  function run_mllm_test (line 786) | def run_mllm_test(config, run_config, worker_id):
  function run_reasoning_case (line 800) | def run_reasoning_case(config, run_config, worker_id):
  function run_tools_case (line 812) | def run_tools_case(config, run_config, worker_id):
  function run_logprob_test (line 824) | def run_logprob_test(config, run_config, worker_id):

FILE: autotest/utils/toolkit.py
  function parse_sse_stream (line 6) | def parse_sse_stream(content: str) -> list[str]:
  function _load_tokenizer_cached (line 25) | def _load_tokenizer_cached(model_path: str):
  function encode_text (line 33) | def encode_text(model_path: str, text: str) -> list[int]:

FILE: benchmark/benchmark_decode.py
  function benchmark (line 13) | def benchmark(model_path, share_gpt_path, downsample=100, accel=None, sa...

FILE: benchmark/benchmark_pipeline.py
  function get_cmd (line 9) | def get_cmd(model_path, backend, engine_config, data_config):
  function benchmark (line 36) | def benchmark(model_path, backend, engine_config, data_config):
  function main (line 63) | def main(model_path=None, backend=None, config_path=None):

FILE: benchmark/benchmark_serving.py
  function get_launching_server_cmd (line 10) | def get_launching_server_cmd(model_path, backend, server_config):
  function get_output_file (line 31) | def get_output_file(model_path, backend, server_config):
  function get_server_ip_port (line 58) | def get_server_ip_port(backend: str, server_config: Dict) -> Tuple[str, ...
  function wait_server_ready (line 78) | def wait_server_ready(server_ip: str, server_port: int) -> bool:
  function get_client_cmd (line 93) | def get_client_cmd(backend: str, server_ip: str, server_port: int, clien...
  function benchmark (line 115) | def benchmark(model_path: str, backend: str, server_config: Dict, data_c...
  function validate_config (line 169) | def validate_config(config: Dict) -> None:
  function main (line 190) | def main(backend: str, config_path: str, model_path: Optional[str] = None):

FILE: benchmark/benchmark_throughput.py
  function get_cmd (line 9) | def get_cmd(model_path, backend, engine_config, data_config):
  function benchmark (line 36) | def benchmark(model_path, backend, engine_config, data_config):
  function main (line 63) | def main(model_path=None, backend=None, config_path=None):

FILE: benchmark/profile_pipeline_api.py
  function sample_sharegpt_requests (line 20) | def sample_sharegpt_requests(
  function sample_random_requests (line 66) | def sample_random_requests(
  class Engine (line 132) | class Engine:
    method __init__ (line 134) | def __init__(self, model_path: str, engine_config, csv: str):
    method process_request (line 140) | def process_request(self, requests, profiler: Profiler, temperature, t...
  function parse_args (line 199) | def parse_args():
  function main (line 284) | def main():

FILE: benchmark/profile_restful_api.py
  class RequestFuncInput (line 55) | class RequestFuncInput:
  class RequestFuncOutput (line 66) | class RequestFuncOutput:
  function remove_prefix (line 77) | def remove_prefix(text: str, prefix: str) -> str:
  function async_request_trt_llm (line 83) | async def async_request_trt_llm(
  function async_request_openai_completions (line 153) | async def async_request_openai_completions(
  function async_request_openai_chat_completions (line 231) | async def async_request_openai_chat_completions(
  function async_request_sglang_generate (line 339) | async def async_request_sglang_generate(
  function async_request_gserver (line 416) | async def async_request_gserver(
  function get_model (line 423) | def get_model(pretrained_model_name_or_path: str) -> str:
  function get_tokenizer (line 438) | def get_tokenizer(pretrained_model_name_or_path: str, ) -> Union[PreTrai...
  function get_processor (line 449) | def get_processor(pretrained_model_name_or_path: str, ) -> Union[PreTrai...
  class BenchmarkMetrics (line 476) | class BenchmarkMetrics:
  function download_and_cache_file (line 506) | def download_and_cache_file(url: str, filename: Optional[str] = None):
  class DatasetRow (line 541) | class DatasetRow:
    method __post_init__ (line 549) | def __post_init__(self):
  function sample_sharegpt_requests (line 556) | def sample_sharegpt_requests(dataset_path: str,
  function compute_random_lens (line 609) | def compute_random_lens(full_len: int, range_ratio: float, num: int):
  function sample_random_requests (line 617) | def sample_random_requests(
  function parse_image_resolution (line 686) | def parse_image_resolution(image_resolution: str) -> Tuple[int, int]:
  function gen_mm_prompt (line 714) | def gen_mm_prompt(tokenizer, image_pad_id, token_num):
  function create_mm_data_row (line 724) | def create_mm_data_row(text_prompt, images: list, images_base64, output_...
  function sample_image_requests (line 794) | def sample_image_requests(
  function get_request (line 887) | async def get_request(
  function calculate_metrics (line 905) | def calculate_metrics(
  function benchmark (line 980) | async def benchmark(
  function parse_request_rate_range (line 1161) | def parse_request_rate_range(request_rate_range):
  function check_chat_template (line 1169) | def check_chat_template(model_path):
  function run_benchmark (line 1178) | def run_benchmark(args_: argparse.Namespace):
  function set_ulimit (line 1330) | def set_ulimit(target_soft_limit=65535):

FILE: benchmark/profile_throughput.py
  function sample_sharegpt_requests (line 24) | def sample_sharegpt_requests(
  function sample_random_requests (line 69) | def sample_random_requests(
  class Engine (line 135) | class Engine:
    method __init__ (line 137) | def __init__(self, model_path: str, engine_config: Union[PytorchEngine...
    method _inference (line 151) | async def _inference(self, req_queue: Queue, session_id: int, temperat...
    method process_request (line 199) | def process_request(self, requests, profiler: Profiler, concurrency, t...
  function parse_args (line 237) | def parse_args():
  function main (line 337) | def main():

FILE: docs/en/conf.py
  function metrics (line 62) | def metrics():

FILE: docs/zh_cn/conf.py
  function metrics (line 62) | def metrics():

FILE: eval/eval.py
  class ProcessManager (line 9) | class ProcessManager:
    method __init__ (line 12) | def __init__(self):
    method __enter__ (line 16) | def __enter__(self):
    method __exit__ (line 27) | def __exit__(self, exc_type, exc_val, exc_tb):
    method _signal_handler (line 33) | def _signal_handler(self, sig, frame):
    method start_process (line 40) | def start_process(self, cmd):
    method cleanup (line 44) | def cleanup(self):
  function read_config (line 58) | def read_config():
  function update_datasets (line 80) | def update_datasets(config, datasets):
  function get_model_name_from_server (line 118) | def get_model_name_from_server(server: str, tag: str) -> str:
  function save_config (line 128) | def save_config(work_dir: str, config: str):
  function perform_evaluation (line 144) | def perform_evaluation(config, api_server, judger_server, mode, work_dir...
  function main (line 195) | def main():

FILE: examples/lite/qwen3_30b_a3b_awq.py
  function parse_args (line 9) | def parse_args():
  function main (line 25) | def main():

FILE: examples/lite/qwen3_30b_a3b_gptq.py
  function parse_args (line 9) | def parse_args():
  function main (line 25) | def main():

FILE: lmdeploy/api.py
  function pipeline (line 15) | def pipeline(model_path: str,
  function serve (line 78) | def serve(model_path: str,
  function client (line 101) | def client(api_server_url: str = 'http://0.0.0.0:23333', api_key: str | ...

FILE: lmdeploy/archs.py
  function autoget_backend (line 13) | def autoget_backend(model_path: str) -> Literal['turbomind', 'pytorch']:
  function autoget_backend_config (line 58) | def autoget_backend_config(
  function check_vl_llm (line 96) | def check_vl_llm(backend: str, config: dict) -> bool:
  function get_task (line 131) | def get_task(backend: str, model_path: str):
  function get_model_arch (line 147) | def get_model_arch(model_path: str):
  function search_nested_config (line 176) | def search_nested_config(config, key):

FILE: lmdeploy/cli/chat.py
  function input_prompt (line 10) | def input_prompt():
  function build_pipe (line 17) | def build_pipe(model_path, backend, **kwargs):
  function build_gen_config (line 55) | def build_gen_config(**kwargs):
  function get_adapter_name (line 63) | def get_adapter_name(adapters=None, **kwargs):
  function main (line 71) | def main(model_path, backend, **kwargs):

FILE: lmdeploy/cli/cli.py
  class CLI (line 10) | class CLI(object):
    method add_parser_chat (line 18) | def add_parser_chat():
    method add_parser_checkenv (line 78) | def add_parser_checkenv():
    method check_env (line 93) | def check_env(args):
    method chat (line 157) | def chat(args):
    method add_parsers (line 169) | def add_parsers():

FILE: lmdeploy/cli/entrypoint.py
  function run (line 10) | def run():

FILE: lmdeploy/cli/lite.py
  class SubCliLite (line 6) | class SubCliLite(object):
    method add_parser_auto_awq (line 18) | def add_parser_auto_awq():
    method add_parser_auto_gptq (line 44) | def add_parser_auto_gptq():
    method add_parser_calibrate (line 66) | def add_parser_calibrate():
    method add_parser_smooth_quant (line 83) | def add_parser_smooth_quant():
    method auto_awq (line 107) | def auto_awq(args):
    method auto_gptq (line 114) | def auto_gptq(args):
    method calibrate (line 121) | def calibrate(args):
    method smooth_quant (line 128) | def smooth_quant(args):
    method add_parsers (line 135) | def add_parsers():

FILE: lmdeploy/cli/serve.py
  class SubCliServe (line 10) | class SubCliServe:
    method add_parser_api_server (line 22) | def add_parser_api_server():
    method add_parser_proxy (line 161) | def add_parser_proxy():
    method api_server (line 201) | def api_server(args):
    method proxy (line 337) | def proxy(args):
    method add_parsers (line 344) | def add_parsers():

FILE: lmdeploy/cli/utils.py
  class DefaultsAndTypesHelpFormatter (line 15) | class DefaultsAndTypesHelpFormatter(argparse.HelpFormatter):
    method _get_help_string (line 18) | def _get_help_string(self, action):
  function convert_args (line 35) | def convert_args(args):
  function get_lora_adapters (line 42) | def get_lora_adapters(adapters: List[str]):
  function get_chat_template (line 71) | def get_chat_template(chat_template: str, model_path: str = None):
  function get_speculative_config (line 102) | def get_speculative_config(args):
  class ArgumentHelper (line 115) | class ArgumentHelper:
    method model_name (line 119) | def model_name(parser):
    method dtype (line 130) | def dtype(parser, default: str = 'auto'):
    method quant_dtype (line 142) | def quant_dtype(parser, default: str = 'int8'):
    method model_format (line 151) | def model_format(parser, default: str = None):
    method revision (line 161) | def revision(parser, default: str = None):
    method download_dir (line 170) | def download_dir(parser, default: str = None):
    method tp (line 178) | def tp(parser):
    method dp (line 187) | def dp(parser):
    method ep (line 196) | def ep(parser):
    method cp (line 205) | def cp(parser):
    method dp_rank (line 215) | def dp_rank(parser):
    method node_rank (line 224) | def node_rank(parser):
    method num_nodes (line 230) | def num_nodes(parser):
    method dist_init_addr (line 236) | def dist_init_addr(parser):
    method session_id (line 242) | def session_id(parser):
    method session_len (line 248) | def session_len(parser, default: int = None):
    method max_batch_size (line 255) | def max_batch_size(parser):
    method quant_policy (line 265) | def quant_policy(parser, default: int = 0):
    method rope_scaling_factor (line 275) | def rope_scaling_factor(parser):
    method hf_overrides (line 281) | def hf_overrides(parser):
    method use_logn_attn (line 289) | def use_logn_attn(parser):
    method block_size (line 298) | def block_size(parser):
    method top_p (line 304) | def top_p(parser):
    method top_k (line 316) | def top_k(parser):
    method temperature (line 327) | def temperature(parser, default: float = 0.8):
    method repetition_penalty (line 331) | def repetition_penalty(parser):
    method log_level (line 340) | def log_level(parser):
    method api_keys (line 351) | def api_keys(parser):
    method ssl (line 361) | def ssl(parser):
    method backend (line 372) | def backend(parser):
    method stream_output (line 382) | def stream_output(parser):
    method calib_dataset (line 388) | def calib_dataset(parser):
    method calib_samples (line 399) | def calib_samples(parser):
    method calib_seqlen (line 408) | def calib_seqlen(parser):
    method calib_batchsize (line 414) | def calib_batchsize(parser):
    method calib_search_scale (line 426) | def calib_search_scale(parser):
    method device (line 438) | def device(parser, default: str = 'cuda', choices: List[str] = ['cuda'...
    method chat_template (line 448) | def chat_template(parser):
    method reasoning_parser (line 461) | def reasoning_parser(parser):
    method tool_call_parser (line 472) | def tool_call_parser(parser):
    method allow_terminate_by_client (line 483) | def allow_terminate_by_client(parser):
    method enable_abort_handling (line 492) | def enable_abort_handling(parser):
    method cache_max_entry_count (line 502) | def cache_max_entry_count(parser):
    method adapters (line 512) | def adapters(parser):
    method work_dir (line 525) | def work_dir(parser):
    method cache_block_seq_len (line 534) | def cache_block_seq_len(parser):
    method enable_prefix_caching (line 548) | def enable_prefix_caching(parser):
    method num_tokens_per_iter (line 557) | def num_tokens_per_iter(parser):
    method max_prefill_iters (line 564) | def max_prefill_iters(parser):
    method async_ (line 571) | def async_(parser):
    method max_prefill_token_num (line 581) | def max_prefill_token_num(parser):
    method vision_max_batch_size (line 588) | def vision_max_batch_size(parser):
    method max_log_len (line 592) | def max_log_len(parser):
    method disable_fastapi_docs (line 600) | def disable_fastapi_docs(parser):
    method eager_mode (line 608) | def eager_mode(parser):
    method communicator (line 618) | def communicator(parser):
    method enable_microbatch (line 627) | def enable_microbatch(parser):
    method enable_eplb (line 635) | def enable_eplb(parser):
    method disable_metrics (line 641) | def disable_metrics(parser):
    method role (line 650) | def role(parser):
    method migration_backend (line 660) | def migration_backend(parser):
    method disable_vision_encoder (line 668) | def disable_vision_encoder(parser):
    method logprobs_mode (line 676) | def logprobs_mode(parser):
    method dllm_block_length (line 685) | def dllm_block_length(parser):
    method dllm_unmasking_strategy (line 690) | def dllm_unmasking_strategy(parser):
    method dllm_denoising_steps (line 699) | def dllm_denoising_steps(parser):
    method dllm_confidence_threshold (line 707) | def dllm_confidence_threshold(parser):
    method enable_return_routed_experts (line 715) | def enable_return_routed_experts(parser):
    method add_spec_group (line 724) | def add_spec_group(parser):
    method distributed_executor_backend (line 745) | def distributed_executor_backend(parser):
  class FlexibleArgumentParser (line 755) | class FlexibleArgumentParser(argparse.ArgumentParser):
    method parse_args (line 758) | def parse_args(self, args=None, namespace=None):

FILE: lmdeploy/lite/apis/auto_awq.py
  function save_vl_model (line 18) | def save_vl_model(vl_model, model_path, dst_path):
  function auto_awq (line 41) | def auto_awq(model: str,

FILE: lmdeploy/lite/apis/calibrate.py
  function _prepare_for_calibrate (line 78) | def _prepare_for_calibrate(model: nn.Module,
  function make_compatible_internvl_config (line 149) | def make_compatible_internvl_config(model_path):
  function update_moe_mapping (line 166) | def update_moe_mapping(model, model_type):
  function calibrate (line 198) | def calibrate(model: str,

FILE: lmdeploy/lite/apis/get_small_sharded_hf.py
  function parse_args (line 12) | def parse_args():
  function main (line 20) | def main():

FILE: lmdeploy/lite/apis/gptq.py
  function auto_gptq (line 11) | def auto_gptq(model: str,

FILE: lmdeploy/lite/apis/smooth_quant.py
  function smooth_quant (line 17) | def smooth_quant(model: str,

FILE: lmdeploy/lite/modeling/internlm2_gptq.py
  class InternLM2GPTQForCausalLM (line 5) | class InternLM2GPTQForCausalLM(BaseGPTQForCausalLM):

FILE: lmdeploy/lite/modeling/internlm3_gptq.py
  class InternLM3GPTQForCausalLM (line 5) | class InternLM3GPTQForCausalLM(BaseGPTQForCausalLM):

FILE: lmdeploy/lite/quantization/activation/observer.py
  class KVCacheObserver (line 8) | class KVCacheObserver(GlobalAvailMixin):
    method __init__ (line 12) | def __init__(self, num_head: int, head_dim: int) -> None:
    method observe (line 26) | def observe(self, x: torch.Tensor) -> None:
  class ActivationObserver (line 53) | class ActivationObserver(GlobalAvailMixin):
    method __init__ (line 61) | def __init__(self, dim: int) -> None:
    method disable (line 79) | def disable(cls):
    method enable (line 84) | def enable(cls):
    method observe (line 89) | def observe(self, x: torch.Tensor, save_input: bool = False) -> None:
    method save_ratio (line 127) | def save_ratio(self, ratio: float) -> None:

FILE: lmdeploy/lite/quantization/awq.py
  function skipped_module (line 128) | def skipped_module(name: str):
  function get_weight_scale (line 137) | def get_weight_scale(weight, q_group_size=-1):
  function smooth_ln_fcs (line 153) | def smooth_ln_fcs(ln: torch.nn.Module,
  function smooth_fc_fcs (line 206) | def smooth_fc_fcs(pre_fc: torch.nn.Module,
  function check_awq_supported (line 269) | def check_awq_supported(layer_type):
  function quant_weights (line 296) | def quant_weights(model, fcs, bits, symmetry, group_size=-1, device='cud...
  function smooth_layers (line 323) | def smooth_layers(layers, fc2fcs, norm2fcs, a_scales, group_size=-1, dev...
  function pseudo_quantize_tensor (line 351) | def pseudo_quantize_tensor(w, w_bit=8, w_group_size=-1, return_scale_zer...
  function awq_layers (line 380) | def awq_layers(layers, fc2fcs, norm2fcs, a_scales, a_ratios=None, group_...

FILE: lmdeploy/lite/quantization/calibration.py
  class CalibrationContext (line 16) | class CalibrationContext():
    method __init__ (line 30) | def __init__(self,
    method _guess_num_heads (line 81) | def _guess_num_heads(self, model):
    method _init_input_observers (line 92) | def _init_input_observers(self, name2mod):
    method _init_output_observers (line 98) | def _init_output_observers(self, name2mod):
    method _insert_input_observers (line 104) | def _insert_input_observers(self):
    method _insert_output_observers (line 121) | def _insert_output_observers(self):
    method _wrap_decoder_layers (line 138) | def _wrap_decoder_layers(self):
    method collect_inputs_stats (line 168) | def collect_inputs_stats(self):
    method collect_outputs_stats (line 183) | def collect_outputs_stats(self):
    method export (line 199) | def export(self, out_dir):
    method calibrate (line 216) | def calibrate(self, data):
    method __enter__ (line 227) | def __enter__(self):
    method __exit__ (line 241) | def __exit__(self, exc_type, exc_value, traceback):
  function auto_scale_block (line 253) | def auto_scale_block(module, module_kwargs, w_bit, w_group_size, input_f...
  class CalibrationContextV2 (line 337) | class CalibrationContextV2(CalibrationContext):
    method __init__ (line 339) | def __init__(self,
    method _insert_input_observers (line 355) | def _insert_input_observers(self):
    method export (line 372) | def export(self, out_dir):
    method _wrap_decoder_layers_for_search (line 399) | def _wrap_decoder_layers_for_search(self):
    method __enter__ (line 441) | def __enter__(self):

FILE: lmdeploy/lite/quantization/modules/linear.py
  class WeightOnlyQLinear (line 15) | class WeightOnlyQLinear(nn.Module):
    method __init__ (line 28) | def __init__(
    method from_linear (line 74) | def from_linear(cls: Type['WeightOnlyQLinear'],
    method forward (line 141) | def forward(self, x):

FILE: lmdeploy/lite/quantization/weight/quant_utils.py
  function _aligned_size (line 7) | def _aligned_size(a, b):
  function fast_log2_ceil_torch (line 11) | def fast_log2_ceil_torch(x: torch.Tensor) -> torch.Tensor:
  function fast_pow2_torch (line 21) | def fast_pow2_torch(x: torch.Tensor) -> torch.Tensor:
  function fast_round_scale_torch (line 26) | def fast_round_scale_torch(amax: torch.Tensor, fp8_max: torch.Tensor) ->...
  function _get_quant_scaling (line 30) | def _get_quant_scaling(weight: torch.Tensor,
  function quant_blocked_fp8 (line 47) | def quant_blocked_fp8(weight: torch.Tensor,

FILE: lmdeploy/lite/quantization/weight/quantizer.py
  class WeightQuantizer (line 13) | class WeightQuantizer(GlobalAvailMixin):
    method __init__ (line 59) | def __init__(self, bits: int, symmetry: bool, granularity: str, group_...
    method calculate_qparams (line 81) | def calculate_qparams(self, weight: torch.Tensor) -> QParams:
    method quant (line 98) | def quant(self, weight: torch.Tensor, qparams: Optional[QParams] = Non...

FILE: lmdeploy/lite/utils/batch_split.py
  function split_decoder_layer_inputs (line 7) | def split_decoder_layer_inputs(batch_size, *args: Union[torch.Tensor, Any],
  function concat_decoder_layer_outputs (line 61) | def concat_decoder_layer_outputs(batch_outputs: List[Any]) -> Any:

FILE: lmdeploy/lite/utils/cal_qparams.py
  class QParams (line 7) | class QParams(NamedTuple):
  function precise_round (line 15) | def precise_round(x):
  function cal_qparams_per_channel_absmax (line 20) | def cal_qparams_per_channel_absmax(w: torch.Tensor, n_bits: int, return_...
  function cal_qparams_per_channel_minmax (line 36) | def cal_qparams_per_channel_minmax(w: torch.Tensor, n_bits: int, return_...
  function cal_qparams_per_group_absmax (line 58) | def cal_qparams_per_group_absmax(w: torch.Tensor, n_bits: int, group_siz...
  function cal_qparams_per_group_minmax (line 79) | def cal_qparams_per_group_minmax(w: torch.Tensor, n_bits: int, group_siz...
  function cal_qparams_per_tensor_minmax (line 105) | def cal_qparams_per_tensor_minmax(w: torch.Tensor, n_bits: int, return_s...
  function cal_qparams_per_tensor_absmax (line 125) | def cal_qparams_per_tensor_absmax(w: torch.Tensor, n_bits: int, return_s...

FILE: lmdeploy/lite/utils/calib_dataloader.py
  function set_seed (line 8) | def set_seed(seed):
  function process_dataset (line 14) | def process_dataset(ds, tokenizer, max_seq_length):
  function get_wikitext2 (line 102) | def get_wikitext2(dataset, tokenizer, nsamples, seed, seqlen):
  function get_c4 (line 128) | def get_c4(dataset, tokenizer, nsamples, seed, seqlen):
  function get_pileval (line 158) | def get_pileval(dataset, tokenizer, nsamples, seed, seqlen=512):
  function get_gsm8k (line 211) | def get_gsm8k(dataset, tokenizer, nsamples, seed, seqlen):
  function get_neuralmagic_calibration (line 250) | def get_neuralmagic_calibration(dataset, tokenizer, nsamples, seed, seql...
  function get_open_platypus (line 289) | def get_open_platypus(dataset, tokenizer, nsamples, seed, seqlen):
  function get_openwebtext (line 328) | def get_openwebtext(dataset, tokenizer, nsamples, seed, seqlen):
  function get_calib_loaders (line 362) | def get_calib_loaders(name, tokenizer, nsamples=128, seed=0, seqlen=2048):

FILE: lmdeploy/lite/utils/collect.py
  function collect_target_modules (line 7) | def collect_target_modules(model: nn.Module,
  function collect_target_weights (line 41) | def collect_target_weights(model: nn.Module, target: Union[str, type], s...
  function bimap_name_mod (line 64) | def bimap_name_mod(name2mod_mappings: List[Dict[str, nn.Module]]) -> Tup...

FILE: lmdeploy/lite/utils/global_avail.py
  class GlobalAvailMixin (line 7) | class GlobalAvailMixin:
    method global_available (line 12) | def global_available(self, key: Union[str, nn.Module] = 'default', gro...
    method _save_instance (line 24) | def _save_instance(cls,
    method find (line 44) | def find(cls, key: Union[str, nn.Module] = 'default', group: str = 'de...
    method find_group (line 60) | def find_group(cls, group: str) -> Dict[Union[str, nn.Module], 'Global...
    method instances (line 73) | def instances(cls) -> Dict[str, Dict[Union[str, nn.Module], 'GlobalAva...

FILE: lmdeploy/lite/utils/load.py
  class LoadNoInit (line 9) | class LoadNoInit:
    method __init__ (line 12) | def __init__(self):
    method __enter__ (line 22) | def __enter__(self, *args, **kwargs):
    method __exit__ (line 34) | def __exit__(self, *args, **kwargs):
  function load_hf_from_pretrained (line 47) | def load_hf_from_pretrained(pretrained_model_name_or_path, dtype: Litera...

FILE: lmdeploy/lite/utils/memory_efficient.py
  function extract_return_values (line 15) | def extract_return_values(module: nn.Module) -> List[str]:
  function find_kv_cache_idx (line 36) | def find_kv_cache_idx(module: nn.Module) -> int:
  function find_modules_by_return_value (line 46) | def find_modules_by_return_value(model: nn.Module, value: str) -> List[n...
  function offload_kv_cache (line 79) | def offload_kv_cache(model: nn.Module, device: str = 'cuda') -> None:
  function offload_weights (line 141) | def offload_weights(model: nn.Module, device: str = 'cuda') -> None:
  function memory_efficient_inference (line 198) | def memory_efficient_inference(model: nn.Module, offload: bool = True, d...

FILE: lmdeploy/logger.py
  class RequestLogger (line 11) | class RequestLogger:
    method __init__ (line 20) | def __init__(self, max_log_len: Optional[int]) -> None:
    method log_prompt (line 23) | def log_prompt(self, session_id: int, prompt: str) -> None:
    method log_inputs (line 34) | def log_inputs(self, session_id: int, prompt: Optional[str], prompt_to...

FILE: lmdeploy/messages.py
  class GenerationConfig (line 25) | class GenerationConfig:
    method convert_stop_bad_words_to_ids (line 138) | def convert_stop_bad_words_to_ids(self, tokenizer: Tokenizer):
    method update_from_hf_gen_cfg (line 160) | def update_from_hf_gen_cfg(self, generation_config, tokenizer_eos_toke...
    method __post_init__ (line 179) | def __post_init__(self):
  class TurbomindEngineConfig (line 190) | class TurbomindEngineConfig:
    method __post_init__ (line 290) | def __post_init__(self):
  class PytorchEngineConfig (line 304) | class PytorchEngineConfig:
    method __post_init__ (line 425) | def __post_init__(self):
  class ResponseType (line 450) | class ResponseType(enum.Enum):
  class Response (line 467) | class Response:
    method __str__ (line 499) | def __str__(self):
    method __repr__ (line 502) | def __repr__(self):
    method _format_none_text_fields (line 505) | def _format_none_text_fields(self):
    method extend (line 529) | def extend(self, other: 'Response') -> 'Response':
  class EventType (line 557) | class EventType(enum.IntEnum):
  class EngineEvent (line 572) | class EngineEvent:
    method new_event (line 583) | def new_event(cls, event_type: EventType, timestamp: Optional[float] =...
  class ScheduleMetrics (line 591) | class ScheduleMetrics:
  class RequestMetrics (line 602) | class RequestMetrics:
  class EngineOutput (line 615) | class EngineOutput:
  class VisionConfig (line 638) | class VisionConfig:
  class SpeculativeConfig (line 654) | class SpeculativeConfig:

FILE: lmdeploy/metrics/loggers.py
  class StatLoggerBase (line 17) | class StatLoggerBase(ABC):
    method record_schedule (line 20) | def record_schedule(self, stats: SchedulerStats) -> None:
    method record_iteration (line 24) | def record_iteration(self, stats: IterationStats) -> None:
    method record_specdecode (line 28) | def record_specdecode(self, stats: SpeculativeDecodingStats) -> None:
    method log (line 31) | def log(self):  # noqa
  class LoggingStatLogger (line 35) | class LoggingStatLogger(StatLoggerBase):
    method __init__ (line 37) | def __init__(self, dp_rank: int = 0):
    method _reset (line 42) | def _reset(self, now):
    method record_schedule (line 52) | def record_schedule(self, stats: SchedulerStats):
    method record_iteration (line 55) | def record_iteration(self, stats: IterationStats):
    method record_specdecode (line 62) | def record_specdecode(self, stats: SpeculativeDecodingStats):
    method record_finish (line 73) | def record_finish(self, stats: RequestStats):
    method get_spec_msg (line 76) | def get_spec_msg(self):
    method log (line 98) | def log(self):
  class PrometheusStatLogger (line 133) | class PrometheusStatLogger(StatLoggerBase):
    method __init__ (line 135) | def __init__(self, model_name: str, max_model_len: int, dp_rank: int =...
    method record_schedule (line 309) | def record_schedule(self, stats: SchedulerStats) -> None:
    method record_iteration (line 319) | def record_iteration(self, stats: IterationStats) -> None:
    method record_finish (line 335) | def record_finish(self, stats: RequestStats) -> None:
    method record_specdecode (line 345) | def record_specdecode(self, stats: SpeculativeDecodingStats) -> None:
  function build_buckets (line 349) | def build_buckets(mantissa_lst: List[int], max_value: int) -> List[int]:
  function build_1_2_5_buckets (line 364) | def build_1_2_5_buckets(max_value: int) -> List[int]:

FILE: lmdeploy/metrics/metrics_processor.py
  class MetricsProcessor (line 14) | class MetricsProcessor():
    method __init__ (line 17) | def __init__(self):
    method start_metrics_handler (line 25) | def start_metrics_handler(self, enable_metrics: bool):
    method stop_metrics_handler (line 33) | async def stop_metrics_handler(self):
    method _run_metrics_handler (line 45) | async def _run_metrics_handler(self):
    method update_schedule_stats (line 83) | async def update_schedule_stats(self, schedule_metrics: ScheduleMetrics):
    method queue_update (line 90) | def queue_update(self, update_data: tuple):
    method increase_total_requests (line 96) | def increase_total_requests(self):
    method increase_completed_requests (line 100) | def increase_completed_requests(self):
    method increase_api_routed_requests (line 104) | def increase_api_routed_requests(self):
    method decrease_api_routed_requests (line 108) | def decrease_api_routed_requests(self):

FILE: lmdeploy/metrics/stats.py
  class SchedulerStats (line 14) | class SchedulerStats:
    method __repr__ (line 44) | def __repr__(self):
    method update_from_schedule_metrics (line 56) | def update_from_schedule_metrics(self, scheduled_metrics: ScheduleMetr...
  class RequestStats (line 63) | class RequestStats:
    method __init__ (line 66) | def __init__(self, arrival_time: float = None, prompt_tokens: int = 0):
    method __repr__ (line 100) | def __repr__(self):
    method update_from_events (line 111) | def update_from_events(self, engine_events: List[EngineEvent]):
    method e2e_latency (line 126) | def e2e_latency(self) -> float:
    method queued_time_interval (line 131) | def queued_time_interval(self) -> float:
    method prefill_time_interval (line 136) | def prefill_time_interval(self) -> float:
    method decode_time_interval (line 144) | def decode_time_interval(self) -> float:
    method inference_time_interval (line 152) | def inference_time_interval(self) -> float:
  class IterationStats (line 160) | class IterationStats:
    method __init__ (line 163) | def __init__(self):
    method __repr__ (line 181) | def __repr__(self):
    method _time_since (line 191) | def _time_since(self, start: float) -> float:
    method update_from_output (line 195) | def update_from_output(self, outputs: EngineOutput, req_stats: Request...
  class SpeculativeDecodingStats (line 231) | class SpeculativeDecodingStats:
    method __post_init__ (line 240) | def __post_init__(self):
    method update_from_output (line 244) | def update_from_output(self, outputs: EngineOutput):
    method update_per_draft (line 253) | def update_per_draft(self, num_draft_tokens: int, num_accepted_tokens:...
    method __repr__ (line 261) | def __repr__(self):

FILE: lmdeploy/model.py
  function random_uuid (line 16) | def random_uuid() -> str:
  function get_text (line 21) | def get_text(content: Union[str, List[dict]]):
  class ChatTemplateConfig (line 35) | class ChatTemplateConfig:
    method chat_template (line 69) | def chat_template(self):
    method to_json (line 80) | def to_json(self, file_path=None):
    method from_json (line 90) | def from_json(cls, file_or_string):
  class BaseChatTemplate (line 111) | class BaseChatTemplate:
    method __init__ (line 114) | def __init__(self,
    method get_prompt (line 141) | def get_prompt(self, prompt, sequence_start=True):
    method messages2prompt (line 167) | def messages2prompt(self, messages, sequence_start=True, **kwargs):
    method match (line 194) | def match(cls, model_path: str) -> Optional[str]:
  class CogVLM (line 204) | class CogVLM(BaseChatTemplate):
    method __init__ (line 207) | def __init__(self,
    method match (line 228) | def match(cls, model_path: str) -> Optional[str]:
  class Vicuna (line 240) | class Vicuna(BaseChatTemplate):
    method __init__ (line 243) | def __init__(
    method get_prompt (line 262) | def get_prompt(self, prompt, sequence_start=True):
    method messages2prompt (line 267) | def messages2prompt(self, messages, sequence_start=True, **kwargs):
    method match (line 273) | def match(cls, model_path: str) -> Optional[str]:
  class Llavav1 (line 287) | class Llavav1(Vicuna):
    method __init__ (line 290) | def __init__(
    method match (line 297) | def match(cls, model_path: str) -> Optional[str]:
  class InternLMChat7B (line 312) | class InternLMChat7B(BaseChatTemplate):
    method __init__ (line 315) | def __init__(
    method match (line 342) | def match(cls, model_path: str) -> Optional[str]:
  class Baichuan2 (line 355) | class Baichuan2(BaseChatTemplate):
    method __init__ (line 359) | def __init__(self, user='<reserved_106>', assistant='<reserved_107>', ...
    method match (line 363) | def match(cls, model_path: str) -> Optional[str]:
  class Llama2 (line 375) | class Llama2(BaseChatTemplate):
    method __init__ (line 378) | def __init__(
    method match (line 401) | def match(cls, model_path: str) -> Optional[str]:
  class CodeLlama (line 412) | class CodeLlama(Llama2):
    method __init__ (line 414) | def __init__(self, meta_instruction='', suffix_first=False, stop_words...
    method get_prompt (line 427) | def get_prompt(self, prompt, sequence_start=True):
    method _infill_prompt (line 435) | def _infill_prompt(self, prompt):
    method match (line 446) | def match(cls, model_path: str) -> Optional[str]:
  class ChatGLM2 (line 457) | class ChatGLM2(BaseChatTemplate):
    method __init__ (line 459) | def __init__(self, user='问:', eoh='\n\n', assistant='答:', eoa='\n\n', ...
    method get_prompt (line 467) | def get_prompt(self, prompt, sequence_start=True):
    method messages2prompt (line 478) | def messages2prompt(self, messages, sequence_start=True, **kwargs):
    method match (line 497) | def match(cls, model_path: str) -> Optional[str]:
  class MistralChat (line 509) | class MistralChat(BaseChatTemplate):
    method __init__ (line 516) | def __init__(self, user='[INST] ', eoh=' [/INST]', eoa='</s>', **kwargs):
    method match (line 520) | def match(cls, model_path: str) -> Optional[str]:
  class InternVLZH (line 535) | class InternVLZH(BaseChatTemplate):
    method __init__ (line 537) | def __init__(self, user='<human>: ', eoh=' ', assistant='<bot>: ', eoa...
    method get_prompt (line 540) | def get_prompt(self, prompt, sequence_start=True):
    method messages2prompt (line 545) | def messages2prompt(self, messages, sequence_start=True, **kwargs):
    method match (line 551) | def match(cls, model_path: str) -> Optional[str]:
  class DeepseekVL (line 563) | class DeepseekVL(BaseChatTemplate):
    method __init__ (line 565) | def __init__(
    method get_prompt (line 582) | def get_prompt(self, prompt, sequence_start=True):
    method messages2prompt (line 587) | def messages2prompt(self, messages, sequence_start=True, **kwargs):
    method match (line 593) | def match(cls, model_path: str) -> Optional[str]:
  class DeepseekVL2 (line 605) | class DeepseekVL2(BaseChatTemplate):
    method __init__ (line 607) | def __init__(self,
    method get_prompt (line 623) | def get_prompt(self, prompt, sequence_start=True):
    method messages2prompt (line 626) | def messages2prompt(self, messages, sequence_start=True, **kwargs):
    method match (line 632) | def match(cls, model_path: str) -> Optional[str]:
  class ChatmlDirect (line 644) | class ChatmlDirect(BaseChatTemplate):
    method __init__ (line 646) | def __init__(self,
    method match (line 667) | def match(cls, model_path: str) -> Optional[str]:
  class HFChatTemplate (line 679) | class HFChatTemplate(BaseChatTemplate):
    method __init__ (line 685) | def __init__(self, model_path: str = '', **kwargs):
    method get_prompt (line 706) | def get_prompt(self, prompt, sequence_start=True, **kwargs):
    method messages2prompt (line 710) | def messages2prompt(self, messages, sequence_start=True, **kwargs):
    method _user_instruction (line 745) | def _user_instruction(self):
    method _assistant_instruction (line 756) | def _assistant_instruction(self):
    method _system_instruction (line 773) | def _system_instruction(self):
    method match (line 790) | def match(cls, model_path: str) -> Optional[str]:
  function get_chat_template (line 798) | def get_chat_template(model_path: str, config: Optional[ChatTemplateConf...

FILE: lmdeploy/pipeline.py
  class Pipeline (line 30) | class Pipeline:
    method __init__ (line 33) | def __init__(self,
    method infer (line 83) | def infer(self,
    method batch_infer (line 125) | def batch_infer(self, *args, **kwargs):
    method stream_infer (line 128) | def stream_infer(self,
    method close (line 164) | def close(self):
    method chat (line 169) | def chat(self,
    method session (line 230) | def session(self) -> 'Session':
    method get_reward_score (line 234) | def get_reward_score(self, input_ids: List) -> List[float]:
    method get_ppl (line 256) | def get_ppl(self, input_ids: List[int] | List[List[int]]) -> List[float]:
    method __call__ (line 306) | def __call__(self,
    method __enter__ (line 312) | def __enter__(self):
    method __exit__ (line 315) | def __exit__(self, exc_type, exc_value, traceback):
    method generate (line 319) | async def generate(self, *args, **kwargs):
    method _is_single (line 328) | def _is_single(prompts):
    method _request_generator (line 333) | def _request_generator(self,
    method _get_limiter (line 370) | def _get_limiter(self):
    method _infer (line 375) | def _infer(self, requests: Iterator[Dict], multiplex: bool, pbar=None,...
    method _run (line 413) | def _run(self, fn=None, coro=None):
    method _batch_iterator (line 424) | def _batch_iterator(self, sizes, max_value):
    method _get_long_text_ppl (line 446) | def _get_long_text_ppl(self, session, input_ids, max_input_len):
    method _get_ppl (line 472) | def _get_ppl(self,
  class _EventLoopThread (line 523) | class _EventLoopThread:
    method __init__ (line 525) | def __init__(self, daemon=False):
    method _thread_entry (line 534) | def _thread_entry(self, fut):
    method _cancel_all_tasks (line 550) | def _cancel_all_tasks(self):
    method close (line 574) | def close(self):

FILE: lmdeploy/profiler.py
  class Session (line 10) | class Session:
    method __init__ (line 16) | def __init__(self, input_len, req_output_len):
    method tick (line 23) | def tick(self, n_token):
    method finish (line 27) | def finish(self, status):
  class Profiler (line 31) | class Profiler:
    method __init__ (line 33) | def __init__(self, stream_output: bool, percentages: List[int]):
    method new_session (line 38) | def new_session(self, *args, **kwargs):
    method start (line 43) | def start(self):
    method finish (line 46) | def finish(self):
    method compute_metrics (line 49) | def compute_metrics(self):
    method summarize (line 106) | def summarize(self, title: str, hyperparams: List = None, header=40, d...
    method save_csv (line 140) | def save_csv(self, csv_file: str, hyperparams):

FILE: lmdeploy/pytorch/adapter/adapter.py
  function get_ranks_and_scalings (line 10) | def get_ranks_and_scalings(target_name: str, cfgs: Iterable, device: tor...
  function find_all_target (line 26) | def find_all_target(model: torch.nn.Module, target_name: str):
  function get_layer_index (line 48) | def get_layer_index(key: str, layers_pattern: str = None):
  function _get_reverse_pack_map (line 63) | def _get_reverse_pack_map(model: nn.Module):
  function _get_key_map (line 73) | def _get_key_map(reverse_map: Dict[str, str]):
  function load_lora_weights (line 84) | def load_lora_weights(model: nn.Module, weights: Iterable[Tuple[str, tor...
  class AdapterManager (line 111) | class AdapterManager:
    method __init__ (line 114) | def __init__(self, adapters: Dict[str, str]):
    method get_adapter_ids (line 125) | def get_adapter_ids(self, names: List[str]):
    method num_adapters (line 128) | def num_adapters(self):

FILE: lmdeploy/pytorch/backends/activation.py
  class SiluAndMulImpl (line 5) | class SiluAndMulImpl(ABC):
    method forward (line 9) | def forward(self, x):
  class SiluAndMulBuilder (line 14) | class SiluAndMulBuilder(ABC):
    method build (line 19) | def build(inplace: bool = False):
  class GeluAndMulImpl (line 24) | class GeluAndMulImpl(ABC):
    method forward (line 28) | def forward(self, x):
  class GeluAndMulBuilder (line 33) | class GeluAndMulBuilder(ABC):
    method build (line 38) | def build(approximate: str = 'none'):

FILE: lmdeploy/pytorch/backends/apply_rotary_emb.py
  class ApplyRotaryEmbImpl (line 7) | class ApplyRotaryEmbImpl(ABC):
    method forward (line 11) | def forward(self, query: Tensor, key: Tensor, cos: Tensor, sin: Tensor...
  class ApplyRotaryEmbBuilder (line 16) | class ApplyRotaryEmbBuilder(ABC):
    method build (line 21) | def build():

FILE: lmdeploy/pytorch/backends/attention.py
  class AttentionMetadata (line 11) | class AttentionMetadata:
  class AttentionImpl (line 27) | class AttentionImpl(ABC, Generic[T]):
    method __init__ (line 30) | def __init__(
    method make_alibi_slopes (line 67) | def make_alibi_slopes(head_start: int, head_end: int, num_heads: int, ...
    method set_alibi_slopes (line 85) | def set_alibi_slopes(self, slopes: torch.Tensor):
    method forward (line 89) | def forward(
  class AttentionBuilder (line 107) | class AttentionBuilder(ABC, Generic[T]):
    method build (line 112) | def build(

FILE: lmdeploy/pytorch/backends/awq_modules.py
  class LinearW4A16Impl (line 8) | class LinearW4A16Impl(ABC):
    method update_weights (line 11) | def update_weights(self,
    method forward (line 20) | def forward(self,
  class LinearW4A16Builder (line 30) | class LinearW4A16Builder(ABC):
    method build (line 35) | def build(in_features: int,

FILE: lmdeploy/pytorch/backends/base.py
  class OpType (line 13) | class OpType(Enum):
  class OpsBackend (line 45) | class OpsBackend(ABC):
    method get_name (line 50) | def get_name() -> str:
    method get_layer_impl_builder (line 56) | def get_layer_impl_builder(cls, layer_type: OpType):
    method get_attention_metadata_cls (line 62) | def get_attention_metadata_cls():
    method get_k_block_shape (line 68) | def get_k_block_shape(
    method get_v_block_shape (line 79) | def get_v_block_shape(
    method update_step_context (line 89) | def update_step_context(cls, step_context):
    method build_graph_runner (line 97) | def build_graph_runner(model: torch.nn.Module, model_config: ModelConf...
    method device_count (line 104) | def device_count():
    method support_ray (line 109) | def support_ray():

FILE: lmdeploy/pytorch/backends/blockedf8_modules.py
  class LinearBlockedF8Impl (line 9) | class LinearBlockedF8Impl(ABC):
    method __init__ (line 12) | def __init__(self):
    method update_weights (line 15) | def update_weights(self, weight: torch.Tensor, scale: torch.Tensor, bi...
    method set_scale_fmt (line 19) | def set_scale_fmt(self, scale_fmt: Optional[str]):
    method forward (line 24) | def forward(self,
  class LinearBlockedF8Builder (line 37) | class LinearBlockedF8Builder(ABC):
    method build (line 42) | def build(in_features: int, out_features: int, bias: bool = True, dtyp...

FILE: lmdeploy/pytorch/backends/causal_conv1d.py
  class CausalConv1dImpl (line 7) | class CausalConv1dImpl(ABC):
    method conv1d_fn (line 11) | def conv1d_fn(self,
    method update_fn (line 22) | def update_fn(self,
  class CausalConv1dBuilder (line 33) | class CausalConv1dBuilder(ABC):
    method build (line 38) | def build():

FILE: lmdeploy/pytorch/backends/cuda/activation.py
  class TritonSiluAndMulImpl (line 7) | class TritonSiluAndMulImpl(SiluAndMulImpl):
    method __init__ (line 10) | def __init__(self, inplace: bool):
    method forward (line 13) | def forward(self, x):
  class TritonSiluAndMulBuilder (line 30) | class TritonSiluAndMulBuilder(SiluAndMulBuilder):
    method build (line 34) | def build(inplace: bool = False):

FILE: lmdeploy/pytorch/backends/cuda/apply_rotary_emb.py
  class TritonApplyRotaryEmbImpl (line 10) | class TritonApplyRotaryEmbImpl(ApplyRotaryEmbImpl):
    method forward (line 13) | def forward(self, query: Tensor, key: Tensor, cos: Tensor, sin: Tensor...
  class TritonApplyRotaryEmbBuilder (line 24) | class TritonApplyRotaryEmbBuilder(ApplyRotaryEmbBuilder):
    method build (line 28) | def build():

FILE: lmdeploy/pytorch/backends/cuda/attention/__init__.py
  function use_fa3_warning (line 26) | def use_fa3_warning():
  function _enable_fa3 (line 35) | def _enable_fa3(alibi: bool, learnable_sink: bool, block_sparse_size: in...
  function _normalize_sliding_window (line 53) | def _normalize_sliding_window(sliding_window):
  class TritonAttentionBuilder (line 69) | class TritonAttentionBuilder(AttentionBuilder[TritonAttentionMetadata]):
    method build (line 79) | def build(

FILE: lmdeploy/pytorch/backends/cuda/attention/default.py
  class TritonAttentionMetadata (line 14) | class TritonAttentionMetadata(AttentionMetadata):
  function _cdiv (line 56) | def _cdiv(a, b):
  class TritonAttentionImpl (line 69) | class TritonAttentionImpl(AttentionImpl[TritonAttentionMetadata]):
    method __init__ (line 72) | def __init__(
    method _get_max_q_seqlen (line 111) | def _get_max_q_seqlen(
    method _get_fill_meta (line 126) | def _get_fill_meta(
    method _fill_kv_cache_impl (line 138) | def _fill_kv_cache_impl(
    method _forward_decoding (line 177) | def _forward_decoding(
    method _forward_prefill (line 226) | def _forward_prefill(
    method forward (line 298) | def forward(

FILE: lmdeploy/pytorch/backends/cuda/attention/fa3.py
  class FA3Impl (line 11) | class FA3Impl(TritonAttentionImpl):
    method __init__ (line 24) | def __init__(
    method _get_max_q_seqlen (line 54) | def _get_max_q_seqlen(
    method _normalize_sliding_window (line 66) | def _normalize_sliding_window(self, sliding_window):
    method _decoding_speculative (line 81) | def _decoding_speculative(
    method _decoding_standard (line 126) | def _decoding_standard(
    method _forward_decoding (line 176) | def _forward_decoding(
    method _forward_prefill (line 210) | def _forward_prefill(
    method forward (line 275) | def forward(

FILE: lmdeploy/pytorch/backends/cuda/attention/mla.py
  function _cdiv (line 14) | def _cdiv(a, b):
  function _try_dynamic_compile (line 19) | def _try_dynamic_compile(func, *args, **kwargs):
  class NSAIndicesUpdater (line 29) | class NSAIndicesUpdater:
    method __init__ (line 36) | def __init__(self):
    method _update_decode_impl (line 40) | def _update_decode_impl(self, nsa_indices: torch.Tensor, block_offsets...
    method update_decode (line 51) | def update_decode(self, nsa_indices: torch.Tensor, block_offsets: torc...
    method _update_prefill_impl (line 59) | def _update_prefill_impl(self, nsa_indices: torch.Tensor, q_seqlens: t...
    method update_prefill (line 68) | def update_prefill(self, nsa_indices: torch.Tensor, q_seqlens: torch.T...
    method build (line 78) | def build():
  class FlashMLAImpl (line 82) | class FlashMLAImpl(TritonAttentionImpl):
    method __init__ (line 97) | def __init__(
    method _get_flash_mla_sparse_fwd (line 143) | def _get_flash_mla_sparse_fwd(self):
    method flash_mla_decoding (line 154) | def flash_mla_decoding(
    method _prefill_sparse (line 196) | def _prefill_sparse(self, query: torch.Tensor, flatten_k: torch.Tensor...
    method _prefill_triton (line 232) | def _prefill_triton(
    method _prefill_fa3 (line 271) | def _prefill_fa3(
    method run_flatten_kv_cache (line 315) | def run_flatten_kv_cache(self,
    method _get_max_q_seqlen (line 369) | def _get_max_q_seqlen(
    method _fill_kv_cache_impl (line 382) | def _fill_kv_cache_impl(self,
    method _forward_decoding (line 449) | def _forward_decoding(
    method _forward_prefill (line 472) | def _forward_prefill(
    method forward (line 520) | def forward(

FILE: lmdeploy/pytorch/backends/cuda/awq_modules.py
  function wq_gemm_forward (line 11) | def wq_gemm_forward(
  class AwqLinearW4A16Impl (line 43) | class AwqLinearW4A16Impl(LinearW4A16Impl):
    method __init__ (line 46) | def __init__(self, in_features: int, out_features: int, w_bit: int, gr...
    method forward (line 52) | def forward(self,
  class AwqLinearW4A16Builder (line 68) | class AwqLinearW4A16Builder(LinearW4A16Builder):
    method build (line 72) | def build(in_features: int,

FILE: lmdeploy/pytorch/backends/cuda/blockedf8_modules.py
  class TritonLinearBlockedF8Impl (line 16) | class TritonLinearBlockedF8Impl(LinearBlockedF8Impl):
    method __init__ (line 19) | def __init__(self, in_features: int, out_features: int, block_size: in...
    method forward (line 26) | def forward(self,
  class TritonLinearBlockedF8Builder (line 58) | class TritonLinearBlockedF8Builder(LinearBlockedF8Builder):
    method build (line 62) | def build(in_features: int, out_features: int, block_size: int = 128, ...
  class DeepGemmLinearBlockedF8Impl (line 73) | class DeepGemmLinearBlockedF8Impl(LinearBlockedF8Impl):
    method __init__ (line 76) | def __init__(self, in_features: int, out_features: int, block_size: in...
    method warmup (line 89) | def warmup(self, warmup_meta: WarmupMeta):
    method forward (line 112) | def forward(self,

FILE: lmdeploy/pytorch/backends/cuda/causal_conv1d.py
  class CausalConv1dTilelangImpl (line 10) | class CausalConv1dTilelangImpl(CausalConv1dImpl):
    method __init__ (line 13) | def __init__(self):
    method conv1d_fn (line 18) | def conv1d_fn(self,
    method update_fn (line 32) | def update_fn(self,
  class CausalConv1dDaoImpl (line 48) | class CausalConv1dDaoImpl(CausalConv1dTilelangImpl):
    method __init__ (line 50) | def __init__(self):
  function has_dao (line 61) | def has_dao():
  class CausalConv1dCudaBuilder (line 71) | class CausalConv1dCudaBuilder(CausalConv1dBuilder):
    method build (line 75) | def build() -> CausalConv1dImpl:

FILE: lmdeploy/pytorch/backends/cuda/flash_attention.py
  class TritonFlashAttentionImpl (line 7) | class TritonFlashAttentionImpl(FlashAttentionImpl):
    method __init__ (line 10) | def __init__(
    method forward (line 42) | def forward(self,
  class TritonFlashAttentionBuilder (line 71) | class TritonFlashAttentionBuilder(FlashAttentionBuilder):
    method build (line 75) | def build(

FILE: lmdeploy/pytorch/backends/cuda/gated_delta_rule.py
  function has_fla (line 11) | def has_fla():
  class CudaGatedDeltaRuleImpl (line 19) | class CudaGatedDeltaRuleImpl(GatedDeltaRuleImpl):
    method __init__ (line 21) | def __init__(self):
    method chunk_gated_delta_rule (line 30) | def chunk_gated_delta_rule(self,
    method fused_recurrent_gated_delta_rule (line 68) | def fused_recurrent_gated_delta_rule(self,
  class CudaGatedDeltaRuleBuilder (line 93) | class CudaGatedDeltaRuleBuilder(GatedDeltaRuleBuilder):
    method build (line 96) | def build() -> GatedDeltaRuleImpl:

FILE: lmdeploy/pytorch/backends/cuda/graph_runner.py
  function next_power_of_2 (line 22) | def next_power_of_2(n: int):
  function _get_capture_batch_size_impl (line 36) | def _get_capture_batch_size_impl(max_batches: int):
  function _false (line 54) | def _false(*args, **kwargs):
  class CUDASingleGraphRunner (line 59) | class CUDASingleGraphRunner:
    method __init__ (line 62) | def __init__(
    method capture (line 102) | def capture(self, **kwargs):
    method forward (line 127) | def forward(self, **kwargs):
    method __del__ (line 138) | def __del__(self):
  class CUDAGraphRunner (line 143) | class CUDAGraphRunner(GraphRunner):
    method __init__ (line 146) | def __init__(self, model: torch.nn.Module, model_config: ModelConfig, ...
    method check_enable_graph (line 164) | def check_enable_graph(self):
    method _try_compile_model_once (line 171) | def _try_compile_model_once(self):
    method _get_capture_tokens (line 182) | def _get_capture_tokens(self, batch_size: int):
    method get_graph_key (line 190) | def get_graph_key(self, input_ids: torch.Tensor, position_ids: torch.T...
    method _prepare_inputs (line 206) | def _prepare_inputs(self, **kwargs):
    method _get_max_tokens (line 214) | def _get_max_tokens(self, graph_key: tuple, input_ids: torch.Tensor, q...
    method __call__ (line 222) | def __call__(self, **kwargs):
    method prepare_inputs_for_generation (line 262) | def prepare_inputs_for_generation(
    method reset (line 281) | def reset(self):
    method update_inputs (line 293) | def update_inputs(self, inputs):
    method get_capture_batch_sizes (line 306) | def get_capture_batch_sizes(self) -> List[int]:

FILE: lmdeploy/pytorch/backends/cuda/lora.py
  class PackedLoRAInput (line 13) | class PackedLoRAInput:
  class TritonLoRAImpl (line 23) | class TritonLoRAImpl(LoRAImpl):
    method _make_packed_lora_input (line 27) | def _make_packed_lora_input(x, ctx_mgr):
    method forward (line 41) | def forward(self,
  class TritonLoRABuilder (line 84) | class TritonLoRABuilder(LoRABuilder):
    method build (line 88) | def build():

FILE: lmdeploy/pytorch/backends/cuda/moe/blocked_fp8.py
  class TritonFusedMoEBlockedF8Impl (line 22) | class TritonFusedMoEBlockedF8Impl(FusedMoEBlockedF8Impl):
    method __init__ (line 25) | def __init__(self,
    method ep_expert_list (line 38) | def ep_expert_list(self, world_size: int, rank: int):
    method forward (line 46) | def forward(self,
  class FusedDeepEpMoEBlockedF8Impl (line 90) | class FusedDeepEpMoEBlockedF8Impl(TritonFusedMoEBlockedF8Impl):
    method __init__ (line 92) | def __init__(self,
    method ep_expert_list (line 128) | def ep_expert_list(self, world_size: int, rank: int):
    method forward (line 141) | def forward(self,
    method do_renormalize (line 168) | def do_renormalize(self, topk_weights):
    method fusedmoe_build (line 171) | def fusedmoe_build(self, low_latency_mode: bool = False):
  class TritonFusedMoEBlockedF8Builder (line 186) | class TritonFusedMoEBlockedF8Builder(FusedMoEBlockedF8Builder):
    method build (line 190) | def build(top_k: int,

FILE: lmdeploy/pytorch/backends/cuda/moe/default.py
  class TritonFusedMoEImpl (line 21) | class TritonFusedMoEImpl(FusedMoEImpl):
    method __init__ (line 24) | def __init__(self, top_k: int, num_experts: int, renormalize: bool = F...
    method update_weights (line 29) | def update_weights(self, gate_up_weights: torch.Tensor, down_weights: ...
    method ep_expert_list (line 34) | def ep_expert_list(self, world_size: int, rank: int):
    method forward (line 42) | def forward(self,
  class FusedMoENormal (line 73) | class FusedMoENormal:
    method __init__ (line 75) | def __init__(
    method forward (line 99) | def forward(
    method capture (line 121) | def capture(self):
    method wait (line 124) | def wait(self, event):
    method dispatch_async (line 128) | def dispatch_async(self,
    method combine_async (line 138) | def combine_async(self, x: torch.Tensor, handle: tuple, previous_event...
    method release (line 141) | def release(self):
    method fusedmoe_forward (line 144) | def fusedmoe_forward(self, state, up_weight, down_weight):
  function _disposible_tensor (line 150) | def _disposible_tensor(tensor):
  function dispatch_ll (line 159) | def dispatch_ll(
  function dispatch_async_ll (line 200) | def dispatch_async_ll(
  class FusedMoELowLatency (line 230) | class FusedMoELowLatency:
    method __init__ (line 232) | def __init__(
    method experts (line 253) | def experts(
    method forward (line 279) | def forward(self,
    method wait (line 300) | def wait(self, event):
    method dispatch_async (line 303) | def dispatch_async(
    method combine_async (line 313) | def combine_async(
    method fusedmoe_forward (line 323) | def fusedmoe_forward(self, state, up_weight, down_weight):
  function build_deepep_moe (line 333) | def build_deepep_moe(
  class FusedMoEEPImpl (line 360) | class FusedMoEEPImpl(TritonFusedMoEImpl):
    method __init__ (line 363) | def __init__(
    method update_weights (line 398) | def update_weights(self, gate_up_weights: torch.Tensor, down_weights: ...
    method forward (line 401) | def forward(self,
    method ep_expert_list (line 425) | def ep_expert_list(self, world_size: int, rank: int):
    method do_renormalize (line 432) | def do_renormalize(self, topk_weights):
    method fusedmoe_build (line 435) | def fusedmoe_build(self, low_latency_mode: bool = False):
  class TritonFusedMoEBuilder (line 447) | class TritonFusedMoEBuilder(FusedMoEBuilder):
    method build (line 451) | def build(

FILE: lmdeploy/pytorch/backends/cuda/moe/ep_utils.py
  function split_inputs_by_attn_tp (line 10) | def split_inputs_by_attn_tp(
  function gather_outputs_by_attn_tp (line 37) | def gather_outputs_by_attn_tp(out_states: torch.Tensor, split_size: List...

FILE: lmdeploy/pytorch/backends/cuda/moe/w8a8.py
  class TritonFusedMoEW8A8Impl (line 16) | class TritonFusedMoEW8A8Impl(FusedMoEW8A8Impl):
    method __init__ (line 19) | def __init__(
    method update_weights (line 33) | def update_weights(self, gate_up_weights: torch.Tensor, down_weights: ...
    method forward (line 38) | def forward(self,
  class TritonFusedMoEW8A8Builder (line 77) | class TritonFusedMoEW8A8Builder(FusedMoEW8A8Builder):
    method build (line 81) | def build(

FILE: lmdeploy/pytorch/backends/cuda/moe_router.py
  function is_power_of_two (line 12) | def is_power_of_two(n):
  class TritonRouterNoauxTCImpl (line 16) | class TritonRouterNoauxTCImpl(DefaultRouterNoauxTCImpl):
    method __init__ (line 18) | def __init__(
    method should_enable_custom_kernel (line 42) | def should_enable_custom_kernel(self) -> bool:
    method forward (line 60) | def forward(self, logits: torch.Tensor, bias: torch.Tensor) -> Tuple[t...
  class TritonRouterNoauxTCBuilder (line 77) | class TritonRouterNoauxTCBuilder(RouterNoauxTCBuilder):
    method build (line 80) | def build(

FILE: lmdeploy/pytorch/backends/cuda/multinomial_sampling.py
  class TritonMultinomialSamplingImpl (line 10) | class TritonMultinomialSamplingImpl(MultinomialSamplingImpl):
    method forward (line 12) | def forward(self,
  class TritonMultinomialSamplingBuilder (line 21) | class TritonMultinomialSamplingBuilder(MultinomialSamplingBuilder):
    method build (line 24) | def build():

FILE: lmdeploy/pytorch/backends/cuda/norm.py
  class TritonRMSNormImpl (line 9) | class TritonRMSNormImpl(RMSNormImpl):
    method __init__ (line 12) | def __init__(self, hidden_size: int, eps: float = 1e-6):
    method forward (line 16) | def forward(self, x: torch.Tensor, weight: torch.Tensor, residual: tor...
  class TritonRMSNormBuilder (line 26) | class TritonRMSNormBuilder(RMSNormBuilder):
    method build (line 30) | def build(weight: torch.Tensor, eps: float = 1e-6):

FILE: lmdeploy/pytorch/backends/cuda/nsa.py
  class TritonNSAIndexFP8 (line 12) | class TritonNSAIndexFP8(BaseNSAIndexFP8):
    method __init__ (line 14) | def __init__(self, topk: int, softmax_scale: float, block_size: int, f...
    method forward (line 23) | def forward(self, q: Tensor, k: Tensor, weights: Tensor, k_cache: Tens...
  class TritonNSAIndexFP8Builder (line 68) | class TritonNSAIndexFP8Builder(BaseNSAIndexFP8Builder):
    method build (line 71) | def build(topk: int, softmax_scale: float, block_size: int = 128, fill...

FILE: lmdeploy/pytorch/backends/cuda/op_backend.py
  class CudaOpsBackend (line 15) | class CudaOpsBackend(DefaultOpsBackend):
    method get_name (line 19) | def get_name() -> str:
    method get_layer_impl_builder (line 24) | def get_layer_impl_builder(cls, layer_type: OpType):
    method get_attention_metadata_cls (line 85) | def get_attention_metadata_cls():
    method get_k_block_shape (line 91) | def get_k_block_shape(
    method get_v_block_shape (line 105) | def get_v_block_shape(
    method update_meta_flashmla (line 119) | def update_meta_flashmla(cls, attn_metadata, model_config: ModelConfig...
    method update_meta_flashattn (line 139) | def update_meta_flashattn(cls, attn_metadata, step_context):
    method update_step_context (line 162) | def update_step_context(cls, step_context):
    method build_graph_runner (line 207) | def build_graph_runner(model: torch.nn.Module, model_config: ModelConf...
    method device_count (line 225) | def device_count():
    method support_ray (line 230) | def support_ray():

FILE: lmdeploy/pytorch/backends/cuda/qmodules.py
  class TritonRMSNormW8A8Impl (line 14) | class TritonRMSNormW8A8Impl(RMSNormW8A8Impl):
    method __init__ (line 17) | def __init__(self, hidden_size: int, eps: float = 1e-6, quant_dtype: t...
    method forward (line 23) | def forward(self, x: torch.Tensor, weight: torch.Tensor, residual: tor...
  class TritonRMSNormBuilder (line 39) | class TritonRMSNormBuilder(RMSNormW8A8Builder):
    method build (line 43) | def build(hidden_size: int, eps: float = 1e-6, quant_dtype: torch.dtyp...
  class TritonLinearW8A8Impl (line 48) | class TritonLinearW8A8Impl(LinearW8A8Impl):
    method __init__ (line 51) | def __init__(self,
    method forward (line 61) | def forward(self,
  class TritonLinearW8A8Builder (line 87) | class TritonLinearW8A8Builder(LinearW8A8Builder):
    method build (line 91) | def build(in_features: int,

FILE: lmdeploy/pytorch/backends/cuda/token_dispatcher.py
  function get_buffer_common (line 25) | def get_buffer_common(
  function get_buffer_normal (line 57) | def get_buffer_normal(group: dist.ProcessGroup, hidden_bytes: int):
  function get_buffer_low_latency (line 77) | def get_buffer_low_latency(
  class DeepEPTokenDispatcher (line 105) | class DeepEPTokenDispatcher(TokenDispatcherImpl):
    method __init__ (line 110) | def __init__(
    method dispatch (line 135) | def dispatch(
    method dispatch_normal (line 166) | def dispatch_normal(
    method dispatch_normal_async (line 217) | def dispatch_normal_async(self,
    method combine (line 267) | def combine(self, hidden_states: torch.Tensor) -> torch.Tensor:
    method combine_normal (line 274) | def combine_normal(self, x: torch.Tensor, handle: Tuple, previous_even...
    method combine_normal_async (line 284) | def combine_normal_async(self, x: torch.Tensor, handle: Tuple, previou...
    method release (line 294) | def release(self):
    method get_number_of_tokens_per_expert (line 304) | def get_number_of_tokens_per_expert(self) -> torch.Tensor:
    method get_permuted_hidden_states_by_experts (line 308) | def get_permuted_hidden_states_by_experts(self,
    method get_restored_hidden_states_by_experts (line 328) | def get_restored_hidden_states_by_experts(
  class DeepEPTokenDispatcherLowLatency (line 350) | class DeepEPTokenDispatcherLowLatency(TokenDispatcherImpl):
    method __init__ (line 352) | def __init__(
    method dispatch (line 378) | def dispatch(
    method dispatch_async (line 407) | def dispatch_async(
    method combine (line 427) | def combine(
    method combine_async (line 444) | def combine_async(
  class TokenDispatcherBuilder (line 465) | class TokenDispatcherBuilder:
    method build (line 469) | def build(

FILE: lmdeploy/pytorch/backends/cuda/utils.py
  function has_tilelang (line 6) | def has_tilelang():

FILE: lmdeploy/pytorch/backends/cuda/warmup_manager.py
  class WarmupMeta (line 13) | class WarmupMeta:
  class WarmupManager (line 21) | class WarmupManager:
    method __init__ (line 23) | def __init__(self):
    method __contains__ (line 26) | def __contains__(self, key: str):
    method __getitem__ (line 30) | def __getitem__(self, key: str):
    method __setitem__ (line 34) | def __setitem__(self, key: str, val):
    method warmup (line 38) | def warmup(self, warmup_meta: WarmupMeta):
  function get_warmup_manager (line 50) | def get_warmup_manager():

FILE: lmdeploy/pytorch/backends/deepep_moe_checker.py
  class MoEBackend (line 6) | class MoEBackend:
    method __init__ (line 8) | def __init__(self):
    method set_deepep_moe_backend (line 12) | def set_deepep_moe_backend(self):
    method use_deepep_moe_backend (line 16) | def use_deepep_moe_backend(self):
  function get_moe_backend (line 21) | def get_moe_backend():

FILE: lmdeploy/pytorch/backends/default/activation.py
  class DefaultSiluAndMulImpl (line 8) | class DefaultSiluAndMulImpl(SiluAndMulImpl):
    method __init__ (line 11) | def __init__(self, inplace: bool):
    method forward (line 15) | def forward(self, x):
  class DefaultSiluAndMulBuilder (line 21) | class DefaultSiluAndMulBuilder(SiluAndMulBuilder):
    method build (line 25) | def build(inplace: bool = False):
  class DefaultGeluAndMulImpl (line 30) | class DefaultGeluAndMulImpl(GeluAndMulImpl):
    method __init__ (line 33) | def __init__(self, approximate: str = 'none'):
    method forward (line 36) | def forward(self, x):
  class DefaultGeluAndMulBuilder (line 42) | class DefaultGeluAndMulBuilder(GeluAndMulBuilder):
    method build (line 46) | def build(approximate: str = 'none'):

FILE: lmdeploy/pytorch/backends/default/apply_rotary_emb.py
  function rotate_half (line 8) | def rotate_half(x):
  class DefaultApplyRotaryEmbImpl (line 19) | class DefaultApplyRotaryEmbImpl(ApplyRotaryEmbImpl):
    method forward (line 22) | def forward(self, query: Tensor, key: Tensor, cos: Tensor, sin: Tensor...
  class DefaultApplyRotaryEmbBuilder (line 42) | class DefaultApplyRotaryEmbBuilder(ApplyRotaryEmbBuilder):
    method build (line 46) | def build():

FILE: lmdeploy/pytorch/backends/default/awq_modules.py
  function get_shifts (line 13) | def get_shifts(bits: int, device: torch.device):
  function unpack_awq (line 20) | def unpack_awq(qweight: torch.Tensor, qzeros: torch.Tensor, bits: int):
  function dequantize_gemm (line 38) | def dequantize_gemm(qweight, qzeros, scales, bits, group_size):
  class DefaultLinearW4A16Impl (line 50) | class DefaultLinearW4A16Impl(LinearW4A16Impl):
    method __init__ (line 53) | def __init__(self, in_features: int, out_features: int, w_bit: int, gr...
    method forward (line 59) | def forward(self,
  class DefaultLinearW4A16Builder (line 85) | class DefaultLinearW4A16Builder(LinearW4A16Builder):
    method build (line 89) | def build(in_features: int,

FILE: lmdeploy/pytorch/backends/default/embedding.py
  function get_masked_input_and_mask (line 9) | def get_masked_input_and_mask(input: torch.Tensor, start_index: int, end...
  class DefaultEmbeddingImpl (line 16) | class DefaultEmbeddingImpl(EmbeddingImpl):
    method __init__ (line 19) | def __init__(self, start_index: int, end_index: int):
    method forward (line 23) | def forward(self, x, weight: torch.Tensor, all_reduce: bool = False, g...
  class DefaultEmbeddingBuilder (line 36) | class DefaultEmbeddingBuilder(EmbeddingBuilder):
    method build (line 40) | def build(start_index: int, end_index: int):

FILE: lmdeploy/pytorch/backends/default/linear.py
  class DefaultLinearImpl (line 11) | class DefaultLinearImpl(LinearImpl):
    method forward (line 14) | def forward(self,
  class DefaultLinearBuilder (line 33) | class DefaultLinearBuilder(LinearBuilder):
    method build (line 37) | def build(in_features: int, out_features: int, bias: bool = True, dtyp...

FILE: lmdeploy/pytorch/backends/default/moe.py
  class DefaultSoftmaxTopKImpl (line 7) | class DefaultSoftmaxTopKImpl(SoftmaxTopKImpl):
    method __init__ (line 10) | def __init__(self, top_k: int, dim: int = -1, n_groups: int = -1):
    method forward (line 16) | def forward(self, x: torch.Tensor):
  class DefaultSoftmaxTopKBuilder (line 35) | class DefaultSoftmaxTopKBuilder(SoftmaxTopKBuilder):
    method build (line 39) | def build(top_k: int, dim: int = -1, n_groups: int = -1):

FILE: lmdeploy/pytorch/backends/default/moe_router.py
  function _compute_scores (line 10) | def _compute_scores(scoring_func: str, logits: torch.Tensor):
  function get_group_offsets (line 23) | def get_group_offsets(n_groups: int, group_size: int, device: str | torc...
  class DefaultRouterNoauxTCImpl (line 28) | class DefaultRouterNoauxTCImpl(RouterNoauxTCImpl):
    method __init__ (line 30) | def __init__(
    method _forward_router_n_groups (line 55) | def _forward_router_n_groups(self, scores_for_choice: torch.Tensor) ->...
    method _forward_default (line 67) | def _forward_default(self, scores: torch.Tensor, scores_for_choice: to...
    method renorm (line 83) | def renorm(self, topk_weight: torch.Tensor) -> torch.Tensor:
    method forward (line 93) | def forward(self, logits: torch.Tensor, bias: torch.Tensor) -> Tuple[t...
  class DefaultRouterNoauxTCBuilder (line 108) | class DefaultRouterNoauxTCBuilder(RouterNoauxTCBuilder):
    method build (line 111) | def build(

FILE: lmdeploy/pytorch/backends/default/multinomial_sampling.py
  class DefaultMultinomialSamplingImpl (line 8) | class DefaultMultinomialSamplingImpl(MultinomialSamplingImpl):
    method forward (line 11) | def forward(self,
  class DefaultMultinomialSamplingBuilder (line 22) | class DefaultMultinomialSamplingBuilder(MultinomialSamplingBuilder):
    method build (line 26) | def build():

FILE: lmdeploy/pytorch/backends/default/norm.py
  class DefaultRMSNormImpl (line 7) | class DefaultRMSNormImpl(RMSNormImpl):
    method __init__ (line 10) | def __init__(self, hidden_size: int, eps: float = 1e-6):
    method forward (line 14) | def forward(self, x: torch.Tensor, weight: torch.Tensor, residual: tor...
  class DefaultRMSNormBuilder (line 29) | class DefaultRMSNormBuilder(RMSNormBuilder):
    method build (line 33) | def build(hidden_size: int, eps: float = 1e-6):
  class DefaultLayerNormImpl (line 38) | class DefaultLayerNormImpl(LayerNormImpl):
    method __init__ (line 41) | def __init__(self, normalized_shape: int, eps: float = 1e-6):
    method forward (line 47) | def forward(self,
  class DefaultLayerNormBuilder (line 62) | class DefaultLayerNormBuilder(LayerNormBuilder):
    method build (line 66) | def build(normalized_shape: int, eps: float = 1e-6):

FILE: lmdeploy/pytorch/backends/default/op_backend.py
  class DefaultOpsBackend (line 9) | class DefaultOpsBackend(OpsBackend):
    method get_name (line 12) | def get_name() -> str:
    method get_layer_impl_builder (line 16) | def get_layer_impl_builder(cls, layer_type: OpType):
    method get_k_block_shape (line 58) | def get_k_block_shape(
    method get_v_block_shape (line 72) | def get_v_block_shape(
    method init (line 86) | def init():
    method ccl_backend (line 90) | def ccl_backend() -> str:

FILE: lmdeploy/pytorch/backends/default/rotary_embedding.py
  function safe_torch_compile (line 14) | def safe_torch_compile(**compile_kwargs):
  function _rotary_embedding_fwd (line 44) | def _rotary_embedding_fwd(position_ids: torch.Tensor,
  class RotaryEmbeddingImpl (line 74) | class RotaryEmbeddingImpl(RotaryEmbeddingImpl, nn.Module):
    method __init__ (line 77) | def __init__(self, dim: int, base: int = 10000, scaling_factor: float ...
    method forward (line 85) | def forward(self, x: torch.Tensor, position_ids: torch.Tensor):
  class LlamaDynamicNTKScalingRotaryEmbedding (line 98) | class LlamaDynamicNTKScalingRotaryEmbedding(RotaryEmbeddingImpl):
    method __init__ (line 104) | def __init__(self, dim: int, base: int = 10000, scaling_factor: float ...
    method _ntk_inv_freq (line 108) | def _ntk_inv_freq(self, seq_len: torch.Tensor):
    method forward (line 116) | def forward(self, x: torch.Tensor, position_ids: torch.Tensor):
  class Llama3RotaryEmbeddingImpl (line 134) | class Llama3RotaryEmbeddingImpl(RotaryEmbeddingImpl):
    method __init__ (line 137) | def __init__(
  function yarn_find_correction_dim (line 167) | def yarn_find_correction_dim(num_rotations, dim, base=10000, max_positio...
  function yarn_find_correction_range (line 173) | def yarn_find_correction_range(low_rot, high_rot, dim, base=10000, max_p...
  function yarn_get_mscale (line 183) | def yarn_get_mscale(scale=1, mscale=1):
  function yarn_linear_ramp_mask (line 190) | def yarn_linear_ramp_mask(min, max, dim):
  class YarnRotaryEmbeddingImpl (line 200) | class YarnRotaryEmbeddingImpl(RotaryEmbeddingImpl):
    method __init__ (line 203) | def __init__(self,
    method forward (line 244) | def forward(self, x: torch.Tensor, position_ids: torch.Tensor):
  class LongRoPEScalingRotaryEmbeddingImpl (line 258) | class LongRoPEScalingRotaryEmbeddingImpl(RotaryEmbeddingImpl):
    method __init__ (line 261) | def __init__(
    method forward (line 285) | def forward(self, x: torch.Tensor, position_ids: torch.Tensor):
  class FopeRotaryEmbeddingImpl (line 310) | class FopeRotaryEmbeddingImpl(RotaryEmbeddingImpl):
    method __init__ (line 312) | def __init__(self,
    method forward (line 335) | def forward(self, x: torch.Tensor, position_ids: torch.Tensor, sin_coe...
  class DefaultRotaryEmbeddingBuilder (line 372) | class DefaultRotaryEmbeddingBuilder(RotaryEmbeddingBuilder):
    method build (line 376) | def build(

FILE: lmdeploy/pytorch/backends/default/token_dispatcher.py
  class AlltoAllTokenDispatcher (line 9) | class AlltoAllTokenDispatcher(TokenDispatcherImpl):
    method __init__ (line 11) | def __init__(
    method sort_chunks_by_idxs (line 30) | def sort_chunks_by_idxs(self, input: torch.Tensor, split_sizes: torch....
    method all_to_all (line 37) | def all_to_all(self, group: torch.distributed.group, input_: torch.Ten...
    method preprocess (line 55) | def preprocess(self, routing_map: torch.Tensor, local_expert_indices) ...
    method dispatch (line 82) | def dispatch(self, hidden_states: torch.Tensor, topk_ids: torch.Tensor...
    method combine (line 108) | def combine(self, hidden_states: torch.Tensor) -> torch.Tensor:

FILE: lmdeploy/pytorch/backends/dlinfer/activation.py
  class DlinferSiluAndMulImpl (line 7) | class DlinferSiluAndMulImpl(SiluAndMulImpl):
    method forward (line 10) | def forward(self, x):
  class DlinferSiluAndMulBuilder (line 15) | class DlinferSiluAndMulBuilder(SiluAndMulBuilder):
    method build (line 19) | def build(inplace: bool = False):

FILE: lmdeploy/pytorch/backends/dlinfer/apply_rotary_emb.py
  class DlinferApplyRotaryEmbImpl (line 9) | class DlinferApplyRotaryEmbImpl(ApplyRotaryEmbImpl):
    method forward (line 12) | def forward(self, query: Tensor, key: Tensor, cos: Tensor, sin: Tensor...
  class DlinferApplyRotaryEmbBuilder (line 23) | class DlinferApplyRotaryEmbBuilder(ApplyRotaryEmbBuilder):
    method build (line 27) | def build():

FILE: lmdeploy/pytorch/backends/dlinfer/ascend/op_backend.py
  class SocVersion (line 25) | class SocVersion:
    method device_name (line 31) | def device_name(cls) -> str:
    method is_Ascend310P (line 41) | def is_Ascend310P(cls) -> bool:
    method is_Ascend910 (line 45) | def is_Ascend910(cls) -> bool:
    method soc_version (line 50) | def soc_version(cls) -> int:
    method is_A2 (line 54) | def is_A2(cls) -> bool:
    method is_A3 (line 58) | def is_A3(cls) -> bool:
  class DistMeta (line 63) | class DistMeta:
  class AscendKVQuantMeta (line 73) | class AscendKVQuantMeta:
    method set_value (line 78) | def set_value(cls, device: str, dtype: torch.dtype, record_file: str, ...
  class AscendOpsBackend (line 118) | class AscendOpsBackend(DlinferOpsBackend):
    method get_name (line 126) | def get_name() -> str:
    method get_k_block_shape (line 131) | def get_k_block_shape(
    method get_v_block_shape (line 143) | def get_v_block_shape(
    method update_step_context (line 155) | def update_step_context(cls, step_context):
    method build_graph_runner (line 432) | def build_graph_runner(model: torch.nn.Module, model_config: ModelConf...
    method init (line 441) | def init():
    method ccl_backend (line 453) | def ccl_backend():
    method device_count (line 457) | def device_count():
    method support_ray (line 462) | def support_ray():

FILE: lmdeploy/pytorch/backends/dlinfer/ascend/utils.py
  function nd_to_nz_spec (line 8) | def nd_to_nz_spec(tensor: torch.Tensor) -> torch.Tensor:

FILE: lmdeploy/pytorch/backends/dlinfer/attention.py
  class DlinferAttentionMetadata (line 12) | class DlinferAttentionMetadata(AttentionMetadata):
  class DlinferAttentionImpl (line 23) | class DlinferAttentionImpl(AttentionImpl[DlinferAttentionMetadata]):
    method __init__ (line 26) | def __init__(
    method forward (line 58) | def forward(
  class DlinferAttentionBuilder (line 150) | class DlinferAttentionBuilder(AttentionBuilder[DlinferAttentionMetadata]):
    method build (line 154) | def build(

FILE: lmdeploy/pytorch/backends/dlinfer/awq_modules.py
  class AwqLinearW4A16Impl (line 11) | class AwqLinearW4A16Impl(LinearW4A16Impl):
    method __init__ (line 14) | def __init__(self, in_features: int, out_features: int, w_bit: int, gr...
    method forward (line 20) | def forward(self,
  class AwqLinearW4A16Builder (line 33) | class AwqLinearW4A16Builder(LinearW4A16Builder):
    method build (line 37) | def build(in_features: int,

FILE: lmdeploy/pytorch/backends/dlinfer/camb/op_backend.py
  class CambOpsBackend (line 14) | class CambOpsBackend(DlinferOpsBackend):
    method get_name (line 19) | def get_name() -> str:
    method get_k_block_shape (line 24) | def get_k_block_shape(
    method get_v_block_shape (line 37) | def get_v_block_shape(
    method update_step_context (line 50) | def update_step_context(cls, step_context):
    method build_graph_runner (line 121) | def build_graph_runner(model: torch.nn.Module, model_config: ModelConf...
    method support_ray (line 128) | def support_ray():

FILE: lmdeploy/pytorch/backends/dlinfer/flash_attention.py
  class DlinferFlashAttentionImpl (line 7) | class DlinferFlashAttentionImpl(FlashAttentionImpl):
    method __init__ (line 10) | def __init__(
    method forward (line 38) | def forward(self,
  class DlinferFlashAttentionBuilder (line 71) | class DlinferFlashAttentionBuilder(FlashAttentionBuilder):
    method build (line 75) | def build(

FILE: lmdeploy/pytorch/backends/dlinfer/linear.py
  class DlinferLinearImpl (line 13) | class DlinferLinearImpl(LinearImpl):
    method update_weights (line 16) | def update_weights(self, weight: torch.Tensor, bias: Optional[torch.Te...
    method forward (line 22) | def forward(self,
  class DlinferLinearBuilder (line 37) | class DlinferLinearBuilder(LinearBuilder):
    method build (line 41) | def build(in_features: int, out_features: int, bias: bool = True, dtyp...

FILE: lmdeploy/pytorch/backends/dlinfer/maca/op_backend.py
  class MacaOpsBackend (line 14) | class MacaOpsBackend(DlinferOpsBackend):
    method get_name (line 19) | def get_name() -> str:
    method get_k_block_shape (line 24) | def get_k_block_shape(
    method get_v_block_shape (line 33) | def get_v_block_shape(
    method update_step_context (line 42) | def update_step_context(cls, step_context):
    method build_graph_runner (line 112) | def build_graph_runner(model: torch.nn.Module, model_config: ModelConf...
    method support_ray (line 119) | def support_ray():

FILE: lmdeploy/pytorch/backends/dlinfer/moe.py
  class DlinferSoftmaxTopKImpl (line 15) | class DlinferSoftmaxTopKImpl(SoftmaxTopKImpl):
    method __init__ (line 18) | def __init__(self, top_k: int, dim: int = -1, n_groups: int = -1):
    method forward (line 23) | def forward(self, x: torch.Tensor):
  class DlinferSoftmaxTopKBuilder (line 32) | class DlinferSoftmaxTopKBuilder(SoftmaxTopKBuilder):
    method build (line 36) | def build(top_k: int, dim: int = -1, n_groups: int = -1):
  class DlinferFusedMoEImpl (line 41) | class DlinferFusedMoEImpl(FusedMoEImpl):
    method __init__ (line 44) | def __init__(self,
    method update_weights (line 63) | def update_weights(self, gate_up_weights: torch.Tensor, down_weights: ...
    method ep_expert_list (line 72) | def ep_expert_list(self, world_size: int, rank: int):
    method forward (line 80) | def forward(self,
  class DlinferFusedMoEBuilder (line 102) | class DlinferFusedMoEBuilder(FusedMoEBuilder):
    method build (line 106) | def build(top_k: int,

FILE: lmdeploy/pytorch/backends/dlinfer/norm.py
  class DlinferRMSNormImpl (line 9) | class DlinferRMSNormImpl(RMSNormImpl):
    method __init__ (line 12) | def __init__(self, hidden_size: int, eps: float = 1e-6):
    method forward (line 16) | def forward(self, x: torch.Tensor, weight: torch.Tensor, residual: tor...
  class DlinferRMSNormBuilder (line 26) | class DlinferRMSNormBuilder(RMSNormBuilder):
    method build (line 30) | def build(weight: torch.Tensor, eps: float = 1e-6):

FILE: lmdeploy/pytorch/backends/dlinfer/op_backend.py
  class DlinferOpsBackend (line 14) | class DlinferOpsBackend(DefaultOpsBackend):
    method get_name (line 18) | def get_name() -> str:
    method get_layer_impl_builder (line 23) | def get_layer_impl_builder(cls, layer_type: OpType):
    method get_attention_metadata_cls (line 66) | def get_attention_metadata_cls():
    method get_k_block_shape (line 71) | def get_k_block_shape(
    method get_v_block_shape (line 84) | def get_v_block_shape(
    method update_step_context (line 97) | def update_step_context(cls, step_context):

FILE: lmdeploy/pytorch/backends/dlinfer/qmodules.py
  class DlinferLinearW8A8Impl (line 14) | class DlinferLinearW8A8Impl(LinearW8A8Impl):
    method __init__ (line 17) | def __init__(self,
    method update_weights (line 27) | def update_weights(self, weight: torch.Tensor, scale: torch.Tensor, bi...
    method forward (line 34) | def forward(self,
  class DlinferLinearW8A8Builder (line 54) | class DlinferLinearW8A8Builder(LinearW8A8Builder):
    method build (line 58) | def build(in_features: int,
  class DlinferRMSNormW8A8Impl (line 67) | class DlinferRMSNormW8A8Impl(RMSNormW8A8Impl):
    method __init__ (line 70) | def __init__(self, hidden_size: int, eps: float = 1e-6, quant_dtype: t...
    method forward (line 76) | def forward(self, x: torch.Tensor, weight: torch.Tensor, residual: tor...
  class DlinferRMSNormW8A8Builder (line 88) | class DlinferRMSNormW8A8Builder(RMSNormW8A8Builder):
    method build (line 92) | def build(hidden_size: int, eps: float = 1e-6, quant_dtype: torch.dtyp...

FILE: lmdeploy/pytorch/backends/dlinfer/rotary_embedding.py
  function _rotary_embedding_fwd (line 14) | def _rotary_embedding_fwd(position_ids: torch.Tensor,
  class DlinferRotaryEmbeddingImpl (line 41) | class DlinferRotaryEmbeddingImpl(RotaryEmbeddingImpl, nn.Module):
    method __init__ (line 44) | def __init__(self, dim: int, base: int = 10000, scaling_factor: float ...
    method forward (line 54) | def forward(self, x, position_ids):
  class DlinferLlamaDynamicNTKScalingRotaryEmbedding (line 63) | class DlinferLlamaDynamicNTKScalingRotaryEmbedding(LlamaDynamicNTKScalin...
    method __init__ (line 69) | def __init__(self, dim: int, base: int = 10000, scaling_factor: float ...
    method _ntk_inv_freq (line 77) | def _ntk_inv_freq(self, seq_len: torch.Tensor):
    method forward (line 83) | def forward(self, x: torch.Tensor, position_ids: torch.Tensor):
  class DlinferLlama3RotaryEmbeddingImpl (line 96) | class DlinferLlama3RotaryEmbeddingImpl(DlinferRotaryEmbeddingImpl):
    method __init__ (line 99) | def __init__(
  class DlinferYarnRotaryEmbeddingImpl (line 129) | class DlinferYarnRotaryEmbeddingImpl(YarnRotaryEmbeddingImpl):
    method __init__ (line 132) | def __init__(self,
    method forward (line 140) | def forward(self, x: torch.Tensor, position_ids: torch.Tensor):
  class DlinferRotaryEmbeddingBuilder (line 148) | class DlinferRotaryEmbeddingBuilder(RotaryEmbeddingBuilder):
    method build (line 152) | def build(

FILE: lmdeploy/pytorch/backends/embedding.py
  class EmbeddingImpl (line 8) | class EmbeddingImpl(ABC):
    method forward (line 12) | def forward(self, x, weight: torch.Tensor, all_reduce: bool = False, g...
  class EmbeddingBuilder (line 17) | class EmbeddingBuilder(ABC):
    method build (line 22) | def build(start_index: int, end_index: int):

FILE: lmdeploy/pytorch/backends/flash_attention.py
  class FlashAttentionImpl (line 7) | class FlashAttentionImpl(ABC):
    method forward (line 10) | def forward(self,
  class FlashAttentionBuilder (line 23) | class FlashAttentionBuilder(ABC):
    method build (line 28) | def build(

FILE: lmdeploy/pytorch/backends/gated_delta_rule.py
  class GatedDeltaRuleImpl (line 7) | class GatedDeltaRuleImpl(ABC):
    method chunk_gated_delta_rule (line 11) | def chunk_gated_delta_rule(self,
    method fused_recurrent_gated_delta_rule (line 27) | def fused_recurrent_gated_delta_rule(self,
  class GatedDeltaRuleBuilder (line 42) | class GatedDeltaRuleBuilder(ABC):
    method build (line 47) | def build() -> GatedDeltaRuleImpl:

FILE: lmdeploy/pytorch/backends/graph_runner.py
  class GraphRunnerMeta (line 13) | class GraphRunnerMeta:
  function _get_capture_batch_size_impl (line 18) | def _get_capture_batch_size_impl(max_batches: int):
  class GraphRunner (line 29) | class GraphRunner:
    method __init__ (line 32) | def __init__(self, model: torch.nn.Module, model_config: ModelConfig, ...
    method __call__ (line 42) | def __call__(self, **kwargs):
    method get_model (line 46) | def get_model(self):
    method get_logits (line 50) | def get_logits(self, hidden_states: torch.Tensor):
    method prepare_inputs_for_generation (line 56) | def prepare_inputs_for_generation(
    method update_model_metas (line 69) | def update_model_metas(
    method get_input_processor (line 85) | def get_input_processor(self):
    method reset (line 92) | def reset(self):
    method get_meta (line 96) | def get_meta(self):
    method update_inputs (line 100) | def update_inputs(self, inputs):
    method get_capture_batch_sizes (line 103) | def get_capture_batch_sizes(self) -> List[int]:

FILE: lmdeploy/pytorch/backends/linear.py
  class LinearImpl (line 9) | class LinearImpl(ABC):
    method update_weights (line 12) | def update_weights(self, weight: torch.Tensor, bias: Optional[torch.Te...
    method forward (line 17) | def forward(self,
  class LinearBuilder (line 29) | class LinearBuilder(ABC):
    method build (line 34) | def build(in_features: int, out_features: int, bias: bool = True, dtyp...

FILE: lmdeploy/pytorch/backends/lora.py
  class AdapterInfo (line 11) | class AdapterInfo:
    method __post_init__ (line 21) | def __post_init__(self):
  class LoRAImpl (line 30) | class LoRAImpl(ABC):
    method forward (line 34) | def forward(self,
  class LoRABuilder (line 47) | class LoRABuilder(ABC):
    method build (line 52) | def build():

FILE: lmdeploy/pytorch/backends/moe.py
  class SoftmaxTopKImpl (line 10) | class SoftmaxTopKImpl(ABC):
    method get_group_offsets (line 15) | def get_group_offsets(n_groups: int, group_size: int, device: str):
    method forward (line 20) | def forward(self, x: torch.Tensor):
  class SoftmaxTopKBuilder (line 25) | class SoftmaxTopKBuilder(ABC):
    method build (line 30) | def build(top_k: int, dim: int = -1, n_groups: int = -1):
  class FusedMoEImpl (line 35) | class FusedMoEImpl(ABC):
    method update_weights (line 38) | def update_weights(self, gate_up_weights: torch.Tensor, down_weights: ...
    method ep_expert_list (line 42) | def ep_expert_list(self, world_size: int, rank: int):
    method forward (line 47) | def forward(self,
  class FusedMoEBuilder (line 61) | class FusedMoEBuilder(ABC):
    method build (line 66) | def build(top_k: int,
  class FusedMoEW8A8Impl (line 78) | class FusedMoEW8A8Impl(ABC):
    method update_weights (line 81) | def update_weights(self, gate_up_weights: torch.Tensor, down_weights: ...
    method ep_expert_list (line 86) | def ep_expert_list(self, world_size: int, rank: int):
    method forward (line 91) | def forward(self,
  class FusedMoEW8A8Builder (line 105) | class FusedMoEW8A8Builder(ABC):
    method build (line 110) | def build(top_k: int,
  class FusedMoEBlockedF8Impl (line 119) | class FusedMoEBlockedF8Impl(ABC):
    method __init__ (line 122) | def __init__(self):
    method update_weights (line 125) | def update_weights(self, gate_up_weights: torch.Tensor, down_weights: ...
    method ep_expert_list (line 130) | def ep_expert_list(self, world_size: int, rank: int):
    method set_scale_fmt (line 134) | def set_scale_fmt(self, scale_fmt: Optional[str]):
    method forward (line 139) | def forward(self,
  class FusedMoEBlockedF8Builder (line 156) | class FusedMoEBlockedF8Builder(ABC):
    method build (line 161) | def build(top_k: int,

FILE: lmdeploy/pytorch/backends/moe_router.py
  class RouterNoauxTCImpl (line 8) | class RouterNoauxTCImpl(ABC):
    method forward (line 12) | def forward(self, logits: torch.Tensor, bias: torch.Tensor) -> Tuple[t...
  class RouterNoauxTCBuilder (line 17) | class RouterNoauxTCBuilder(ABC):
    method build (line 22) | def build(

FILE: lmdeploy/pytorch/backends/multinomial_sampling.py
  class MultinomialSamplingImpl (line 7) | class MultinomialSamplingImpl(ABC):
    method forward (line 11) | def forward(scores: torch.Tensor, seeds: torch.LongTensor, offsets: to...
  class MultinomialSamplingBuilder (line 16) | class MultinomialSamplingBuilder(ABC):
    method build (line 21) | def build():

FILE: lmdeploy/pytorch/backends/norm.py
  class RMSNormImpl (line 7) | class RMSNormImpl(ABC):
    method forward (line 11) | def forward(self, x: torch.Tensor, weight: torch.Tensor, residual: tor...
  class RMSNormBuilder (line 16) | class RMSNormBuilder(ABC):
    method build (line 21) | def build(hidden_size: int, eps: float = 1e-6):
  class LayerNormImpl (line 26) | class LayerNormImpl(ABC):
    method forward (line 30) | def forward(self, x: torch.Tensor, weight: torch.Tensor, bias: torch.T...
  class LayerNormBuilder (line 35) | class LayerNormBuilder(ABC):
    method build (line 40) | def build(normalized_shape: int, eps: float = 1e-6):

FILE: lmdeploy/pytorch/backends/nsa.py
  class NSAIndexMeta (line 9) | class NSAIndexMeta:
  class BaseNSAIndexFP8 (line 19) | class BaseNSAIndexFP8(ABC):
    method forward (line 22) | def forward(self, q: Tensor, k: Tensor, weights: Tensor, k_cache: Tens...
  class BaseNSAIndexFP8Builder (line 28) | class BaseNSAIndexFP8Builder:
    method build (line 32) | def build(topk: int, softmax_scale: float, block_size: int = 128, fill...

FILE: lmdeploy/pytorch/backends/qmodules.py
  class RMSNormW8A8Impl (line 8) | class RMSNormW8A8Impl(ABC):
    method create_weight (line 12) | def create_weight(hidden_size: int, dtype: torch.dtype = None, device:...
    method forward (line 22) | def forward(self, x: torch.Tensor, weight: torch.Tensor, residual: tor...
  class RMSNormW8A8Builder (line 27) | class RMSNormW8A8Builder(ABC):
    method build (line 32) | def build(hidden_size: int, eps: float = 1e-6, quant_dtype: torch.dtyp...
  class LinearW8A8Impl (line 37) | class LinearW8A8Impl(ABC):
    method update_weights (line 40) | def update_weights(self, weight: torch.Tensor, scale: torch.Tensor, bi...
    method forward (line 45) | def forward(self,
  class LinearW8A8Builder (line 56) | class LinearW8A8Builder(ABC):
    method build (line 61) | def build(in_features: int,

FILE: lmdeploy/pytorch/backends/rotary_embedding.py
  class RopeType (line 10) | class RopeType(Enum):
  class YarnParameters (line 22) | class YarnParameters:
  class LongRoPEScalingParameters (line 33) | class LongRoPEScalingParameters:
  class Llama3Parameters (line 43) | class Llama3Parameters:
  class FopeParameters (line 51) | class FopeParameters:
  class RotaryEmbeddingImpl (line 59) | class RotaryEmbeddingImpl(ABC):
    method forward (line 63) | def forward(self, x, position_ids, **kwargs):
  class RotaryEmbeddingBuilder (line 68) | class RotaryEmbeddingBuilder(ABC):
    method build (line 73) | def build(

FILE: lmdeploy/pytorch/backends/selector.py
  function _get_backend (line 5) | def _get_backend():
  function get_backend (line 28) | def get_backend(backend_type: str = None):
  function init_backend (line 39) | def init_backend(backend_type: str):

FILE: lmdeploy/pytorch/backends/token_dispatcher.py
  class TokenDispatcherImpl (line 8) | class TokenDispatcherImpl(ABC):
    method permute (line 11) | def permute(
    method unpermute (line 25) | def unpermute(
    method indices_to_multihot (line 43) | def indices_to_multihot(self, topk_ids, topk_weight, num_experts):
    method dispatch (line 65) | def dispatch(self, hidden_states: torch.Tensor, probs: torch.Tensor, t...
    method combine (line 71) | def combine(self, hidden_states: torch.Tensor) -> torch.Tensor:

FILE: lmdeploy/pytorch/block.py
  function _div_up (line 5) | def _div_up(x, n):
  function _round_up (line 10) | def _round_up(x, n):
  class LogicalTokenBlocks (line 15) | class LogicalTokenBlocks:
    method __init__ (line 19) | def __init__(self, blocks: np.ndarray = None):
    method reserve (line 29) | def reserve(self, size: int):
    method __setitem__ (line 37) | def __setitem__(self, *args, **kwargs):
    method __getitem__ (line 41) | def __getitem__(self, *args, **kwargs):
    method get_real_blocks (line 45) | def get_real_blocks(self):
    method append (line 49) | def append(self, blocks: np.ndarray):
    method __len__ (line 58) | def __len__(self):
    method resize (line 62) | def resize(self, num_blocks: int):
    method reset (line 67) | def reset(self):
    method clone (line 72) | def clone(self):

FILE: lmdeploy/pytorch/check_env/adapter.py
  class AdapterChecker (line 5) | class AdapterChecker(BaseChecker):
    method __init__ (line 8) | def __init__(self, adapter_path: str, logger=None):
    method check (line 12) | def check(self):

FILE: lmdeploy/pytorch/check_env/base.py
  function _red_text (line 11) | def _red_text(text: str):
  class BaseChecker (line 18) | class BaseChecker:
    method __init__ (line 21) | def __init__(self, logger: Logger = None):
    method get_logger (line 28) | def get_logger(self):
    method register_required_checker (line 32) | def register_required_checker(self, checker: 'BaseChecker'):
    method handle (line 36) | def handle(self):
    method log_and_exit (line 47) | def log_and_exit(self, e: Exception = None, mod_name: str = None, mess...
    method check (line 59) | def check(self):

FILE: lmdeploy/pytorch/check_env/cuda.py
  class CudaChecker (line 5) | class CudaChecker(BaseChecker):
    method __init__ (line 8) | def __init__(self, model_format: str = None, logger=None) -> None:
    method check (line 12) | def check(self):

FILE: lmdeploy/pytorch/check_env/deeplink.py
  class DeeplinkChecker (line 7) | class DeeplinkChecker(BaseChecker):
    method __init__ (line 10) | def __init__(self, device_type: str, logger=None) -> None:
    method check (line 14) | def check(self):

FILE: lmdeploy/pytorch/check_env/dist.py
  class DistChecker (line 9) | class DistChecker(BaseChecker):
    method __init__ (line 12) | def __init__(self, tp: int, dp: int, ep: int, distributed_executor_bac...
    method check (line 22) | def check(self):

FILE: lmdeploy/pytorch/check_env/model.py
  class ModelChecker (line 7) | class ModelChecker(BaseChecker):
    method __init__ (line 10) | def __init__(self, model_path: str, trust_remote_code: bool, dtype: st...
    method check_config (line 17) | def check_config(self, trans_version):
    method check_trans_version (line 31) | def check_trans_version(self, config, trans_version):
    method check_dtype (line 44) | def check_dtype(self, config):
    method check (line 72) | def check(self):

FILE: lmdeploy/pytorch/check_env/torch.py
  class TorchChecker (line 5) | class TorchChecker(BaseChecker):
    method __init__ (line 8) | def __init__(self, device: str = 'cuda', logger=None) -> None:
    method check (line 12) | def check(self):

FILE: lmdeploy/pytorch/check_env/transformers.py
  class TransformersChecker (line 10) | class TransformersChecker(BaseChecker):
    method check (line 13) | def check(self):

FILE: lmdeploy/pytorch/check_env/triton.py
  class TritonChecker (line 10) | class TritonChecker(BaseChecker):
    method check_version (line 13) | def check_version(self):
    method check (line 31) | def check(self):

FILE: lmdeploy/pytorch/check_env/triton_custom_add.py
  function _add_kernel (line 8) | def _add_kernel(A, B, C, size, BLOCK: tl.constexpr):
  function custom_add (line 17) | def custom_add(a, b):

FILE: lmdeploy/pytorch/config.py
  function _update_torch_dtype (line 16) | def _update_torch_dtype(config: 'ModelConfig', dtype: str, device_type: ...
  class BackendConfig (line 64) | class BackendConfig:
  class SchedulerConfig (line 71) | class SchedulerConfig:
  class CacheConfig (line 83) | class CacheConfig:
    method __post_init__ (line 106) | def __post_init__(self):
  class TPMode (line 113) | class TPMode(enum.Enum):
  class DistConfig (line 120) | class DistConfig:
    method __post_init__ (line 138) | def __post_init__(self):
    method get_tp_by_layer (line 183) | def get_tp_by_layer(self, layer_type: str):
    method from_engine_config (line 198) | def from_engine_config(cls, engine_config: PytorchEngineConfig):
  function _override_hf_config_dict (line 214) | def _override_hf_config_dict(hf_config: dict, key: str, hf_overrides):
  function _overide_hf_config_cfg (line 234) | def _overide_hf_config_cfg(hf_config: list, key: str, hf_overrides):
  function _override_hf_config (line 252) | def _override_hf_config(hf_config: Any, key: str, hf_overrides):
  function override_hf_config (line 260) | def override_hf_config(hf_config: Any, hf_overrides: Dict[str, Any]):
  function _default_check_env (line 266) | def _default_check_env(device: str):
  function _patch_quantization_config (line 270) | def _patch_quantization_config(hf_config: Any, model_format: str = None):
  class ModelConfig (line 300) | class ModelConfig:
    method get_head_size (line 347) | def get_head_size(self):
    method from_pretrained (line 352) | def from_pretrained(
    method from_hf_config (line 413) | def from_hf_config(
  class UnmaskingStrategy (line 459) | class UnmaskingStrategy(enum.Enum):
    method from_str (line 470) | def from_str(cls, strategy: str):
  class DLLMConfig (line 484) | class DLLMConfig:
  class MiscConfig (line 492) | class MiscConfig:
    method from_engine_config (line 505) | def from_engine_config(cls, engine_config: PytorchEngineConfig):
  class SpecDecodeConfig (line 528) | class SpecDecodeConfig:
    method from_config (line 536) | def from_config(
  class QuantizationConfig (line 574) | class QuantizationConfig:
    method from_config (line 586) | def from_config(cls, hf_config: Any):
    method get_quant_method (line 644) | def get_quant_method(self, prefix: str = ''):
    method get (line 653) | def get(self, key, default=None):

FILE: lmdeploy/pytorch/configurations/builder.py
  class AutoModelConfigBuilder (line 9) | class AutoModelConfigBuilder(ABC):
    method __init_subclass__ (line 13) | def __init_subclass__(cls) -> None:
    method register_builder (line 18) | def register_builder(cls, sub_cls):
    method condition (line 24) | def condition(cls, hf_config):
    method build (line 29) | def build(cls, hf_config, model_path: str = None, **kwargs):
    method update_num_kv_heads (line 56) | def update_num_kv_heads(cls, hf_config, tp, num_key_value_heads):

FILE: lmdeploy/pytorch/configurations/chatglm.py
  class ChatGLMModelConfigBuilder (line 7) | class ChatGLMModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 10) | def condition(cls, hf_config):
    method build (line 15) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/cogvlm.py
  class CogVLMModelConfigBuilder (line 6) | class CogVLMModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 9) | def condition(cls, hf_config):
    method build (line 15) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/deepseek_v2.py
  class DeepseekV2ModelConfigBuilder (line 8) | class DeepseekV2ModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 11) | def condition(cls, hf_config):
    method build (line 16) | def build(cls, hf_config, model_path: str = None, is_draft_model: bool...

FILE: lmdeploy/pytorch/configurations/deepseek_v32.py
  function _check_env_v32 (line 7) | def _check_env_v32(device: str = 'cuda'):
  class DeepseekV32ModelConfigBuilder (line 27) | class DeepseekV32ModelConfigBuilder(DeepseekV2ModelConfigBuilder):
    method condition (line 30) | def condition(cls, hf_config):
    method build (line 35) | def build(cls, hf_config, model_path: str | None = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/deepseek_vl2.py
  class DeepseekVLV2ModelConfigBuilder (line 6) | class DeepseekVLV2ModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 9) | def condition(cls, hf_config):
    method build (line 14) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/default.py
  class DefaultModelConfigBuilder (line 7) | class DefaultModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 10) | def condition(cls, hf_config):
    method build (line 15) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/gemma.py
  class GemmaModelConfigBuilder (line 6) | class GemmaModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 9) | def condition(cls, hf_config):
    method build (line 14) | def build(cls, hf_config, model_path: str = None, **kwargs):
  class GemmaVLModelConfigBuilder (line 21) | class GemmaVLModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 24) | def condition(cls, hf_config):
    method build (line 30) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/glm4.py
  class Glm4MoeLiteModelConfigBuilder (line 6) | class Glm4MoeLiteModelConfigBuilder(DeepseekV2ModelConfigBuilder):
    method condition (line 9) | def condition(cls, hf_config):
    method build (line 14) | def build(cls, hf_config, model_path: str = None, is_draft_model: bool...
  class Glm4MoeModelConfigBuilder (line 28) | class Glm4MoeModelConfigBuilder(DefaultModelConfigBuilder):
    method condition (line 31) | def condition(cls, hf_config):
    method build (line 36) | def build(cls, hf_config, model_path: str = None, is_draft_model: bool...

FILE: lmdeploy/pytorch/configurations/gpt_oss.py
  class GptOSSModelConfigBuilder (line 6) | class GptOSSModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 9) | def condition(cls, hf_config):
    method build (line 14) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/interns1_pro.py
  class InterS1ProModelConfigBuilder (line 6) | class InterS1ProModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 9) | def condition(cls, hf_config):
    method build (line 14) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/internvl.py
  class InternVLModelConfigBuilder (line 6) | class InternVLModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 9) | def condition(cls, hf_config):
    method build (line 14) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/internvl3_hf.py
  class InternVL3ModelConfigBuilder (line 6) | class InternVL3ModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 9) | def condition(cls, hf_config):
    method build (line 14) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/llama.py
  class LlamaModelConfigBuilder (line 6) | class LlamaModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 9) | def condition(cls, hf_config):
    method build (line 14) | def build(cls, hf_config, model_path: str = None, is_draft_model: bool...

FILE: lmdeploy/pytorch/configurations/llama4.py
  class Llama4ModelConfigBuilder (line 6) | class Llama4ModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 9) | def condition(cls, hf_config):
    method build (line 14) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/llava_hf.py
  class LlavaHfModelConfigBuilder (line 7) | class LlavaHfModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 10) | def condition(cls, hf_config):
    method build (line 15) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/minicpm3.py
  class MiniCPM3ModelConfigBuilder (line 7) | class MiniCPM3ModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 10) | def condition(cls, hf_config):
    method build (line 15) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/qwen.py
  class QwenModelConfigBuilder (line 6) | class QwenModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 9) | def condition(cls, hf_config):
    method build (line 14) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/qwen3_5.py
  class Qwen3_5ModelConfigBuilder (line 11) | class Qwen3_5ModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 14) | def condition(cls, hf_config):
    method build (line 19) | def build(cls, hf_config, model_path: str = None, tp: int = 1, **kwargs):

FILE: lmdeploy/pytorch/configurations/qwen3_next.py
  function _check_env_qwen3_next (line 8) | def _check_env_qwen3_next(device: str):
  class Qwen3NextModelConfigBuilder (line 19) | class Qwen3NextModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 22) | def condition(cls, hf_config):
    method build (line 27) | def build(cls, hf_config, model_path: str = None, tp: int = 1, **kwargs):

FILE: lmdeploy/pytorch/configurations/qwen3_vl.py
  class Qwen3VLModelConfigBuilder (line 6) | class Qwen3VLModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 9) | def condition(cls, hf_config):
    method build (line 14) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/sdar.py
  class SDARModelConfigBuilder (line 5) | class SDARModelConfigBuilder(AutoModelConfigBuilder):
    method condition (line 8) | def condition(cls, hf_config):
    method build (line 13) | def build(cls, hf_config, model_path: str = None, **kwargs):

FILE: lmdeploy/pytorch/configurations/utils.py
  function flash_mla_available (line 9) | def flash_mla_available():
  function flash_attn_v3_available (line 26) | def flash_attn_v3_available():

FILE: lmdeploy/pytorch/devices/device_manager.py
  class DeviceContext (line 9) | class DeviceContext:
  class DeviceManager (line 17) | class DeviceManager(CtxMgrBase[DeviceContext]):
    method __init__ (line 19) | def __init__(self):
    method register_context_callback (line 24) | def register_context_callback(self, callback: Callable):
    method unregister_context_callback (line 31) | def unregister_context_callback(self, handle: int):
  function get_device_manager (line 36) | def get_device_manager():

FILE: lmdeploy/pytorch/disagg/backend/base.py
  class MigrationBackendImpl (line 9) | class MigrationBackendImpl:
    method p2p_initialize (line 12) | def p2p_initialize(self, init_request: DistServeInitRequest):
    method register_memory_region (line 16) | def register_memory_region(self, register_mr_request: DistServeRegiste...
    method endpoint_info (line 20) | def endpoint_info(self, remote_engine_id: str, protocol: MigrationProt...
    method p2p_connect (line 24) | def p2p_connect(self, remote_engine_id: str, conn_req: DistServeKVTran...
    method p2p_migrate (line 28) | def p2p_migrate(self, assignment: MigrationAssignment, async_op: bool ...
    method store (line 32) | def store(self, assignment: MigrationAssignment, async_op: bool = False):
    method load (line 36) | def load(self, assignment: MigrationAssignment, async_op: bool = False):

FILE: lmdeploy/pytorch/disagg/backend/dlslime.py
  class DLSlimeMigrationManagement (line 22) | class DLSlimeMigrationManagement:
    method __init__ (line 24) | def __init__(self, init_request: DistServeInitRequest):
    method register_memory_region (line 46) | def register_memory_region(self, register_mr_request: DistServeRegiste...
    method connect (line 54) | def connect(self, kvtransfer_endpoint_info: DistServeKVTransferEndpoin...
    method p2p_migrate (line 57) | async def p2p_migrate(self, assignment: MigrationAssignment):
  class DLSlimeBackend (line 75) | class DLSlimeBackend(MigrationBackendImpl):
    method __init__ (line 78) | def __init__(self):
    method p2p_initialize (line 81) | def p2p_initialize(self, init_request: DistServeInitRequest):
    method register_memory_region (line 84) | def register_memory_region(self, register_mr_request: DistServeRegiste...
    method endpoint_info (line 87) | def endpoint_info(self, remote_engine_id: str, protocol: MigrationProt...
    method p2p_connect (line 90) | def p2p_connect(self, remote_engine_id: str, conn_req: DistServeKVTran...
    method p2p_migrate (line 93) | async def p2p_migrate(self, assignment: MigrationAssignment, async_op:...
    method store (line 96) | def store(self, assignment: MigrationAssignment, async_op: bool = False):
    method load (line 99) | def load(self, assignment: MigrationAssignment, async_op: bool = False):

FILE: lmdeploy/pytorch/disagg/backend/mooncake.py
  function get_rdma_nics (line 22) | def get_rdma_nics():
  function get_local_ip_by_remote (line 48) | def get_local_ip_by_remote() -> str:
  class MooncakeMigrationManagement (line 68) | class MooncakeMigrationManagement:
    method __init__ (line 71) | def __init__(self, init_request: DistServeInitRequest):
    method _initialize_p2p (line 100) | def _initialize_p2p(self, init_request: DistServeInitRequest):
    method register_memory_region (line 123) | def register_memory_region(self, register_mr_request: DistServeRegiste...
    method endpoint_info (line 145) | def endpoint_info(self) -> Dict:
    method connect (line 164) | def connect(self, connect_request: DistServeKVTransferEndpointInfo):
    method p2p_migrate (line 178) | async def p2p_migrate(self, assignment: MigrationAssignment, async_op:...
    method _migrate (line 195) | def _migrate(self, assignment: MigrationAssignment):
  class MooncakeBackend (line 236) | class MooncakeBackend(MigrationBackendImpl):
    method __init__ (line 239) | def __init__(self):
    method p2p_initialize (line 242) | def p2p_initialize(self, init_request: DistServeInitRequest):
    method register_memory_region (line 245) | def register_memory_region(self, register_mr_request: DistServeRegiste...
    method endpoint_info (line 248) | def endpoint_info(self, remote_engine_id: int, protocol: MigrationProt...
    method p2p_connect (line 251) | def p2p_connect(self, remote_engine_id: str, connect_request: DistServ...
    method p2p_migrate (line 254) | async def p2p_migrate(self, assignment: MigrationAssignment, async_op:...
    method store (line 257) | def store(self, assignment: MigrationAssignment, async_op: bool = False):
    method load (line 260) | def load(self, assignment: MigrationAssignment, async_op: bool = False):

FILE: lmdeploy/pytorch/disagg/config.py
  class ServingStrategy (line 8) | class ServingStrategy(enum.Enum):
  class EngineRole (line 22) | class EngineRole(enum.Enum):
  class MigrationBackend (line 40) | class MigrationBackend(enum.Enum):
  class RDMALinkType (line 47) | class RDMALinkType(enum.Enum):
  class DistServeRDMAConfig (line 54) | class DistServeRDMAConfig(BaseModel):
  class DistServeTCPConfig (line 72) | class DistServeTCPConfig(BaseModel):
  class DistServeNVLinkConfig (line 76) | class DistServeNVLinkConfig(BaseModel):
  class DistServeEngineConfig (line 80) | class DistServeEngineConfig(BaseModel):
  class MooncakeEngineConfig (line 112) | class MooncakeEngineConfig(DistServeEngineConfig):

FILE: lmdeploy/pytorch/disagg/conn/engine_conn.py
  class EngineP2PConnection (line 24) | class EngineP2PConnection:
    method __init__ (line 26) | def __init__(self, engine: 'Engine'):
    method p2p_initialize (line 34) | def p2p_initialize(self, init_request: DistServeInitRequest):
    method p2p_connect (line 54) | def p2p_connect(self, conn_request: DistServeConnectionRequest):
    method p2p_drop_connect (line 62) | def p2p_drop_connect(self, drop_conn_request: DistServeDropConnectionR...
    method zmq_send (line 67) | async def zmq_send(self, remote_engine_id: str, remote_session_id: int):
    method handle_zmq_recv (line 71) | async def handle_zmq_recv(self, remote_engine_id: str):
    method zmq_disconnect (line 83) | async def zmq_disconnect(self, remote_engine_id: str):

FILE: lmdeploy/pytorch/disagg/conn/protocol.py
  class MigrationProtocol (line 11) | class MigrationProtocol(enum.Enum):
  class DistServeConnectionStatus (line 27) | class DistServeConnectionStatus(enum.Enum):
  class DistServeInitRequest (line 33) | class DistServeInitRequest(BaseModel):
  class DistServeEngineEndpointInfo (line 49) | class DistServeEngineEndpointInfo(BaseModel):
  class DistServeKVTransferEndpointInfo (line 53) | class DistServeKVTransferEndpointInfo(BaseModel):
  class DistServeInitResponse (line 58) | class DistServeInitResponse(BaseModel):
  class DistServeConnectionRequest (line 69) | class DistServeConnectionRequest(BaseModel):
  class DistServeConnectionResponse (line 76) | class DistServeConnectionResponse(BaseModel):
  class MigrationRequest (line 80) | class MigrationRequest(BaseModel):
  class DistServeCacheFreeRequest (line 91) | class DistServeCacheFreeRequest(BaseModel):
  class DistServeDropConnectionRequest (line 96) | class DistServeDropConnectionRequest(BaseModel):

FILE: lmdeploy/pytorch/disagg/conn/proxy_conn.py
  class PDConnectionStatus (line 23) | class PDConnectionStatus(enum.Enum):
  class PDConnectionState (line 29) | class PDConnectionState:
    method __init__ (line 32) | def __init__(self, status: PDConnectionStatus, event: asyncio.Event):
    method wait (line 36) | async def wait(self):
    method set_status (line 39) | def set_status(self, status: PDConnectionStatus):
  function get_server_api (line 43) | def get_server_api(url: str, api: str):
  class PDConnectionPool (line 47) | class PDConnectionPool:
    method __init__ (line 65) | def __init__(self):
    method reg_instance (line 94) | def reg_instance(self, role: EngineRole, endpoint: str):
    method dereg_instance (line 102) | def dereg_instance(self, endpoint: str):
    method shelf_prefill_session (line 115) | def shelf_prefill_session(self, conn_key: Tuple[str, str], session_id:...
    method unshelf_prefill_session (line 118) | def unshelf_prefill_session(self, conn_key: Tuple[str, str], session_i...
    method connect (line 121) | async def connect(self, conn_req: PDConnectionMessage):
    method is_connected (line 261) | def is_connected(self, p_url: str, d_url: str):
    method drop (line 267) | def drop(self, pd_key: Tuple[str, str]):

FILE: lmdeploy/pytorch/disagg/messages.py
  class MigrationExecutionBatch (line 10) | class MigrationExecutionBatch(BaseModel):
  class AssignmentInstruct (line 17) | class AssignmentInstruct(BaseModel):
  class MigrationAssignment (line 25) | class MigrationAssignment(BaseModel):
  class PDConnectionMessage (line 32) | class PDConnectionMessage(BaseModel):
  class DistServeRegisterMRMessage (line 41) | class DistServeRegisterMRMessage(BaseModel):

FILE: lmdeploy/pytorch/distributed.py
  class DistGroup (line 16) | class DistGroup:
    method close (line 25) | def close(self):
  function _build_tp_group_impl (line 39) | def _build_tp_group_impl(tp: int,
  function _build_attn_tp_group (line 89) | def _build_attn_tp_group(context: 'DistContext',
  function _build_mlp_tp_group (line 114) | def _build_mlp_tp_group(context: 'DistContext',
  function _build_moe_tp_group (line 144) | def _build_moe_tp_group(context: 'DistContext',
  function _build_tp_group (line 179) | def _build_tp_group(context: 'DistContext', timeout: timedelta, cpu_back...
  class DistContext (line 188) | class DistContext:
    method _build_ep_group (line 204) | def _build_ep_group(cls, context: 'DistContext', timeout: timedelta, c...
    method build (line 228) | def build(cls, rank: int = 0, dist_config: DistConfig = None, ccl_back...
    method close (line 261) | def close(self):
  class DistManager (line 281) | class DistManager(CtxMgrBase[DistContext]):
    method __init__ (line 284) | def __init__(self):
    method current_config (line 287) | def current_config(self) -> DistConfig:
  function get_dist_manager (line 292) | def get_dist_manager():
  function get_world_rank (line 297) | def get_world_rank():
  function get_tp_world_rank (line 306) | def get_tp_world_rank(layer_type: Optional[str] = None):
  function get_dp_world_rank (line 320) | def get_dp_world_rank():
  function get_ep_world_rank (line 325) | def get_ep_world_rank():
  function _check_group_device (line 330) | def _check_group_device(device: str):
  function get_process_group (line 336) | def get_process_group(device: str = None):
  function get_dist_group (line 341) | def get_dist_group(layer_type: str = 'attn'):
  function get_tp_group (line 355) | def get_tp_group(device: str = 'gpu', layer_type: str = 'attn'):
  function get_group (line 369) | def get_group(group_type: str, device: str):
  function all_reduce (line 379) | def all_reduce(tensor, op=ReduceOp.SUM, group='tp', async_op=False):
  function broadcast (line 386) | def broadcast(tensor, src, group='tp', async_op=False):
  function all_gather_object (line 393) | def all_gather_object(object_list, obj, group='tp'):
  function all_gather (line 399) | def all_gather(tensor_list, tensor, group='tp', async_op=False):
  function all_gather_into_tensor (line 405) | def all_gather_into_tensor(output_tensor, input_tensor, group='tp', asyn...
  function reduce_scatter (line 411) | def reduce_scatter(output, input_list, op=ReduceOp.SUM, group='tp', asyn...
  function gather_by_tp_sizes (line 418) | def gather_by_tp_sizes(x: torch.Tensor,
  function reduce_scatter_by_tp_sizes (line 433) | def reduce_scatter_by_tp_sizes(out: torch.Tensor, rank: int, tp_sizes: L...

FILE: lmdeploy/pytorch/engine/base.py
  class EngineBase (line 6) | class EngineBase:
    method close (line 8) | def close(self) -> None:
    method start_loop (line 12) | def start_loop(self) -> None:
    method end_session (line 15) | def end_session(self, session_id: int):
    method p2p_initialize (line 19) | def p2p_initialize(self, conn_request: DistServeInitRequest):
    method p2p_connect (line 23) | def p2p_connect(self, conn_request: DistServeConnectionRequest):
    method p2p_drop_connect (line 27) | def p2p_drop_connect(self, drop_conn_request: DistServeDropConnectionR...
    method create_instance (line 35) | def create_instance(self, cuda_stream_id=0):
  class EngineInstanceBase (line 40) | class EngineInstanceBase:
    method async_end (line 42) | async def async_end(self, session_id: int):
    method async_cancel (line 46) | async def async_cancel(self, session_id: int):
    method async_stream_infer (line 50) | async def async_stream_infer(self, *args, **kwargs):

FILE: lmdeploy/pytorch/engine/cache_engine.py
  function round_up (line 25) | def round_up(x: int, alignment: int) -> int:
  class CacheDesc (line 31) | class CacheDesc:
    method __post_init__ (line 37) | def __post_init__(self):
  function _get_kv_cache_dtype (line 43) | def _get_kv_cache_dtype(model_config: ModelConfig):
  class CacheEngine (line 54) | class CacheEngine:
    method __init__ (line 67) | def __init__(
    method cpu_cache (line 113) | def cpu_cache(self):
    method gpu_cache (line 118) | def gpu_cache(self):
    method num_gpu_blocks (line 123) | def num_gpu_blocks(self):
    method num_cpu_blocks (line 128) | def num_cpu_blocks(self):
    method _get_key_block_shape_impl (line 133) | def _get_key_block_shape_impl(cls,
    method _get_value_block_shape_impl (line 160) | def _get_value_block_shape_impl(cls,
    method get_k_cache_desc (line 189) | def get_k_cache_desc(cls, model_config: ModelConfig, cache_config: Cac...
    method get_v_cache_desc (line 208) | def get_v_cache_desc(cls, model_config: ModelConfig, cache_config: Cac...
    method get_quant_cache_descs (line 227) | def get_quant_cache_descs(cls, k_cache_desc: CacheDesc, v_cache_desc: ...
    method get_custom_cache_descs (line 241) | def get_custom_cache_descs(cls, model_config: ModelConfig, cache_confi...
    method allocate_caches (line 256) | def allocate_caches(cls, num_blocks: int, model_config: ModelConfig, c...
    method allocate_gpu_cache (line 286) | def allocate_gpu_cache(self):
    method allocate_cpu_cache (line 299) | def allocate_cpu_cache(self):
    method get_custom_cache_shape_impl (line 313) | def get_custom_cache_shape_impl(num_layers: int, num_blocks: int, bloc...
    method _allocate_single_custom_cache (line 318) | def _allocate_single_custom_cache(shape: Sequence[int], dtype: torch.d...
    method allocate_custom_cache (line 322) | def allocate_custom_cache(self, device: str):
    method _swap (line 338) | def _swap(self, src: List[torch.Tensor], dst: List[torch.Tensor], src_...
    method swap_in (line 360) | def swap_in(self, src_to_dst: Dict[int, int]) -> None:
    method swap_out (line 368) | def swap_out(self, src_to_dst: Dict[int, int]) -> None:
    method get_cache_block_size (line 377) | def get_cache_block_size(cls, cache_config: CacheConfig, model_config:...
    method p2p_initialize (line 399) | def p2p_initialize(self, migration_init_request: DistServeInitRequest)...
    method p2p_connect (line 420) | def p2p_connect(self, remote_engine_id: str, migration_conn_request: L...
    method migrate (line 423) | async def migrate(self, migration_execution_inputs: MigrationExecution...
  class StateCacheEngine (line 459) | class StateCacheEngine:
    method __init__ (line 462) | def __init__(self, cache_config: CacheConfig):
    method allocate_caches (line 469) | def allocate_caches(num_caches: int, state_shapes: List[Tuple[Tuple[in...
    method get_cache_state_size (line 495) | def get_cache_state_size(state_shapes: List[Tuple[Tuple[int], torch.dt...
    method state_caches (line 508) | def state_caches(self):
    method init_caches (line 512) | def init_caches(self, idx: torch.Tensor, mask: torch.Tensor):

FILE: lmdeploy/pytorch/engine/config_builder.py
  class ConfigBuilder (line 11) | class ConfigBuilder:
    method update_engine_config (line 14) | def update_engine_config(engine_config: PytorchEngineConfig):
    method build_scheduler_config (line 46) | def build_scheduler_config(engine_config: PytorchEngineConfig):
    method build_cache_config (line 54) | def build_cache_config(engine_config: PytorchEngineConfig):
    method build_backend_config (line 73) | def build_backend_config(engine_config: PytorchEngineConfig):
    method build_dist_config (line 82) | def build_dist_config(engine_config: PytorchEngineConfig):
    method build_misc_config (line 88) | def build_misc_config(engine_config: PytorchEngineConfig):
    method build_specdecode_config (line 94) | def build_specdecode_config(target_model, speculative_config: Speculat...

FILE: lmdeploy/pytorch/engine/engine.py
  class InferOutput (line 35) | class InferOutput:
  function _build_seq_meta (line 57) | def _build_seq_meta(cache_config: CacheConfig, seq_strategy: Any, sampli...
  function response_reqs (line 64) | def response_reqs(req_manager: RequestManager,
  class Engine (line 78) | class Engine(EngineBase):
    method __init__ (line 87) | def __init__(
    method from_pretrained (line 191) | def from_pretrained(cls,
    method _download_adapters (line 232) | def _download_adapters(self, adapters: Dict[str, str], engine_config: ...
    method _build_adapter_manager (line 246) | def _build_adapter_manager(self, adapters):
    method _bind_request_manager (line 249) | def _bind_request_manager(self):
    method _response (line 258) | def _response(self, resp: Response, resp_type: ResponseType, data: Any...
    method _get_max_session_len (line 262) | def _get_max_session_len(self):
    method _on_add_session (line 277) | def _on_add_session(self, reqs: List[Request], **kwargs):
    method _on_stop_session (line 289) | def _on_stop_session(self, reqs: List[Request], **kwargs):
    method _on_end_session (line 308) | def _on_end_session(self, reqs: List[Request], **kwargs):
    method _on_add_message (line 324) | def _on_add_message(self, reqs: List[Request], **kwargs):
    method _add_message (line 362) | def _add_message(self, reqs: List[Request]):
    method model_config (line 416) | def model_config(self) -> ModelConfig:
    method p2p_initialize (line 420) | def p2p_initialize(self, init_request: DistServeInitRequest):
    method p2p_connect (line 423) | def p2p_connect(self, conn_request: DistServeConnectionRequest):
    method p2p_drop_connect (line 426) | def p2p_drop_connect(self, drop_conn_request: DistServeDropConnectionR...
    method _loop_finally (line 429) | def _loop_finally(self):
    method update_params (line 435) | def update_params(self, request: Any):
    method sleep (line 439) | def sleep(self, level: int = 1):
    method wakeup (line 443) | def wakeup(self, tags: Optional[List[str]] = None):
    method async_loop (line 447) | async def async_loop(self):
    method close (line 475) | def close(self):
    method start (line 486) | def start(self):
    method stop (line 493) | def stop(self):
    method wait_tasks (line 498) | async def wait_tasks(self):
    method create_instance (line 511) | def create_instance(self, cuda_stream_id=0):
    method start_loop (line 522) | def start_loop(self):
    method end_session (line 526) | def end_session(self, session_id: int):
    method get_engine_config (line 533) | def get_engine_config(self):
    method get_schedule_metrics (line 536) | def get_schedule_metrics(self):

FILE: lmdeploy/pytorch/engine/engine_checker.py
  class EngineChecker (line 12) | class EngineChecker(BaseChecker):
    method __init__ (line 15) | def __init__(self,
    method check (line 77) | def check(self):
    method _handle_impl (line 100) | def _handle_impl(self):
    method handle (line 103) | def handle(self):

FILE: lmdeploy/pytorch/engine/engine_instance.py
  function _check_resp (line 17) | def _check_resp(resp: Response, state: ResponseType, warning_msg: str = ...
  function _check_resp_success (line 27) | def _check_resp_success(resp: Response, warning_msg: str = None):
  function async_try_add_session (line 32) | async def async_try_add_session(req_sender: RequestSender, session_id: i...
  function async_cancel (line 43) | async def async_cancel(req_sender: RequestSender, session_id: int):
  function try_add_session (line 50) | def try_add_session(req_sender: RequestSender, session_id: int):
  function end (line 61) | def end(req_sender: RequestSender, session_id: int):
  function cancel (line 67) | def cancel(req_sender: RequestSender, session_id: int):
  class EngineInstance (line 75) | class EngineInstance(EngineInstanceBase):
    method __init__ (line 82) | def __init__(self, engine: Engine):
    method __del__ (line 90) | def __del__(self):
    method _get_extra_outputs (line 94) | def _get_extra_outputs(self, resp: Response):
    method _async_try_add_session (line 110) | async def _async_try_add_session(self, session_id: int):
    method _try_add_session (line 118) | def _try_add_session(self, session_id: int):
    method async_stream_infer (line 126) | async def async_stream_infer(self,
    method async_infer (line 211) | async def async_infer(self,
    method stream_infer (line 240) | def stream_infer(self,
    method infer (line 277) | def infer(self,
    method async_end (line 298) | async def async_end(self, session_id: int):
    method end (line 302) | def end(self, session_id: int):
    method async_cancel (line 306) | async def async_cancel(self, session_id: int):
    method cancel (line 310) | def cancel(self, session_id: int):

FILE: lmdeploy/pytorch/engine/engine_loop.py
  class CounterEvent (line 37) | class CounterEvent(asyncio.Event):
    method __init__ (line 39) | def __init__(self):
    method set (line 43) | def set(self):
    method clear (line 49) | def clear(self):
  class RunableEventAsync (line 55) | class RunableEventAsync:
    method __init__ (line 58) | def __init__(self, scheduler: 'Scheduler'):
    method wait (line 62) | async def wait(self):
    method set (line 66) | def set(self):
  function build_runable_event (line 74) | def build_runable_event(scheduler: 'Scheduler'):
  class EngineLoopConfig (line 80) | class EngineLoopConfig:
    method from_engine (line 91) | def from_engine(engine: 'Engine'):
  class EngineLoop (line 106) | class EngineLoop:
    method __init__ (line 109) | def __init__(self,
    method preprocess_loop (line 137) | async def preprocess_loop(self):
    method _log_resps (line 144) | def _log_resps(outputs: List[InferOutput]):
    method _send_resp (line 151) | def _send_resp(self, out: InferOutput):
    method _update_logprobs (line 169) | def _update_logprobs(step_outputs: List[InferOutput]):
    method _send_resps (line 186) | def _send_resps(self, step_outputs: List[InferOutput]):
    method send_response_loop (line 198) | async def send_response_loop(self):
    method _make_infer_outputs (line 212) | def _make_infer_outputs(
    method _main_loop_try_send_next_inputs (line 301) | async def _main_loop_try_send_next_inputs(self):
    method _main_loop_get_outputs (line 310) | async def _main_loop_get_outputs(
    method main_loop (line 332) | async def main_loop(self):
    method update_running_migration (line 365) | def update_running_migration(self, running: 'SeqList', next_token_ids:...
    method _migration_loop_migrate (line 382) | async def _migration_loop_migrate(self, migration_ready: 'SeqList'):
    method _migration_loop_get_outputs (line 410) | async def _migration_loop_get_outputs(self, migration_ready: 'SeqList'):
    method _migration_loop_process_ready (line 431) | async def _migration_loop_process_ready(self, migration_ready: 'SeqLis...
    method migration_loop (line 440) | async def migration_loop(self):
    method start (line 453) | def start(self, event_loop: asyncio.AbstractEventLoop):
    method wait_tasks (line 473) | async def wait_tasks(self):
    method stop (line 494) | def stop(self):
    method cancel (line 503) | def cancel(self):
  function build_engine_loop (line 511) | def build_engine_loop(engine: 'Engine'):

FILE: lmdeploy/pytorch/engine/executor/__init__.py
  function get_distributed_executor_backend (line 12) | def get_distributed_executor_backend(world_size: int, dp: int, device_ty...
  function build_executor (line 56) | def build_executor(

FILE: lmdeploy/pytorch/engine/executor/base.py
  class ExecutorBase (line 16) | class ExecutorBase:
    method __init__ (line 19) | def __init__(self,
    method download_models (line 45) | def download_models(self):
    method build_model (line 49) | def build_model(self):
    method gather_free_mem (line 53) | def gather_free_mem(self):
    method set_cache_config (line 57) | def set_cache_config(self, cache_config: CacheConfig, spec_cache_confi...
    method set_model_config (line 61) | def set_model_config(self, model_config: ModelConfig, spec_model_confi...
    method build_graph_runner (line 65) | def build_graph_runner(self):
    method build_cache_engine (line 69) | def build_cache_engine(self):
    method warmup (line 73) | def warmup(self):
    method sleep (line 77) | async def sleep(self, level: int = 1):
    method wakeup (line 81) | def wakeup(self, tags: Optional[List[str]] = None):
    method update_params (line 85) | def update_params(self, request: Any):
    method get_input_processor (line 89) | def get_input_processor(self):
    method start (line 93) | def start(self, forward_event: asyncio.Event):
    method wait_tasks (line 97) | async def wait_tasks(self):
    method stop (line 101) | def stop(self):
    method release (line 105) | def release(self):
    method forward_async (line 109) | async def forward_async(self, inputs):
    method get_output_async (line 113) | async def get_output_async(self):
    method p2p_initialize (line 119) | def p2p_initialize(self, remote_engine_config: DistServeInitRequest):
    method p2p_connect (line 123) | def p2p_connect(self, conn_request: List[DistServeKVTransferEndpointIn...
    method migrate (line 127) | async def migrate(self, batch: MigrationExecutionBatch):
    method _get_runtime_size (line 133) | def _get_runtime_size(self, num_free_gpu_mem: int, cache_block_size: i...
    method _adjust_block_size (line 148) | def _adjust_block_size(self):
    method _get_state_cache_mem (line 161) | def _get_state_cache_mem(self):
    method update_configs (line 185) | def update_configs(self):
    method init (line 241) | def init(self):
    method remote_log (line 258) | def remote_log(self, msg: str):

FILE: lmdeploy/pytorch/engine/executor/base_worker.py
  class WorkerWrapperBase (line 20) | class WorkerWrapperBase:
    method __init__ (line 23) | def __init__(
    method init_process_group (line 57) | def init_process_group(self, rank: int, master_addr: str = None, maste...
    method pack_output (line 69) | def pack_output(self, output: Dict):
    method get_outputs (line 73) | async def get_outputs(self):
    method build_model (line 77) | def build_model(self):
    method get_free_mem (line 94) | def get_free_mem(self):
    method set_cache_config (line 98) | def set_cache_config(self, cache_config: CacheConfig, spec_cache_confi...
    method set_model_config (line 102) | def set_model_config(self, model_config: ModelConfig, spec_model_confi...
    method build_graph_runner (line 106) | def build_graph_runner(self):
    method build_cache_engine (line 110) | def build_cache_engine(self):
    method update_params (line 114) | def update_params(self, request: Any):
    method warmup (line 118) | def warmup(self):
    method sleep (line 122) | async def sleep(self, level: int = 1):
    method wakeup (line 126) | def wakeup(self, tags: Optional[List[str]] = None):
    method get_input_processor (line 130) | def get_input_processor(self):
    method start (line 134) | def start(self):
    method wait_tasks (line 139) | async def wait_tasks(self):
    method stop (line 152) | def stop(self):
    method stop_async (line 156) | async def stop_async(self):
    method forward_async (line 159) | async def forward_async(self, inputs):
    method get_output_async (line 163) | async def get_output_async(self):
    method release (line 169) | def release(self):
    method p2p_initialize (line 175) | def p2p_initialize(self, init_request: DistServeInitRequest):
    method p2p_connect (line 178) | def p2p_connect(self, remote_engine_id: str, conn_request: List[DistSe...
    method migrate (line 181) | async def migrate(self, inputs: MigrationExecutionBatch):

FILE: lmdeploy/pytorch/engine/executor/dist_utils.py
  function find_available_port (line 11) | def find_available_port() -> bool:
  function setup_master_addr (line 20) | def setup_master_addr(addr: str, port: str):
  function init_dist_environ (line 32) | def init_dist_environ(rank: int, world_size: int):
  function init_process_group (line 38) | def init_process_group(rank: int, world_size: int):

FILE: lmdeploy/pytorch/engine/executor/mp_executor.py
  function get_num_packages (line 37) | def get_num_packages(data_size):
  class Notifier (line 42) | class Notifier:
    method __init__ (line 44) | def __init__(self, num_receiver: int, mp_ctx: SpawnContext):
    method _update_event_id (line 49) | def _update_event_id(self):
    method set (line 52) | def set(self):
    method set_async (line 60) | async def set_async(self):
    method wait (line 71) | def wait(self):
    method wait_async (line 80) | async def wait_async(self):
    method close (line 89) | def close(self):
  class SharedBuffer (line 95) | class SharedBuffer:
    method __init__ (line 98) | def __init__(self, proc_id: int, notifier: Notifier, name: str = None):
    method acquire_buf (line 117) | def acquire_buf(self):
    method name (line 125) | def name(self):
    method pack_data (line 128) | def pack_data(self, data, receiver_mask):
    method send (line 144) | def send(self, data, receiver_mask: int = 0xff):
    method send_async (line 149) | async def send_async(self, data, receiver_mask: int = 0xff):
    method _receive_step0 (line 154) | def _receive_step0(self):
    method _receive_step1 (line 170) | def _receive_step1(self, dumped_data, is_receiver, remain_size):
    method receive (line 185) | def receive(self):
    method receive_async (line 191) | async def receive_async(self):
    method close (line 197) | def close(self):
  class MPExecutor (line 207) | class MPExecutor(ExecutorBase):
    method setup_master_addr (line 211) | def setup_master_addr(cls):
    method __init__ (line 220) | def __init__(self,
    method collective_rpc (line 286) | def collective_rpc(self,
    method collective_rpc_async (line 315) | async def collective_rpc_async(self,
    method download_models (line 343) | def download_models(self):
    method build_model (line 347) | def build_model(self):
    method gather_free_mem (line 351) | def gather_free_mem(self):
    method set_cache_config (line 356) | def set_cache_config(self, cache_config: CacheConfig, spec_cache_confi...
    method set_model_config (line 360) | def set_model_config(self, model_config: ModelConfig, spec_model_confi...
    method build_graph_runner (line 364) | def build_graph_runner(self):
    method build_cache_engine (line 368) | def build_cache_engine(self):
    method warmup (line 372) | def warmup(self):
    method _prefetch_outputs (line 376) | async def _prefetch_outputs(self):
    method start (line 381) | def start(self, forward_event: asyncio.Event):
    method wait_tasks (line 389) | async def wait_tasks(self):
    method forward_async (line 394) | async def forward_async(self, inputs):
    method get_

Condensed preview — 1274 files, each showing path, character count, and a content snippet (8,431K chars).
[
  {
    "path": ".clang-format",
    "chars": 1881,
    "preview": "Language: Cpp\nAccessModifierOffset: -4\nAlignAfterOpenBracket: Align\nAllowShortEnumsOnASingleLine: false\nAlignConsecutive"
  },
  {
    "path": ".claude/skills/check-env/SKILL.md",
    "chars": 1213,
    "preview": "---\nname: check-env\ndescription: Check if the LMDeploy dev environment is properly set up.\n---\n\n# Check LMDeploy Dev Env"
  },
  {
    "path": ".claude/skills/code-navigation/SKILL.md",
    "chars": 2962,
    "preview": "---\nname: code-navigation\ndescription: LMDeploy codebase directory map for fast orientation.\n---\n\n# LMDeploy Project Str"
  },
  {
    "path": ".claude/skills/resolve-review/SKILL.md",
    "chars": 645,
    "preview": "---\nname: resolve-review\ndescription: Fetch and resolve PR review comments, then push fixes.\n---\n\n# Resolve PR Review Co"
  },
  {
    "path": ".claude/skills/submit-pr/SKILL.md",
    "chars": 1078,
    "preview": "---\nname: submit-pr\ndescription: Submit a GitHub pull request for LMDeploy.\n---\n\n# Submit a PR for LMDeploy\n\n## 1. Creat"
  },
  {
    "path": ".claude/skills/support-new-model/SKILL.md",
    "chars": 15198,
    "preview": "---\nname: support-new-model\ndescription: Add a new LLM or VLM to LMDeploy's PyTorch backend.\n---\n\n# Tutorial: Adding a N"
  },
  {
    "path": ".github/CONTRIBUTING.md",
    "chars": 11243,
    "preview": "## Contributing to LMDeploy\n\nWelcome to the LMDeploy community, all kinds of contributions are welcomed, including but n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/1-bug-report.yml",
    "chars": 2031,
    "preview": "name: 🐞 Bug report\ndescription: Create a report to help us reproduce and fix the bug\ntitle: \"[Bug] \"\nlabels: ['Bug']\n\nbo"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/2-feature-request.yml",
    "chars": 1171,
    "preview": "name: 🚀 Feature request\ndescription: Suggest an idea for this project\ntitle: \"[Feature] \"\n\nbody:\n- type: markdown\n  attr"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/3-documentation.yml",
    "chars": 547,
    "preview": "name: 📚 Documentation\ndescription: Report an issue related to the documentation.\nlabels: \"kind/doc,status/unconfirmed\"\nt"
  },
  {
    "path": ".github/pull_request_template.md",
    "chars": 1342,
    "preview": "Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more he"
  },
  {
    "path": ".github/release.yml",
    "chars": 622,
    "preview": "changelog:\n  categories:\n    - title: 🚀 Features\n      labels:\n        - feature\n        - enhancement\n    - title: 💥 Im"
  },
  {
    "path": ".github/scripts/action_tools.py",
    "chars": 11818,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport glob\nimport json\nimport logging\nimport os\nimport shutil\nimport su"
  },
  {
    "path": ".github/scripts/check_lmdeploy.py",
    "chars": 1014,
    "preview": "# Copyright (c) MegFlow. All rights reserved.\nimport glob\nimport os\n\nimport fire\n\n\ndef check_module_init(root: str):\n   "
  },
  {
    "path": ".github/scripts/doc_link_checker.py",
    "chars": 2467,
    "preview": "# Copyright (c) MegFlow. All rights reserved.\n# /bin/python3\n\nimport argparse\nimport os\nimport re\n\n\ndef make_parser():\n "
  },
  {
    "path": ".github/scripts/eval_base_config.py",
    "chars": 10298,
    "preview": "from copy import deepcopy\n\nfrom mmengine.config import read_base\nfrom opencompass.models import TurboMindModel\n\nwith rea"
  },
  {
    "path": ".github/scripts/eval_chat_config.py",
    "chars": 23202,
    "preview": "from copy import deepcopy\n\nfrom mmengine.config import read_base\nfrom opencompass.models import TurboMindModelwithChatTe"
  },
  {
    "path": ".github/scripts/eval_regression_base_models.py",
    "chars": 6565,
    "preview": "from copy import deepcopy\n\nfrom mmengine.config import read_base\n\nwith read_base():\n    # choose a list of datasets\n    "
  },
  {
    "path": ".github/scripts/eval_regression_chat_models.py",
    "chars": 10081,
    "preview": "from copy import deepcopy\n\nfrom mmengine.config import read_base\n\nwith read_base():\n    # choose a list of datasets\n    "
  },
  {
    "path": ".github/scripts/eval_stable_object_config.py",
    "chars": 4101,
    "preview": "from mmengine.config import read_base\nfrom opencompass.models import OpenAISDK\n\nwith read_base():\n    # choose a list of"
  },
  {
    "path": ".github/scripts/eval_stable_subject_config.py",
    "chars": 2391,
    "preview": "from mmengine.config import read_base\nfrom opencompass.models import OpenAISDK\nfrom opencompass.partitioners.sub_naive i"
  },
  {
    "path": ".github/workflows/api_eval.yml",
    "chars": 9246,
    "preview": "name: api_eval\n\non:\n  workflow_dispatch:\n    inputs:\n      repo_org:\n        required: false\n        description: 'Teste"
  },
  {
    "path": ".github/workflows/benchmark.yml",
    "chars": 9090,
    "preview": "name: benchmark_test\n\non:\n  workflow_dispatch:\n    inputs:\n      repo_org:\n        required: false\n        description: "
  },
  {
    "path": ".github/workflows/cuda12.8_whl_release.yml",
    "chars": 3648,
    "preview": "name: cuda12.8-whl-release\n\non:\n  push:\n    tags:\n      - '*'\n  workflow_dispatch:\n\npermissions:\n  contents: write\n\njobs"
  },
  {
    "path": ".github/workflows/daily_ete_test.yml",
    "chars": 44681,
    "preview": "name: daily_ete_test\n\non:\n  workflow_dispatch:\n    inputs:\n      repo_org:\n        required: false\n        description: "
  },
  {
    "path": ".github/workflows/daily_ete_test_3090.yml",
    "chars": 19313,
    "preview": "name: daily_ete_test_3090\n\non:\n  workflow_dispatch:\n    inputs:\n      repo_org:\n        required: false\n        descript"
  },
  {
    "path": ".github/workflows/daily_ete_test_5080.yml",
    "chars": 20315,
    "preview": "name: daily_ete_test_5080\n\non:\n  workflow_dispatch:\n    inputs:\n      repo_org:\n        required: false\n        descript"
  },
  {
    "path": ".github/workflows/docker.yml",
    "chars": 4789,
    "preview": "name: publish-docker\n\non:\n  push:\n    paths-ignore:\n      - \"!.github/workflows/docker.yml\"\n      - \".github/**\"\n      -"
  },
  {
    "path": ".github/workflows/docker_dev.yml",
    "chars": 1527,
    "preview": "name: publish-dev-docker\n\non:\n  workflow_dispatch:\n    inputs:\n      repo_ref:\n        required: false\n        descripti"
  },
  {
    "path": ".github/workflows/evaluate.yml",
    "chars": 7639,
    "preview": "name: evaluate\n\non:\n  workflow_dispatch:\n    inputs:\n      repo_org:\n        required: false\n        description: 'Teste"
  },
  {
    "path": ".github/workflows/lint.yml",
    "chars": 1555,
    "preview": "name: lint\n\non: [push, pull_request]\n\njobs:\n  lint:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout"
  },
  {
    "path": ".github/workflows/linux_x64_gpu.yml",
    "chars": 1657,
    "preview": "name: linux-x64-gpu\non:\n  push:\n    paths:\n      - '.github/workflows/linux_x64_gpu.yml'\n      - 'src/**'\n      - 'CMake"
  },
  {
    "path": ".github/workflows/mllm_api_eval.yml",
    "chars": 9232,
    "preview": "name: mllm_api_eval\n\non:\n  workflow_dispatch:\n    inputs:\n      repo_org:\n        required: false\n        description: '"
  },
  {
    "path": ".github/workflows/pr_ete_test.yml",
    "chars": 10422,
    "preview": "name: pr_ete_test\n\non:\n  pull_request:\n    paths:\n      - \".github/workflows/pr_ete_test.yml\"\n      - \"cmake/**\"\n      -"
  },
  {
    "path": ".github/workflows/pypi.yml",
    "chars": 3365,
    "preview": "name: publish to pypi\n\non:\n  push:\n    branches:\n      - main\n    paths:\n      - \"lmdeploy/version.py\"\n  workflow_dispat"
  },
  {
    "path": ".github/workflows/stable.yml",
    "chars": 9342,
    "preview": "name: stable_test\n\non:\n  workflow_dispatch:\n    inputs:\n      repo_org:\n        required: false\n        description: 'Te"
  },
  {
    "path": ".github/workflows/stale.yml",
    "chars": 1613,
    "preview": "name: 'Close stale issues and PRs'\n\non:\n  schedule:\n    # check issue and pull request once at 01:30 a.m. every day\n    "
  },
  {
    "path": ".github/workflows/test_docker.yml",
    "chars": 4855,
    "preview": "name: test-docker\n\non:\n  push:\n    paths:\n      - 'docker/**'\n      - '.github/workflows/*docker.yml'\n  pull_request:\n  "
  },
  {
    "path": ".github/workflows/unit_test.yml",
    "chars": 1900,
    "preview": "name: unit-test\n\non:\n  pull_request:\n    paths:\n      - \".github/workflows/unit_test.yml\"\n      - \"cmake/**\"\n      - \"sr"
  },
  {
    "path": ".github/workflows/windows_x64_gpu.yml",
    "chars": 1343,
    "preview": "name: windows-x64-gpu\non:\n  push:\n    paths:\n      - '.github/workflows/windows_x64_gpu.yml'\n      - 'src/**'\n      - 'C"
  },
  {
    "path": ".gitignore",
    "chars": 1040,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n.vscode/\n.idea/\n# C extensions\n*.so\n\n# Distrib"
  },
  {
    "path": ".pre-commit-config.yaml",
    "chars": 1926,
    "preview": "repos:\n  - repo: https://github.com/PyCQA/flake8\n    rev: 5.0.4\n    hooks:\n      - id: flake8\n        args: ['--extend-i"
  },
  {
    "path": ".pylintrc",
    "chars": 19041,
    "preview": "[MASTER]\n\n# A comma-separated list of package or module names from where C extensions may\n# be loaded. Extensions are lo"
  },
  {
    "path": "CLAUDE.md",
    "chars": 4818,
    "preview": "# CLAUDE.md\n\nThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.\n\n## "
  },
  {
    "path": "CMakeLists.txt",
    "chars": 12363,
    "preview": "# Copyright (c) 2019-2023, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 "
  },
  {
    "path": "LICENSE",
    "chars": 11379,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "MANIFEST.in",
    "chars": 131,
    "preview": "\ninclude lmdeploy/lib/*.so\ninclude lmdeploy/lib/*.so*\ninclude lmdeploy/lib/*.dll\ninclude lmdeploy/lib/*.pyd\ninclude lmde"
  },
  {
    "path": "README.md",
    "chars": 15402,
    "preview": "<div align=\"center\">\n  <img src=\"docs/en/_static/image/lmdeploy-logo.svg\" width=\"450\"/>\n\n[![PyPI](https://img.shields.io"
  },
  {
    "path": "README_ja.md",
    "chars": 11157,
    "preview": "<div align=\"center\">\n  <img src=\"docs/en/_static/image/lmdeploy-logo.svg\" width=\"450\"/>\n\n[![PyPI](https://img.shields.io"
  },
  {
    "path": "README_zh-CN.md",
    "chars": 12571,
    "preview": "<div align=\"center\">\n  <img src=\"docs/en/_static/image/lmdeploy-logo.svg\" width=\"450\"/>\n\n[![PyPI](https://img.shields.io"
  },
  {
    "path": "autotest/benchmark/test_apiserver_performance.py",
    "chars": 4293,
    "preview": "import pytest\nfrom utils.benchmark_utils import restful_test\nfrom utils.config_utils import get_func_config_list\n\n\ndef g"
  },
  {
    "path": "autotest/benchmark/test_longtext_performance.py",
    "chars": 3454,
    "preview": "import pytest\nfrom utils.benchmark_utils import longtext_throughput_test\nfrom utils.config_utils import get_func_config_"
  },
  {
    "path": "autotest/benchmark/test_mllm_apiserver_performance.py",
    "chars": 3433,
    "preview": "import pytest\nfrom utils.benchmark_utils import restful_test\nfrom utils.config_utils import get_func_config_list\n\n\ndef g"
  },
  {
    "path": "autotest/benchmark/test_prefixcache_performance.py",
    "chars": 4062,
    "preview": "import pytest\nfrom utils.benchmark_utils import prefixcache_throughput_test\nfrom utils.config_utils import get_func_conf"
  },
  {
    "path": "autotest/benchmark/test_throughput_performance.py",
    "chars": 4867,
    "preview": "import pytest\nfrom utils.benchmark_utils import throughput_test\nfrom utils.config_utils import get_func_config_list, get"
  },
  {
    "path": "autotest/chat_prompt_case.yml",
    "chars": 1345,
    "preview": "base_testcase:\n    - 乌鲁木齐的景点A brief introduction to Urumqi’s attractions:\n        - contain:\n            - urumqi\n      "
  },
  {
    "path": "autotest/config.yml",
    "chars": 6961,
    "preview": "model_path: /nvme/qa_test_models\nresource_path: /nvme/qa_test_models/resource\nlog_path: /nvme/qa_test_models/autotest_lo"
  },
  {
    "path": "autotest/config_3090.yml",
    "chars": 3199,
    "preview": "model_path: /nvme/qa_test_models\nresource_path: /nvme/qa_test_models/resource\nlog_path: /nvme/qa_test_models/autotest_lo"
  },
  {
    "path": "autotest/config_3090_legacy.yml",
    "chars": 3427,
    "preview": "model_path: /nvme/qa_test_models\nresource_path: /nvme/qa_test_models/resource\nlog_path: /nvme/qa_test_models/autotest_lo"
  },
  {
    "path": "autotest/config_5080.yml",
    "chars": 2473,
    "preview": "model_path: /nvme/qa_test_models\nresource_path: /nvme/qa_test_models/resource\nlog_path: /nvme/qa_test_models/autotest_lo"
  },
  {
    "path": "autotest/config_5080_legacy.yml",
    "chars": 2549,
    "preview": "model_path: /nvme/qa_test_models\nresource_path: /nvme/qa_test_models/resource\nlog_path: /nvme/qa_test_models/autotest_lo"
  },
  {
    "path": "autotest/config_ascend.yml",
    "chars": 2948,
    "preview": "model_path: /mnt/vc-intern-delivery/qa-llm-cicd/qa_test_models\nresource_path: /mnt/vc-intern-delivery/qa-llm-cicd/resour"
  },
  {
    "path": "autotest/config_h.yml",
    "chars": 6311,
    "preview": "model_path: /mnt/shared-storage-user/llmrazor-share/qa-llm-cicd/cicd-autotest/eval_resource/model\nresource_path: /mnt/sh"
  },
  {
    "path": "autotest/config_h800.yml",
    "chars": 3993,
    "preview": "model_path: /nvme/qa_test_models\nresource_path: /nvme/qa_test_models/resource\nlog_path: /nvme/qa_test_models/autotest_mo"
  },
  {
    "path": "autotest/config_h_legacy.yml",
    "chars": 2018,
    "preview": "model_path: /mnt/shared-storage-user/llmrazor-share/qa-llm-cicd/cicd-autotest/eval_resource/model\nresource_path: /mnt/sh"
  },
  {
    "path": "autotest/config_legacy.yml",
    "chars": 5953,
    "preview": "model_path: /nvme/qa_test_models\nresource_path: /nvme/qa_test_models/resource\nlog_path: /nvme/qa_test_models/autotest_lo"
  },
  {
    "path": "autotest/config_test.yml",
    "chars": 4830,
    "preview": "model_path: /nvme/qa_test_models\nresource_path: /nvme/qa_test_models/resource\nlog_path: /nvme/qa_test_models/autotest_mo"
  },
  {
    "path": "autotest/config_testascend.yml",
    "chars": 794,
    "preview": "model_path: /nvme/qa_test_models\nresource_path: /nvme/qa_test_models/resource\nlog_path: /nvme/qa_test_models/autotest_mo"
  },
  {
    "path": "autotest/conftest.py",
    "chars": 2588,
    "preview": "import os\n\nimport pytest\nimport yaml\nfrom utils.config_utils import get_config\nfrom utils.constant import DEFAULT_SERVER"
  },
  {
    "path": "autotest/evaluate/eval_config_chat.py",
    "chars": 5495,
    "preview": "# flake8: noqa\n\nfrom mmengine.config import read_base\nfrom opencompass.models import OpenAISDK\nfrom opencompass.partitio"
  },
  {
    "path": "autotest/evaluate/test_api_evaluate.py",
    "chars": 16900,
    "preview": "import os\nimport time\n\nimport pytest\nimport utils.constant as constant\nfrom utils.config_utils import get_case_str_by_co"
  },
  {
    "path": "autotest/evaluate/test_mllm_api_evaluate.py",
    "chars": 9115,
    "preview": "import os\n\nimport pytest\nimport utils.constant as constant\nfrom utils.config_utils import get_case_str_by_config, get_fu"
  },
  {
    "path": "autotest/interface/pipeline/test_pipeline_func.py",
    "chars": 33085,
    "preview": "import multiprocessing as mp\n\nimport pydantic\nimport pytest\nfrom utils.config_utils import set_device_env_variable, unse"
  },
  {
    "path": "autotest/interface/pipeline/test_pipeline_longtext_func.py",
    "chars": 8397,
    "preview": "import multiprocessing as mp\nimport os\n\nimport numpy as np\nimport pytest\nfrom utils.config_utils import set_device_env_v"
  },
  {
    "path": "autotest/interface/restful/test_restful_chat_completions_v1.py",
    "chars": 65969,
    "preview": "from typing import Literal\n\nimport pytest\nfrom openai import OpenAI\nfrom utils.constant import BACKEND_LIST, RESTFUL_MOD"
  },
  {
    "path": "autotest/interface/restful/test_restful_completions_v1.py",
    "chars": 9602,
    "preview": "import pytest\nfrom utils.constant import BACKEND_LIST, RESTFUL_BASE_MODEL_LIST\nfrom utils.restful_return_check import as"
  },
  {
    "path": "autotest/interface/restful/test_restful_generate.py",
    "chars": 47042,
    "preview": "import json\nimport os\nimport re\nimport time\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\nfrom datetim"
  },
  {
    "path": "autotest/prompt_case.yml",
    "chars": 1721,
    "preview": "identity:\n    - 你好,你叫什么名字#hi, what's your name:\nmemory_test:\n    - 简要介绍乌鲁木齐的景点#A brief introduction to Urumqi’s attracti"
  },
  {
    "path": "autotest/pytest.ini",
    "chars": 232,
    "preview": "[pytest]\npython_files = test*_*.py  # test file\npython_classes = Test*     # test class\npython_functions = test_*  # tes"
  },
  {
    "path": "autotest/template.json",
    "chars": 61,
    "preview": "{\n    \"model_name\": \"base\",\n    \"capability\": \"completion\"\n}\n"
  },
  {
    "path": "autotest/toolchain/test_lagent.py",
    "chars": 1089,
    "preview": "import pytest\n\n\n@pytest.mark.order(10)\n@pytest.mark.lagent\n@pytest.mark.flaky(reruns=2)\n@pytest.mark.parametrize('model'"
  },
  {
    "path": "autotest/tools/chat/test_command_chat_hf_pytorch.py",
    "chars": 4613,
    "preview": "import pytest\nfrom tools.common_case_config import (MODELSCOPE_CONFIG, PYTORCH_LORA_TEST_LLM_GPU1, PYTORCH_LORA_TEST_LLM"
  },
  {
    "path": "autotest/tools/chat/test_command_chat_hf_turbomind.py",
    "chars": 4038,
    "preview": "import pytest\nfrom tools.common_case_config import (MODELSCOPE_CONFIG, TURBOMIND_FALLBACK_TEST_LLM_GPU1,\n               "
  },
  {
    "path": "autotest/tools/common_case_config.py",
    "chars": 8685,
    "preview": "TURBOMIND_PR_TEST_LLM_GPU2 = [{\n    'model': 'Qwen/Qwen3-30B-A3B',\n    'backend': 'turbomind',\n    'communicator': 'cuda"
  },
  {
    "path": "autotest/tools/pipeline/llm_case.py",
    "chars": 3287,
    "preview": "import json\nimport os\n\nimport fire\nimport yaml\n\nfrom lmdeploy import GenerationConfig, PytorchEngineConfig, TurbomindEng"
  },
  {
    "path": "autotest/tools/pipeline/mllm_case.py",
    "chars": 16968,
    "preview": "import json\n\nimport fire\nimport numpy as np\nfrom PIL import Image\n\nfrom lmdeploy import GenerationConfig, PytorchEngineC"
  },
  {
    "path": "autotest/tools/pipeline/test_pipeline_chat_pytorch_llm.py",
    "chars": 4811,
    "preview": "import pytest\nfrom tools.common_case_config import (MODELSCOPE_CONFIG, PYTORCH_LORA_TEST_LLM_GPU1, PYTORCH_LORA_TEST_LLM"
  },
  {
    "path": "autotest/tools/pipeline/test_pipeline_chat_pytorch_mllm.py",
    "chars": 1328,
    "preview": "import pytest\nfrom utils.config_utils import get_func_config_list\nfrom utils.pipeline_chat import run_pipeline_mllm_test"
  },
  {
    "path": "autotest/tools/pipeline/test_pipeline_chat_turbomind_llm.py",
    "chars": 4218,
    "preview": "import pytest\nfrom tools.common_case_config import (MODELSCOPE_CONFIG, TURBOMIND_FALLBACK_TEST_LLM_GPU1,\n               "
  },
  {
    "path": "autotest/tools/pipeline/test_pipeline_chat_turbomind_mllm.py",
    "chars": 2408,
    "preview": "import pytest\nfrom tools.common_case_config import (TURBOMIND_FALLBACK_TEST_MLLM_GPU1, TURBOMIND_PR_TEST_MLLM_GPU1,\n    "
  },
  {
    "path": "autotest/tools/quantization/test_quantization_awq.py",
    "chars": 1862,
    "preview": "import os\n\nimport allure\nimport pytest\nfrom utils.config_utils import get_cuda_prefix_by_workerid, get_quantization_mode"
  },
  {
    "path": "autotest/tools/quantization/test_quantization_w8a8.py",
    "chars": 1079,
    "preview": "import os\n\nimport allure\nimport pytest\nfrom utils.config_utils import get_cuda_prefix_by_workerid, get_quantization_mode"
  },
  {
    "path": "autotest/tools/restful/test_restful_chat_hf_pytorch_llm.py",
    "chars": 10569,
    "preview": "import time\n\nimport pytest\nfrom tools.common_case_config import (MODELSCOPE_CONFIG, PYTORCH_LORA_TEST_LLM_GPU1, PYTORCH_"
  },
  {
    "path": "autotest/tools/restful/test_restful_chat_hf_pytorch_mllm.py",
    "chars": 1342,
    "preview": "import pytest\nfrom utils.config_utils import get_func_config_list\nfrom utils.run_restful_chat import run_mllm_test\n\nBACK"
  },
  {
    "path": "autotest/tools/restful/test_restful_chat_hf_turbomind_llm.py",
    "chars": 6480,
    "preview": "import pytest\nfrom tools.common_case_config import (MODELSCOPE_CONFIG, REASONING_TEST_LLM, TOOLCALL_TEST_LLM,\n          "
  },
  {
    "path": "autotest/tools/restful/test_restful_chat_hf_turbomind_mllm.py",
    "chars": 1657,
    "preview": "import pytest\nfrom tools.common_case_config import TURBOMIND_FALLBACK_TEST_MLLM_GPU1\nfrom utils.config_utils import get_"
  },
  {
    "path": "autotest/utils/benchmark_utils.py",
    "chars": 12030,
    "preview": "import os\nimport time\n\nimport allure\nimport utils.constant as constant\nfrom utils.common_utils import execute_command_wi"
  },
  {
    "path": "autotest/utils/common_utils.py",
    "chars": 2062,
    "preview": "import os\nimport subprocess\nimport sys\n\n\ndef execute_command_with_logging(cmd,\n                                 log_file"
  },
  {
    "path": "autotest/utils/config_utils.py",
    "chars": 41879,
    "preview": "import copy\nimport os\nfrom collections import OrderedDict\nfrom typing import Any\n\nimport yaml\n\nfrom lmdeploy.utils impor"
  },
  {
    "path": "autotest/utils/constant.py",
    "chars": 4449,
    "preview": "import os\n\nDEFAULT_PORT = 23333\nDEFAULT_SERVER = os.getenv('MASTER_ADDR', '127.0.0.1')\nPROXY_PORT = 8000\n\nEVAL_CONFIGS ="
  },
  {
    "path": "autotest/utils/evaluate_utils.py",
    "chars": 12974,
    "preview": "import csv\nimport glob\nimport json\nimport os\nimport subprocess\nimport time\n\nimport allure\nimport pandas as pd\nfrom mmeng"
  },
  {
    "path": "autotest/utils/get_run_config.py",
    "chars": 2093,
    "preview": "from lmdeploy.model import MODELS\n\n\n# Deprecated function\ndef get_model_name(model):\n    model_names = ['llama', 'llama2"
  },
  {
    "path": "autotest/utils/mp_log_utils.py",
    "chars": 1095,
    "preview": "import os\n\nimport allure\nfrom pytest_assume.plugin import assume\n\n\ndef write_log(config, result, msg, is_new: bool = Tru"
  },
  {
    "path": "autotest/utils/pipeline_chat.py",
    "chars": 19376,
    "preview": "import json\nimport os\nimport shutil\nimport time\n\nimport allure\nfrom pytest_assume.plugin import assume\nfrom utils.common"
  },
  {
    "path": "autotest/utils/proxy_distributed_utils.py",
    "chars": 11981,
    "preview": "import os\nimport random\nimport socket\nimport subprocess\nimport time\nfrom typing import Any\n\nimport requests\nfrom utils.c"
  },
  {
    "path": "autotest/utils/quantization_utils.py",
    "chars": 2965,
    "preview": "import os\nimport subprocess\nfrom subprocess import PIPE\n\n\ndef quantization(config,\n                 quantization_model_n"
  },
  {
    "path": "autotest/utils/ray_distributed_utils.py",
    "chars": 12345,
    "preview": "import os\nimport random\nimport socket\nimport subprocess\nimport time\nfrom time import time as time_time\nfrom typing impor"
  },
  {
    "path": "autotest/utils/restful_return_check.py",
    "chars": 5553,
    "preview": "import re\n\n\ndef assert_chat_completions_batch_return(output, model_name, check_logprobs: bool = False, logprobs_num: int"
  },
  {
    "path": "autotest/utils/rule_condition_assert.py",
    "chars": 2103,
    "preview": "def assert_result(input, rule_condition, model_name: str = None):\n    input = input.replace('\\n', '\\\\n')\n    input_lower"
  },
  {
    "path": "autotest/utils/run_client_chat.py",
    "chars": 5220,
    "preview": "import os\nimport time\nfrom subprocess import PIPE, Popen\n\nimport allure\nfrom utils.config_utils import get_case_str_by_c"
  },
  {
    "path": "autotest/utils/run_restful_chat.py",
    "chars": 32828,
    "preview": "import json\nimport os\nimport subprocess\nimport time\n\nimport allure\nimport psutil\nimport requests\nfrom openai import Open"
  },
  {
    "path": "autotest/utils/toolkit.py",
    "chars": 1071,
    "preview": "from functools import lru_cache\n\nfrom transformers import AutoTokenizer\n\n\ndef parse_sse_stream(content: str) -> list[str"
  },
  {
    "path": "benchmark/README.md",
    "chars": 828,
    "preview": "# Benchmark\n\nWe provide several profiling tools to benchmark our models.\n\n## profile with dataset\n\nDownload the dataset "
  },
  {
    "path": "benchmark/benchmark_decode.py",
    "chars": 2784,
    "preview": "import json\nimport pickle\nimport time\nfrom pathlib import Path\n\nimport fire\nimport numpy as np\nfrom transformers import "
  },
  {
    "path": "benchmark/benchmark_pipeline.py",
    "chars": 3181,
    "preview": "import os\nimport subprocess\nfrom typing import Dict, List\n\nimport fire\nimport yaml\n\n\ndef get_cmd(model_path, backend, en"
  },
  {
    "path": "benchmark/benchmark_serving.py",
    "chars": 8911,
    "preview": "import os\nimport subprocess\nimport time\nfrom typing import Dict, List, Optional, Tuple\n\nimport fire\nimport yaml\n\n\ndef ge"
  },
  {
    "path": "benchmark/benchmark_throughput.py",
    "chars": 3140,
    "preview": "import os\nimport subprocess\nfrom typing import Dict, List\n\nimport fire\nimport yaml\n\n\ndef get_cmd(model_path, backend, en"
  },
  {
    "path": "benchmark/lmdeploy.yml",
    "chars": 1551,
    "preview": "num_promts: &num_prompts 10000\ndataset_path: &dataset_path \"/nvme1/shared/ShareGPT_V3_unfiltered_cleaned_split.json\"\ndat"
  },
  {
    "path": "benchmark/profile_pipeline_api.py",
    "chars": 15120,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport argparse\nimport json\nimport os\nimport random\nfrom typing import L"
  },
  {
    "path": "benchmark/profile_restful_api.py",
    "chars": 57249,
    "preview": "# Modify from https://github.com/sgl-project/sglang/blob/main/python/sglang/bench_serving.py  # noqa\n# Adapted from http"
  },
  {
    "path": "benchmark/profile_throughput.py",
    "chars": 18156,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\nimport argparse\nimport asyncio\nimport json\nimport os\nimport random\nfrom "
  },
  {
    "path": "builder/manywheel/Dockerfile_2014",
    "chars": 1908,
    "preview": "# WARNING: CentOS 7 is out of date since 6/30/2024, we should use the following one in the future\n# FROM quay.io/pypa/ma"
  },
  {
    "path": "builder/manywheel/README.md",
    "chars": 526,
    "preview": "# LMDeploy Build System\n\n## Building lmdeploy builder images\n\nTo build all lmdeploy builder images, such as \"lmdeploy-bu"
  },
  {
    "path": "builder/manywheel/build_all_lmdeploy_builders.sh",
    "chars": 244,
    "preview": "#!/usr/bin/env bash\n\nset -eou pipefail\n\nTOPDIR=$(git rev-parse --show-toplevel)/builder\n\nfor cuda_version in 12.4 12.6 1"
  },
  {
    "path": "builder/manywheel/build_all_wheel.sh",
    "chars": 465,
    "preview": "#!/usr/bin/env bash\n\nset -eou pipefail\n\nTOPDIR=$(git rev-parse --show-toplevel)/builder\n\nCUDA_VER=${CUDA_VER:-12.8}\n\nPLA"
  },
  {
    "path": "builder/manywheel/build_lmdeploy_builder.sh",
    "chars": 1312,
    "preview": "#!/usr/bin/env bash\n\nset -eou pipefail\n\nTOPDIR=$(git rev-parse --show-toplevel)/builder\nGPU_ARCH_VERSION=${GPU_ARCH_VERS"
  },
  {
    "path": "builder/manywheel/build_wheel.sh",
    "chars": 702,
    "preview": "#!/usr/bin/env bash\nset -eux\n\nPYTHON_VERSION=\"$1\"\nPLAT_NAME=\"$2\"\nDOCKER_TAG=\"$3\"\nOUTPUT_DIR=\"$4\"\n\nDOCKER_IMAGE=\"openmmla"
  },
  {
    "path": "builder/manywheel/entrypoint_build.sh",
    "chars": 670,
    "preview": "#!/usr/bin/env bash\nset -eux\n\nexport PYTHON_VERSION=$PYTHON_VERSION\nexport PLAT_NAME=$PLAT_NAME\nexport USERID=${USERID}\n"
  },
  {
    "path": "builder/manywheel/scripts/install_conda.sh",
    "chars": 239,
    "preview": "#!/bin/bash\n\nset -ex\n\nwget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh\nchmod +x  Miniconda3"
  },
  {
    "path": "builder/manywheel/scripts/install_cuda.sh",
    "chars": 5215,
    "preview": "#!/bin/bash\n\nset -ex\n\nfunction install_118 {\n    echo \"Installing CUDA 11.8 and NCCL 2.15\"\n    rm -rf /usr/local/cuda-11"
  },
  {
    "path": "builder/manywheel/scripts/install_openmpi.sh",
    "chars": 213,
    "preview": "#!/bin/bash\n\nset -ex\n\nwget -q https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz\ntar xf openmpi-4."
  },
  {
    "path": "builder/windows/README.md",
    "chars": 336,
    "preview": "# Build lmdeploy on windows\n\n## Requirements\n\n- [CMake 3.17+](https://github.com/Kitware/CMake/releases)\n- [Visual Studi"
  },
  {
    "path": "builder/windows/generate.ps1",
    "chars": 226,
    "preview": "cmake .. -A x64 -T \"v143,cuda=$env:CUDA_PATH\" `\n    -DCMAKE_BUILD_TYPE=Release `\n    -DCMAKE_INSTALL_PREFIX=install `\n  "
  },
  {
    "path": "builder/windows/setup_cuda.ps1",
    "chars": 4834,
    "preview": "# Copyright (c) OpenMMLab. All rights reserved.\n# Adapted from https://github.com/thewh1teagle/vibe/blob/5d7b75568ca65ab"
  },
  {
    "path": "cmake/Modules/FindNCCL.cmake",
    "chars": 7172,
    "preview": "# Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.\n#\n# From PyTorch:\n#\n# Copyright (c) 2016-     Facebo"
  },
  {
    "path": "cmake/TritonTurboMindBackendConfig.cmake.in",
    "chars": 1965,
    "preview": "# Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.\n#\n# Redistribution and use in source and binary form"
  },
  {
    "path": "cmake/TurboMindConfig.cmake.in",
    "chars": 1858,
    "preview": "# Copyright (c) 2021-2023, NVIDIA CORPORATION. All rights reserved.\n#\n# Redistribution and use in source and binary form"
  },
  {
    "path": "cmake/yaml-cpp_cmake_policy.patch",
    "chars": 391,
    "preview": "diff --git a/CMakeLists.txt b/CMakeLists.txt\nindex 46dc180..b746ac1 100644\n--- a/CMakeLists.txt\n+++ b/CMakeLists.txt\n@@ "
  },
  {
    "path": "debug.sh",
    "chars": 426,
    "preview": "#!/bin/bash -e\n\nbuilder=\"-G Ninja\"\n\nif [ \"$1\" == \"make\" ]; then\n    builder=\"\"\nfi\n\ncmake ${builder} .. \\\n    -DCMAKE_BUI"
  },
  {
    "path": "docker/Dockerfile",
    "chars": 2492,
    "preview": "# Base images\nARG IMAGE_TYPE=final\nARG CUDA_VERSION=cu12\n\nFROM nvidia/cuda:13.0.2-devel-ubuntu22.04 AS cu13\nENV CUDA_VER"
  },
  {
    "path": "docker/Dockerfile.jetson",
    "chars": 1720,
    "preview": "# Base images\nFROM nvcr.io/nvidia/l4t-base:r36.2.0\nENV CUDA_VER=12.6 \\\n    PYTHON_VERSION=3.10 \\\n    PATH=/opt/py3/bin:/"
  },
  {
    "path": "docker/Dockerfile_ascend_a2_300i",
    "chars": 1409,
    "preview": "# DOCKER_BUILDKIT=1 docker build --build-arg ASCEND_DEVICE_TYPE=ascend_a2 \\\n#     --build-arg DLINFER_TAG=main --build-a"
  },
  {
    "path": "docker/Dockerfile_ascend_a3",
    "chars": 1310,
    "preview": "# DOCKER_BUILDKIT=1 docker build --build-arg ASCEND_DEVICE=ascend_a3 \\\n#     --build-arg DLINFER_TAG=main --build-arg LM"
  },
  {
    "path": "docker/Dockerfile_dev",
    "chars": 1957,
    "preview": "FROM nvidia/cuda:12.8.1-devel-ubuntu22.04 AS cu12.8\n\n# environment variables\nENV DEBIAN_FRONTEND=noninteractive \\\n    TZ"
  },
  {
    "path": "docker/InternVL_Dockerfile",
    "chars": 429,
    "preview": "ARG CUDA_VERSION=cu12\n\nFROM openmmlab/lmdeploy:latest-cu12 AS cu12\nENV CUDA_VERSION_SHORT=cu123\n\nFROM openmmlab/lmdeploy"
  },
  {
    "path": "docker/Qwen2VL_Dockerfile",
    "chars": 413,
    "preview": "ARG CUDA_VERSION=cu12\n\nFROM openmmlab/lmdeploy:latest-cu12 AS cu12\nENV CUDA_VERSION_SHORT=cu123\n\nFROM openmmlab/lmdeploy"
  },
  {
    "path": "docker/build.sh",
    "chars": 195,
    "preview": "#!/bin/bash -ex\n\nmkdir -p /wheels\n\nif [[ \"${CUDA_VERSION_SHORT}\" = \"cu130\" ]]; then\n    pip install nvidia-nccl-cu13\nels"
  },
  {
    "path": "docker/install.sh",
    "chars": 2796,
    "preview": "#!/bin/bash -ex\n\n# Skip system setup if virtual env already exists (e.g., in dev image)\nif [ ! -f \"/opt/py3/bin/python\" "
  },
  {
    "path": "docker/prepare_wheel.sh",
    "chars": 1823,
    "preview": "#!/bin/bash -ex\n\nexport PATH=/opt/py3/bin:$PATH\n\npip install \"cmake<4.0\" wheel ninja setuptools packaging\n\nif [[ ${PYTHO"
  },
  {
    "path": "docs/en/.readthedocs.yaml",
    "chars": 237,
    "preview": "version: 2\n\nformats: all\n\nbuild:\n  os: \"ubuntu-22.04\"\n  tools:\n    python: \"3.10\"\n\n\nsphinx:\n  configuration: docs/en/con"
  },
  {
    "path": "docs/en/Makefile",
    "chars": 581,
    "preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS    =\nSPHI"
  },
  {
    "path": "docs/en/_static/css/readthedocs.css",
    "chars": 121,
    "preview": "table.autosummary td {\n  width: 50%\n}\n\nimg.align-center {\n  display: block;\n  margin-left: auto;\n  margin-right: auto;\n}"
  },
  {
    "path": "docs/en/advance/chat_template.md",
    "chars": 3293,
    "preview": "# Customized chat template\n\nThe effect of the applied chat template can be observed by **setting log level** `INFO`.\n\nLM"
  },
  {
    "path": "docs/en/advance/context_parallel.md",
    "chars": 1350,
    "preview": "# Context Parallel\n\nWhen the memory on a single GPU is insufficient to deploy a model, it is often deployed using tensor"
  },
  {
    "path": "docs/en/advance/debug_turbomind.md",
    "chars": 4287,
    "preview": "# How to debug Turbomind\n\nTurbomind is implemented in C++, which is not as easy to debug as Python. This document provid"
  },
  {
    "path": "docs/en/advance/long_context.md",
    "chars": 4692,
    "preview": "# Context length extrapolation\n\nLong text extrapolation refers to the ability of LLM to handle data longer than the trai"
  },
  {
    "path": "docs/en/advance/metrics.md",
    "chars": 4732,
    "preview": "# Production Metrics\n\nLMDeploy exposes a set of metrics via Prometheus, and provides visualization via Grafana.\n\n## Setu"
  },
  {
    "path": "docs/en/advance/pytorch_multinodes.md",
    "chars": 2677,
    "preview": "# PyTorchEngine Multi-Node Deployment Guide\n\nTo support larger-scale model deployment requirements, PyTorchEngine provid"
  },
  {
    "path": "docs/en/advance/pytorch_multithread.md",
    "chars": 2291,
    "preview": "# PyTorchEngine Multithread\n\nWe have removed `thread_safe` mode from PytorchEngine since [PR2907](https://github.com/Int"
  },
  {
    "path": "docs/en/advance/pytorch_new_model.md",
    "chars": 6842,
    "preview": "# lmdeploy.pytorch New Model Support\n\nlmdeploy.pytorch is designed to simplify the support for new models and the develo"
  },
  {
    "path": "docs/en/advance/pytorch_profiling.md",
    "chars": 1723,
    "preview": "# PyTorchEngine Profiling\n\nWe provide multiple profiler to analysis the performance of PyTorchEngine.\n\n## PyTorch Profil"
  },
  {
    "path": "docs/en/advance/spec_decoding.md",
    "chars": 2671,
    "preview": "# Speculative Decoding\n\nSpeculative decoding is an optimization technique that introcude a lightweight draft model to pr"
  },
  {
    "path": "docs/en/advance/structed_output.md",
    "chars": 3120,
    "preview": "# Structured output\n\nStructured output, also known as guided decoding, forces the model to generate text that exactly ma"
  },
  {
    "path": "docs/en/advance/update_weights.md",
    "chars": 3064,
    "preview": "# Update Weights\n\nLMDeploy supports update model weights online for scenes such as RL training. Here are the steps to do"
  },
  {
    "path": "docs/en/api/cli.rst",
    "chars": 133,
    "preview": "Command-line Tools\n===================\n\n.. sphinx_argparse_cli::\n   :module: lmdeploy.cli\n   :func: run\n   :hook:\n   :pr"
  },
  {
    "path": "docs/en/api/openapi.rst",
    "chars": 298,
    "preview": "OpenAPI Endpoints\n==================\n.. currentmodule:: lmdeploy\n\nOpenAI Compatible API Endpoints\n----------------------"
  },
  {
    "path": "docs/en/api/pipeline.rst",
    "chars": 428,
    "preview": "Inference pipeline\n==================\n.. currentmodule:: lmdeploy\n\nPipeline\n--------\n.. autofunction:: pipeline\n.. autoc"
  },
  {
    "path": "docs/en/benchmark/a100_fp16.md",
    "chars": 5388,
    "preview": "# TurboMind Benchmark on A100\n\nAll the following results are tested on A100-80G(x8) CUDA 11.8.\n\nThe tested lmdeploy vers"
  },
  {
    "path": "docs/en/benchmark/benchmark.md",
    "chars": 1659,
    "preview": "# Benchmark\n\nPlease install the lmdeploy precompiled package and download the script and the test dataset:\n\n```shell\npip"
  },
  {
    "path": "docs/en/benchmark/evaluate_with_opencompass.md",
    "chars": 4498,
    "preview": "# Model Evaluation Guide\n\nThis document describes how to evaluate a model's capabilities on academic datasets using Open"
  },
  {
    "path": "docs/en/benchmark/evaluate_with_vlmevalkit.md",
    "chars": 1946,
    "preview": "# Multi-Modal Model Evaluation Guide\n\nThis document describes how to evaluate multi-modal models' capabilities using VLM"
  },
  {
    "path": "docs/en/conf.py",
    "chars": 9141,
    "preview": "#\n# Configuration file for the Sphinx documentation builder.\n#\n# This file does only contain a selection of the most com"
  },
  {
    "path": "docs/en/faq.md",
    "chars": 4706,
    "preview": "# FAQ\n\n## ModuleNotFoundError\n\n### No module named 'mmengine.config.lazy'\n\nThere is probably a cached mmengine in your l"
  },
  {
    "path": "docs/en/get_started/ascend/get_started.md",
    "chars": 4459,
    "preview": "# Get Started with Huawei Ascend\n\nWe currently support running lmdeploy on **Atlas 800T A3, Atlas 800T A2 and Atlas 300I"
  },
  {
    "path": "docs/en/get_started/camb/get_started.md",
    "chars": 3130,
    "preview": "# Cambricon\n\nThe usage of lmdeploy on a Cambricon device is almost the same as its usage on CUDA with PytorchEngine in l"
  },
  {
    "path": "docs/en/get_started/get_started.md",
    "chars": 8088,
    "preview": "# Quick Start\n\nThis tutorial shows the usage of LMDeploy on CUDA platform:\n\n- Offline inference of LLM model and VLM mod"
  },
  {
    "path": "docs/en/get_started/index.rst",
    "chars": 176,
    "preview": "On Other Platforms\n=================================\n\n.. toctree::\n   :maxdepth: 1\n   :caption: OtherPF\n\n   ascend/get_s"
  },
  {
    "path": "docs/en/get_started/installation.md",
    "chars": 2980,
    "preview": "# Installation\n\nLMDeploy is a python library for compressing, deploying, and serving Large Language Models(LLMs) and Vis"
  },
  {
    "path": "docs/en/get_started/maca/get_started.md",
    "chars": 2865,
    "preview": "# MetaX-tech\n\nThe usage of lmdeploy on a MetaX-tech device is almost the same as its usage on CUDA with PytorchEngine in"
  },
  {
    "path": "docs/en/index.rst",
    "chars": 3768,
    "preview": "Welcome to LMDeploy's tutorials!\n====================================\n\n.. figure:: ./_static/image/lmdeploy-logo.svg\n  :"
  },
  {
    "path": "docs/en/inference/load_hf.md",
    "chars": 1535,
    "preview": "# Load huggingface model directly\n\nStarting from v0.1.0, Turbomind adds the ability to pre-process the model parameters "
  },
  {
    "path": "docs/en/inference/pytorch.md",
    "chars": 4998,
    "preview": "# Architecture of lmdeploy.pytorch\n\n`lmdeploy.pytorch` is an inference engine in LMDeploy that offers a developer-friend"
  },
  {
    "path": "docs/en/inference/turbomind.md",
    "chars": 5623,
    "preview": "# Architecture of TurboMind\n\nTurboMind is an inference engine that supports high throughput inference for conversational"
  },
  {
    "path": "docs/en/inference/turbomind_config.md",
    "chars": 9024,
    "preview": "# TurboMind Config\n\nTurboMind is one of the inference engines of LMDeploy. When using it to do model inference, you need"
  },
  {
    "path": "docs/en/llm/api_server.md",
    "chars": 9908,
    "preview": "# OpenAI Compatible Server\n\nThis article primarily discusses the deployment of a single LLM model across multiple GPUs o"
  },
  {
    "path": "docs/en/llm/api_server_lora.md",
    "chars": 2836,
    "preview": "# Serving LoRA\n\n## Launch LoRA\n\nLoRA is currently only supported by the PyTorch backend. Its deployment process is simil"
  },
  {
    "path": "docs/en/llm/api_server_reasoning.md",
    "chars": 3821,
    "preview": "# Reasoning Outputs\n\nFor models that support reasoning capabilities, such as [DeepSeek R1](https://huggingface.co/deepse"
  },
  {
    "path": "docs/en/llm/api_server_tools.md",
    "chars": 13763,
    "preview": "# Tools Calling\n\nLMDeploy supports tools for InternLM2, InternLM2.5, llama3.1 and Qwen2.5 models. Please use `--tool-cal"
  },
  {
    "path": "docs/en/llm/codellama.md",
    "chars": 5786,
    "preview": "# codellama\n\n## Introduction\n\n[codellama](https://github.com/facebookresearch/codellama) features enhanced coding capabi"
  },
  {
    "path": "docs/en/llm/pipeline.md",
    "chars": 8221,
    "preview": "# Offline Inference Pipeline\n\nIn this tutorial, We will present a list of examples to introduce the usage of `lmdeploy.p"
  },
  {
    "path": "docs/en/llm/proxy_server.md",
    "chars": 3964,
    "preview": "# Request Distributor Server\n\nThe request distributor service can parallelize multiple api_server services. Users only n"
  },
  {
    "path": "docs/en/make.bat",
    "chars": 752,
    "preview": "@ECHO OFF\n\npushd %~dp0\n\nREM Command file for Sphinx documentation\n\nif \"%SPHINXBUILD%\" == \"\" (\n\tset SPHINXBUILD=sphinx-bu"
  },
  {
    "path": "docs/en/multi_modal/api_server_vl.md",
    "chars": 5678,
    "preview": "# OpenAI Compatible Server\n\nThis article primarily discusses the deployment of a single large vision language model acro"
  },
  {
    "path": "docs/en/multi_modal/cogvlm.md",
    "chars": 1562,
    "preview": "# CogVLM\n\n## Introduction\n\nCogVLM is a powerful open-source visual language model (VLM). LMDeploy supports CogVLM-17B mo"
  },
  {
    "path": "docs/en/multi_modal/deepseek_vl2.md",
    "chars": 1932,
    "preview": "# DeepSeek-VL2\n\n## Introduction\n\nDeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Mode"
  },
  {
    "path": "docs/en/multi_modal/gemma3.md",
    "chars": 1631,
    "preview": "# Gemma3\n\n## Introduction\n\nGemma is a family of lightweight, state-of-the-art open models from Google, built from the sa"
  },
  {
    "path": "docs/en/multi_modal/index.rst",
    "chars": 271,
    "preview": "Vision-Language Models\n=================================\n\n.. toctree::\n   :maxdepth: 2\n   :caption: Examples\n\n   deepsee"
  },
  {
    "path": "docs/en/multi_modal/internvl.md",
    "chars": 8517,
    "preview": "# InternVL\n\nLMDeploy supports the following InternVL series of models, which are detailed in the table below:\n\n|        "
  },
  {
    "path": "docs/en/multi_modal/llava.md",
    "chars": 5124,
    "preview": "# LLaVA\n\nLMDeploy supports the following llava series of models, which are detailed in the table below:\n\n|              "
  },
  {
    "path": "docs/en/multi_modal/minicpmv.md",
    "chars": 6545,
    "preview": "# MiniCPM-V\n\nLMDeploy supports the following MiniCPM-V series of models, which are detailed in the table below:\n\n|      "
  }
]

// ... and 1074 more files (download for full content)

About this extraction

This page contains the full source code of the InternLM/lmdeploy GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 1274 files (7.7 MB), approximately 2.1M tokens, and a symbol index with 7894 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
