Full Code of liguodongiot/llm-action for AI

Repository: liguodongiot/llm-action
Branch: main
Commit: f97e03bc1ecb
Files: 822
Total size: 6.2 MB

Directory structure:
gitextract_50r359j2/
├── .gitignore
├── LICENSE
├── README.md
├── ai-compiler/
│   ├── README.md
│   ├── Treebeard/
│   │   └── README.md
│   ├── treelit/
│   │   ├── README.md
│   │   └── xgb.md
│   └── triton-lang/
│       └── README.md
├── ai-framework/
│   ├── README.md
│   ├── TensorRT-Model-Optimizer.md
│   ├── cuda/
│   │   └── README.md
│   ├── deepspeed/
│   │   ├── 1.DeepSpeed入门.md
│   │   ├── 2.安装DeepSpeed.md
│   │   ├── 3.基于CIFAR-10使用DeepSpeed进行分布式训练 .md
│   │   ├── DeepSpeed配置JSON文件.md
│   │   ├── README.md
│   │   ├── config-json/
│   │   │   ├── README.md
│   │   │   └── deepspeed-nvme.md
│   │   ├── deepspeed-slurm.md
│   │   ├── hello_bert/
│   │   │   ├── README.md
│   │   │   ├── train_bert.py
│   │   │   └── train_bert_ds.py
│   │   └── training/
│   │       └── pipeline_parallelism/
│   │           └── README.md
│   ├── dlrover.md
│   ├── huggingface-accelerate/
│   │   └── README.md
│   ├── huggingface-peft/
│   │   └── README.md
│   ├── huggingface-transformers/
│   │   ├── API.md
│   │   ├── FSDP.md
│   │   └── README.md
│   ├── huggingface-trl/
│   │   └── README.md
│   ├── jax/
│   │   ├── README.md
│   │   └── reference.md
│   ├── llama-cpp/
│   │   └── README.md
│   ├── megatron-deepspeed/
│   │   └── README.md
│   ├── megatron-lm/
│   │   └── README.md
│   ├── mxnet/
│   │   ├── README.md
│   │   ├── mnist.py
│   │   ├── mxnet_cnn_mnist.py
│   │   ├── mxnet_mlp_mnist.py
│   │   ├── oneflow_cnn_mnist.py
│   │   ├── oneflow_mlp_mnist.py
│   │   └── reference.md
│   ├── oneflow/
│   │   ├── README.md
│   │   ├── oneflow_mlp_mnist.py
│   │   └── reference.md
│   ├── openai-triton/
│   │   └── README.md
│   ├── paddlepaddle/
│   │   ├── README.md
│   │   └── reference.md
│   ├── pai-megatron-patch/
│   │   └── README.md
│   ├── pai-torchacc.md
│   ├── pytorch/
│   │   ├── README.md
│   │   ├── install.md
│   │   └── reference.md
│   ├── tensorflow/
│   │   ├── README.md
│   │   └── reference.md
│   ├── transformer-engine/
│   │   └── mnist/
│   │       ├── README.md
│   │       ├── main.py
│   │       └── main_stat.py
│   └── unsloth-微调.md
├── ai-infra/
│   ├── ai-cluster/
│   │   └── README.md
│   ├── ai-hardware/
│   │   ├── AI芯片软件生态.md
│   │   ├── CUDA.md
│   │   ├── GPU-network.md
│   │   ├── GPU相关环节变量.md
│   │   ├── NIXL.md
│   │   ├── OEM-DGX.md
│   │   ├── README.md
│   │   ├── TSMC-台积电.md
│   │   ├── cuda镜像.md
│   │   ├── gpudirect.md
│   │   └── 硬件对比.md
│   ├── communication.md
│   ├── 存储/
│   │   ├── README.md
│   │   ├── REF.md
│   │   ├── nvme-ssd.md
│   │   ├── 固态硬盘.md
│   │   └── 存储.md
│   ├── 算力/
│   │   ├── AI芯片.md
│   │   ├── GPU工作原理.md
│   │   ├── NVIDIA-GPU型号.md
│   │   ├── 推理芯片.md
│   │   └── 昇腾NPU.md
│   └── 网络/
│       ├── HPC性能测试.md
│       ├── IB-docker.md
│       ├── IB流量监控.md
│       ├── IB软件.md
│       ├── InfiniBand.md
│       ├── NCCL.md
│       ├── README.md
│       ├── REF.md
│       ├── Spine-Leaf和InfiniBand网络架构区别简述.md
│       ├── nccl-test-集合通讯的性能测试.md
│       ├── nvbandwidth.md
│       ├── roce.md
│       ├── 网络硬件.md
│       ├── 通信软件.md
│       └── 集合通信原语.md
├── blog/
│   ├── TODO.md
│   ├── ai-infra/
│   │   ├── AI 集群基础设施 InfiniBand 详解.md
│   │   └── AI 集群基础设施 NVMe SSD 详解.md
│   ├── distribution-parallelism/
│   │   ├── 大模型分布式训练并行技术（一）-概述.md
│   │   ├── 大模型分布式训练并行技术（九）-总结.md
│   │   └── 大模型分布式训练并行技术（六）-多维混合并行.md
│   ├── llm-algo/
│   │   ├── moe.md
│   │   └── 大白话Transformer架构.md
│   ├── llm-compression/
│   │   ├── 大模型量化技术原理-ZeroQuant系列.md
│   │   └── 大模型量化技术原理：QoQ量化及QServe推理服务系统.md
│   ├── llm-inference/
│   │   └── 大模型推理框架概述.md
│   ├── llm-localization/
│   │   ├── 大模型国产化适配1-华为昇腾AI全栈软硬件平台总结.md
│   │   └── 大模型国产化适配4-基于昇腾910使用LLaMA-13B进行多机多卡训练.md
│   ├── llm-peft/
│   │   ├── 大模型参数高效微调技术原理综述（一）-背景、参数高效微调简介.md
│   │   └── 大模型参数高效微调技术原理综述（五）-LoRA、AdaLoRA、QLoRA.md
│   └── reference/
│       └── 高性能 LLM 推理框架的设计与实现.md
├── docs/
│   ├── README.md
│   ├── conda.md
│   ├── flash-attention/
│   │   └── FlashAttention.md
│   ├── llm-base/
│   │   ├── FLOPS.md
│   │   ├── NVIDIA-Nsight-Systems性能分析.md
│   │   ├── README.md
│   │   ├── a800-env-install.md
│   │   ├── ai-algo.md
│   │   ├── autoregressive-lm-decoding-methods.md
│   │   ├── dcgmi.md
│   │   ├── distribution-parallelism/
│   │   │   ├── README.md
│   │   │   ├── auto-parallel/
│   │   │   │   ├── Alpa.md
│   │   │   │   ├── Flexflow.md
│   │   │   │   ├── Galvatron.md
│   │   │   │   ├── Mesh-Tensorflow.md
│   │   │   │   ├── README.md
│   │   │   │   ├── Unity.md
│   │   │   │   ├── auto-parallel.md
│   │   │   │   ├── gspmd.md
│   │   │   │   ├── 分布式训练自动并行概述.md
│   │   │   │   └── 飞桨面向异构场景下的自动并行设计与实践.md
│   │   │   ├── data-parallelism/
│   │   │   │   └── README.md
│   │   │   ├── moe-parallel/
│   │   │   │   ├── README.md
│   │   │   │   ├── moe-framework.md
│   │   │   │   ├── moe-parallel.md
│   │   │   │   └── paddle_moe.py
│   │   │   ├── multidimensional-hybrid-parallel/
│   │   │   │   └── README.md
│   │   │   ├── pipeline-parallelism/
│   │   │   │   └── README.md
│   │   │   ├── tensor-parallel/
│   │   │   │   ├── README.md
│   │   │   │   └── tensor-parallel.md
│   │   │   └── 并行技术.drawio
│   │   ├── distribution-training/
│   │   │   ├── Bloom-176B训练经验.md
│   │   │   ├── FP16-BF16.md
│   │   │   ├── GLM-130B训练经验.md
│   │   │   ├── OPT-175B训练经验.md
│   │   │   ├── README.md
│   │   │   └── 自动混合精度.md
│   │   ├── gpu-env-var.md
│   │   ├── h800-env-install.md
│   │   ├── monitor.md
│   │   ├── multimodal/
│   │   │   └── sora.md
│   │   ├── nvidia-smi-dmon.md
│   │   ├── nvidia-smi.md
│   │   ├── rlhf/
│   │   │   └── README.md
│   │   ├── scenes/
│   │   │   ├── README.md
│   │   │   ├── cv/
│   │   │   │   ├── README.md
│   │   │   │   ├── paddle/
│   │   │   │   │   └── README.md
│   │   │   │   ├── pytorch/
│   │   │   │   │   └── README.md
│   │   │   │   └── reference.md
│   │   │   └── multi-modal/
│   │   │       ├── README.md
│   │   │       └── reference.md
│   │   ├── singularity命令.md
│   │   ├── slurm.md
│   │   ├── 分布式训练加速技术.md
│   │   ├── 多机RDMA性能测试.txt
│   │   └── 机器学习中常用的数据类型.md
│   ├── llm-experience.md
│   ├── llm-inference/
│   │   ├── DeepSpeed-Inference.md
│   │   ├── KV-Cache.md
│   │   ├── LLM服务框架对比.md
│   │   ├── README.md
│   │   ├── blog.md
│   │   ├── flexflow/
│   │   │   └── 投机采样.md
│   │   ├── llm推理优化技术.md
│   │   ├── llm推理框架.md
│   │   └── vllm.md
│   ├── llm-peft/
│   │   ├── LoRA-FA.md
│   │   ├── MAM_Adapter.md
│   │   ├── README.md
│   │   └── ReLoRA.md
│   ├── llm-summarize/
│   │   ├── README.md
│   │   ├── distribution_dl_roadmap.md
│   │   ├── 大模型实践总结-20230930.md
│   │   ├── 大模型实践总结.md
│   │   ├── 文档大模型.md
│   │   ├── 金融大模型.md
│   │   └── 领域大模型.md
│   └── transformer内存估算.md
├── faq/
│   └── FAQ.md
├── git-pull-push.sh
├── llm-algo/
│   ├── FLOPs.md
│   ├── InternLM-20B.md
│   ├── README.md
│   ├── baichuan2/
│   │   └── baichuan.md
│   ├── bert/
│   │   └── 模型架构.md
│   ├── bert.md
│   ├── bloom/
│   │   └── README.md
│   ├── bloom.md
│   ├── chatglm/
│   │   ├── README.md
│   │   └── 模型架构.md
│   ├── chatglm2/
│   │   ├── README.md
│   │   └── 模型架构.md
│   ├── chatglm3/
│   │   ├── README.md
│   │   └── reference.md
│   ├── chatgpt/
│   │   └── README.md
│   ├── deepseek/
│   │   ├── DeepSeek-R1.md
│   │   ├── DeepSeek-V2.md
│   │   ├── DeepSeek-V3.md
│   │   └── README.md
│   ├── glm-130b/
│   │   └── README.md
│   ├── glm4.md
│   ├── gpt/
│   │   └── README.md
│   ├── gpt2/
│   │   ├── README.md
│   │   ├── hf_modeling_gpt2.py
│   │   └── 模型架构.md
│   ├── gpt3/
│   │   └── README.md
│   ├── llama/
│   │   ├── README.md
│   │   └── 模型架构.md
│   ├── llama.md
│   ├── mixtral/
│   │   └── README.md
│   ├── mlp.md
│   ├── moe/
│   │   └── README.md
│   ├── qwen/
│   │   ├── README.md
│   │   └── 参数说明及函数说明.md
│   ├── qwen2.md
│   ├── t5/
│   │   └── README.md
│   ├── transformer/
│   │   ├── README.md 
│   │   ├── Transformer中FFN的记忆功能.md
│   │   └── 模型架构.md
│   ├── transformer.md
│   ├── 基本概念.md
│   ├── 旋转编码RoPE.md
│   ├── 模型架构类图.drawio
│   └── 训练范式.md
├── llm-alignment/
│   ├── DPO.md
│   ├── README.md
│   ├── RLHF.md
│   └── 基本概念.md
├── llm-application/
│   ├── Higress.md
│   ├── README.md
│   ├── agent/
│   │   ├── OpenClaw.md
│   │   └── OpenCode/
│   │       └── README.md
│   ├── embbedding-model.md
│   ├── gradio/
│   │   └── README.md
│   ├── langchain/
│   │   ├── README.md
│   │   ├── serve.py
│   │   └── tutorials/
│   │       ├── client.py
│   │       └── serve.py
│   ├── one-api.md
│   ├── pre-post-handle/
│   │   └── README.md
│   ├── rag/
│   │   ├── README.md
│   │   ├── embedding.md
│   │   ├── 存在的一些问题.md
│   │   └── 方案.md
│   ├── vector-db/
│   │   ├── README.md
│   │   └── reference.md
│   └── 应用场景.md
├── llm-compression/
│   ├── PaddleSlim/
│   │   ├──  quantization.md
│   │   └── README.md
│   ├── README.md
│   ├── distillation/
│   │   ├── GKD.md
│   │   ├── MINILLM.md
│   │   ├── README.md
│   │   ├── SCOTT.md
│   │   └── 大模型蒸馏概述.md
│   ├── gptqmodel/
│   │   └── README.md
│   ├── llm-compressor/
│   │   ├── README.md
│   │   ├── source-code.md
│   │   ├── 剪枝.md
│   │   └── 量化方案.md
│   ├── quantization/
│   │   ├── FP6-LLM.md
│   │   ├── GPTQ.md
│   │   ├── LLM-int8.md
│   │   ├── PEQA.md
│   │   ├── QQQ-W4A8.md
│   │   ├── README.md
│   │   ├── SmoothQuant.md
│   │   ├── SpinQuant.md
│   │   ├── ZeroQuant(4+2).md
│   │   ├── ZeroQuant.md
│   │   ├── fp4.md
│   │   ├── fp6.md
│   │   ├── fp8.md
│   │   ├── kv-cache-quant.md
│   │   ├── llm-qat/
│   │   │   ├── LLM-QAT.md
│   │   │   ├── README.md
│   │   │   ├── cfd70ff/
│   │   │   │   ├── README.md
│   │   │   │   ├── generate_data.py
│   │   │   │   ├── inference.py
│   │   │   │   ├── merge_gen_data.py
│   │   │   │   ├── pip.conf
│   │   │   │   ├── run_train.sh
│   │   │   │   ├── train.py
│   │   │   │   └── utils.py
│   │   │   ├── f4d873a/
│   │   │   │   ├── datautils.py
│   │   │   │   ├── run_train.sh
│   │   │   │   └── train.py
│   │   │   └── log.md
│   │   ├── moe模型量化.md
│   │   ├── tools.md
│   │   ├── 可视化/
│   │   │   ├── README.md
│   │   │   ├── qwen_activate_visual.ipynb
│   │   │   └── qwen_visual.ipynb
│   │   ├── 大模型量化概述.md
│   │   └── 量化基础.md
│   ├── sparsity/
│   │   └── README.md
│   ├── tools.md
│   ├── 大模型压缩综述.md
│   └── 经验.md
├── llm-data-engineering/
│   ├── README.md
│   ├── dataset/
│   │   ├── README.md
│   │   ├── baichuan2.md
│   │   ├── chinese-corpus-all.md
│   │   └── english-corpus-all.md
│   ├── reference.md
│   └── sft-dataset/
│       ├── baichuan2_test.py
│       ├── evol-instruct.md
│       ├── firefly-template.py
│       ├── jinja-demo.py
│       ├── jinja-llm-baichuan.py
│       ├── jinja-llm-baichuan2.py
│       ├── jinja-llm-bloom.py
│       ├── jinja-llm-chatglm3.py
│       ├── jinja-llm.py
│       ├── jinja.md
│       ├── 数据格式设计.md
│       └── 数据集格式.md
├── llm-eval/
│   ├── EvalScope.md
│   ├── README.md
│   ├── eval-data/
│   │   ├── longtext_L115433-question.txt
│   │   ├── longtext_L115433.txt
│   │   ├── longtext_L32503_answer.txt
│   │   ├── longtext_L32503_question.txt
│   │   ├── longtext_L64031.txt
│   │   └── longtext_L64031_question.txt
│   ├── llm-performance/
│   │   ├── AI芯片性能.md
│   │   ├── README.md
│   │   ├── hardware-performance/
│   │   │   ├── gpu-monitor-ui.py
│   │   │   └── pynvml-stat-memory.py
│   │   ├── llmperf.md
│   │   ├── mindie/
│   │   │   ├── lantency/
│   │   │   │   ├── README.md
│   │   │   │   ├── perfermance-stat.py
│   │   │   │   ├── performance-stream-baichuan2.py
│   │   │   │   ├── performance-stream-chatglm3.py
│   │   │   │   ├── performance-stream-qwen1.5.py
│   │   │   │   ├── performance-stream-qwen1.py
│   │   │   │   ├── performance-stream.py
│   │   │   │   └── stat_input_token.py
│   │   │   └── locust-lantency-throughput/
│   │   │       ├── README.md
│   │   │       ├── hello.py
│   │   │       ├── llm-910b4-baichuan2-7b-2tp.py
│   │   │       ├── llm-910b4-chatglm3-6b-2tp.py
│   │   │       ├── llm-910b4-qwen-72b-8tp.py
│   │   │       ├── llm-910b4-qwen1.5-4tp.py
│   │   │       ├── qwen1.5-72b-8tp.html
│   │   │       └── 示例.py
│   │   ├── perfetto.md
│   │   ├── stat_gpu_memory.py
│   │   ├── tgi-benchmark.md
│   │   ├── vllm/
│   │   │   ├── README.md
│   │   │   ├── vllm-locust-qwen1.5-7b-long.py
│   │   │   └── vllm-performance-stream-qwen1.5-long.py
│   │   ├── vllm-benchmark.md
│   │   ├── wrk-性能测试工具.md
│   │   ├── 大模型场景下训练和推理性能指标名词解释.md
│   │   ├── 推理性能测试.md
│   │   └── 训练性能测试.md
│   ├── llm-precision/
│   │   ├── C-Eval.md
│   │   ├── README.md
│   │   └── 模型质量评估.md
│   ├── opencompass.md
│   └── 大模型测评集.md
├── llm-inference/
│   ├── DeepSpeed-Inference.md
│   ├── Flash-Decoding.md
│   ├── FlashInfer.md
│   ├── FlexFlow-Serve.md
│   ├── GuidedGeneration.md
│   ├── KV-Cache优化.md
│   ├── Mooncake.md
│   ├── NanoFlow.md
│   ├── PD分离.md
│   ├── README.md
│   ├── RTP-LLM.md
│   ├── ascend/
│   │   └── mindformers/
│   │       ├── README.md
│   │       ├── baichuan2/
│   │       │   ├── README.md
│   │       │   ├── baichuan-inference.py
│   │       │   └── baichuan-stat.py
│   │       ├── chatglm3/
│   │       │   ├── README.md
│   │       │   ├── chatglm-gen.py
│   │       │   ├── chatglm-inference.py
│   │       │   └── chatglm-stat.py
│   │       ├── mindsporelite-inference.py
│   │       ├── mindsporelite-stat.py
│   │       └── text_generator_infer.py
│   ├── chatgpt.md
│   ├── deepspeed-mii/
│   │   └── README.md
│   ├── faster-transformer/
│   │   ├── README.md
│   │   ├── bloom/
│   │   │   ├── README.md
│   │   │   └── firefly_lambada_1w_stat_token.py
│   │   ├── gpt/
│   │   │   └── README.md
│   │   ├── llama/
│   │   │   └── README.md
│   │   └── megatron-gpt2/
│   │       ├── gpt_summarization.py
│   │       ├── gpt_summarization_stat.py
│   │       └── megatron-gpt2-fp8.md
│   ├── flexflow-serve/
│   │   └── benchmark-batch1.py
│   ├── huggingface-tgi/
│   │   └── README.md
│   ├── huggingface-transformer/
│   │   └── README.md
│   ├── lightllm/
│   │   └── README.md
│   ├── lmdeploy/
│   │   ├── README.md
│   │   ├── 功能.md
│   │   └── 服务启动参数.md
│   ├── native-model/
│   │   └── chatglm3-6b/
│   │       └── cli_demo.py
│   ├── offload.md
│   ├── openai.md
│   ├── sglang/
│   │   ├── README.md
│   │   ├── source-code.md
│   │   ├── 服务器启动参数.md
│   │   └── 项目代码结构.md
│   ├── tensorrt/
│   │   ├── README.md
│   │   └── install.md
│   ├── tensorrt-llm/
│   │   ├── FP8.md
│   │   ├── Memory Usage of TensorRT-LLM.md
│   │   ├── README.md
│   │   ├── TRT-LLM引擎构建参数.md
│   │   ├── Triton服务启动参数.md
│   │   └── 安装.md
│   ├── triton/
│   │   ├── REAEME.md
│   │   ├── onnx/
│   │   │   └── README.md
│   │   └── resnet50/
│   │       ├── client.py
│   │       ├── config.pbtxt
│   │       ├── labels.txt
│   │       └── resnet50_convert_torchscript.py
│   ├── vllm/
│   │   ├── FAQ.md
│   │   ├── FP8.md
│   │   ├── README.md
│   │   ├── REF.md
│   │   ├── api_client.py
│   │   ├── cmd.md
│   │   ├── vllm.md
│   │   ├── 服务启动参数.md
│   │   ├── 源码.md
│   │   ├── 请求处理流程.md
│   │   └── 长文本推理.md
│   ├── web/
│   │   ├── fastapi/
│   │   │   ├── README.md
│   │   │   └── llm-qwen-mindspore-lite.py
│   │   ├── flask/
│   │   │   ├── README.md
│   │   │   └── llm-qwen-mindspore-lite.py
│   │   └── sanic/
│   │       └── README.md
│   ├── xinference/
│   │   └── README.md
│   ├── 分离式推理架构.md
│   ├── 大模型推理张量并行.md
│   └── 解码策略.md
├── llm-interview/
│   ├── README.md
│   ├── base.md
│   ├── comprehensive.md
│   ├── llm-algo.md
│   ├── llm-app.md
│   ├── llm-compress.md
│   ├── llm-eval.md
│   ├── llm-ft.md
│   ├── llm-inference.md
│   ├── llm-rlhf.md
│   └── llm-train.md
├── llm-localization/
│   ├── README.md
│   ├── ascend/
│   │   ├── FAQ.md
│   │   ├── README.md
│   │   ├── ascend-c/
│   │   │   └── README.md
│   │   ├── ascend-infra/
│   │   │   ├── HCCL.md
│   │   │   ├── MacOS环境.md
│   │   │   ├── ascend-dmi.md
│   │   │   ├── ascend-docker-runtime.md
│   │   │   ├── ascend-docker.md
│   │   │   ├── ascend-llm下载.md
│   │   │   ├── ascend-npu-smi.md
│   │   │   ├── docker环境升级cann.md
│   │   │   ├── network.md
│   │   │   ├── npu监控.md
│   │   │   ├── 操作系统.md
│   │   │   ├── 昇腾卡-soc版本.md
│   │   │   ├── 昇腾卡注意事项.md
│   │   │   ├── 昇腾镜像.md
│   │   │   ├── 服务器配置.md
│   │   │   ├── 环境安装.md
│   │   │   └── 达芬奇架构.md
│   │   ├── ascend910-env-install.md
│   │   ├── fabric-insight/
│   │   │   └── README.md
│   │   ├── firefly-ascend.md
│   │   ├── mindformers/
│   │   │   ├── README.md
│   │   │   ├── baichuan2/
│   │   │   │   ├── baichuan2训练.md
│   │   │   │   ├── run_baichuan2_7b.yaml
│   │   │   │   ├── run_baichuan2_7b_910b.yaml
│   │   │   │   └── run_baichuan2_7b_lora_910b.yaml
│   │   │   ├── chatglm/
│   │   │   │   ├── README.md
│   │   │   │   ├── chat_glm.py
│   │   │   │   ├── glm_6b.yaml
│   │   │   │   ├── glm_6b_chat.yaml
│   │   │   │   ├── merge_ckpt.py
│   │   │   │   ├── merge_ckpt_lora.py
│   │   │   │   ├── pt2ms.py
│   │   │   │   ├── run_glm_6b_finetune.yaml
│   │   │   │   ├── run_glm_6b_infer.yaml
│   │   │   │   ├── run_glm_6b_lora.yaml
│   │   │   │   └── run_glm_6b_lora_infer.yaml
│   │   │   ├── env.md
│   │   │   ├── llama/
│   │   │   │   └── README.md
│   │   │   ├── qwen/
│   │   │   │   ├── qwen1训练.md
│   │   │   │   ├── run_qwen_7b.yaml
│   │   │   │   └── run_qwen_7b_910b.yaml
│   │   │   ├── qwen1.5/
│   │   │   │   ├── qwen1.5训练.md
│   │   │   │   ├── run_qwen1_5_7b_finetune.yaml
│   │   │   │   └── run_qwen1_5_7b_infer.yaml
│   │   │   ├── trick.md
│   │   │   └── 权重格式转换.md
│   │   ├── mindie/
│   │   │   ├── 2.0.RC2/
│   │   │   │   └── qwen.md
│   │   │   ├── README.md
│   │   │   ├── config/
│   │   │   │   ├── chatglm3-6b.json
│   │   │   │   ├── qwen-72b.json
│   │   │   │   └── run.sh
│   │   │   ├── config-1.0.RC1.json
│   │   │   ├── docker/
│   │   │   │   ├── README.md
│   │   │   │   ├── TEST.md
│   │   │   │   ├── baichuan2-13b.json
│   │   │   │   ├── baichuan2-7b.json
│   │   │   │   ├── deploy.sh
│   │   │   │   ├── install_and_enable_cann.sh
│   │   │   │   ├── llm-server.sh
│   │   │   │   ├── mindie-1.0.Dockerfile
│   │   │   │   ├── mindie-all-1.0.Dockerfile
│   │   │   │   ├── mindie-env-1.0.Dockerfile
│   │   │   │   ├── qwen-72b.json
│   │   │   │   ├── qwen1.5-14b.json
│   │   │   │   ├── qwen1.5-72b.json
│   │   │   │   └── qwen1.5-7b.json
│   │   │   ├── llm-server.sh
│   │   │   ├── mindid-1.0-offical.md
│   │   │   ├── mindid-performance.md
│   │   │   ├── mindie-1.0.Dockerfile
│   │   │   ├── mindie-1.0.RC2.md
│   │   │   ├── mindie-1.0.md
│   │   │   ├── mindie-1.0.rc2-config.json
│   │   │   ├── mindie-1.0.rc2-llm-server.sh
│   │   │   ├── mindie-2.0.rc2.md
│   │   │   ├── mindie-20240411.md
│   │   │   ├── mindie-api.md
│   │   │   ├── model-test.md
│   │   │   ├── script/
│   │   │   │   ├── model-test.py
│   │   │   │   └── run.sh
│   │   │   ├── 性能调优.md
│   │   │   └── 日志分析.txt
│   │   ├── mindspore/
│   │   │   ├── MindSpore-note.md
│   │   │   ├── README.md
│   │   │   ├── bert.md
│   │   │   ├── reference.md
│   │   │   └── 镜像.md
│   │   ├── modellink/
│   │   │   ├── README.md
│   │   │   ├── dataset.md
│   │   │   ├── llm.md
│   │   │   ├── qwen.md
│   │   │   ├── 环境-20240521.md
│   │   │   └── 环境安装.md
│   │   ├── msmodelslim/
│   │   │   ├── README.md
│   │   │   └── llm_quant/
│   │   │       ├── baichuan2-w8a8.py
│   │   │       ├── calib_set.json
│   │   │       └── qwen1.5-72b-w8a16.py
│   │   ├── openmind/
│   │   │   └── README.md
│   │   ├── peft/
│   │   │   ├── README.md
│   │   │   └── finetune-lora.py
│   │   ├── pytorch/
│   │   │   ├── README.md
│   │   │   └── llm-lora.py
│   │   ├── standford-alpaca/
│   │   │   ├── README.md
│   │   │   ├── ds_config_zero2.json
│   │   │   ├── ds_config_zero3.json
│   │   │   ├── requirements.txt
│   │   │   ├── train.py
│   │   │   └── utils.py
│   │   ├── transformers/
│   │   │   └── README.md
│   │   ├── vllm-ascend/
│   │   │   └── README.md
│   │   ├── 优质学习资料.md
│   │   ├── 昇腾LLM支持概览.md
│   │   └── 昇腾卡注意事项.md
│   ├── modelscope/
│   │   └── README.md
│   ├── paddle/
│   │   └── PaddleNLP.md
│   └── tianshuzhixin/
│       ├── README.md
│       └── ixsmi.md
├── llm-maas/
│   ├── OpenAI-ChatGPT.md
│   └── README.md
├── llm-optimizer/
│   ├── FlashAttention.md
│   ├── README.md
│   ├── SplitFuse.md
│   ├── kv-cache.md
│   ├── xformers.md
│   └── 计算通信重叠.md
├── llm-pipeline/
│   └── REAEMD.md
├── llm-tools/
│   ├── Pytorch-Profiler.md
│   ├── README.md
│   ├── base-profiler.py
│   ├── nsight/
│   │   └── README.md
│   ├── nsight.md
│   ├── nvtx.md
│   ├── profiler-recipe.py
│   ├── tensorboard-profiler.py
│   └── 可视化.md
├── llm-train/
│   ├── README.md
│   ├── alpa/
│   │   └── train/
│   │       ├── pipeshard_parallelism.ipynb
│   │       └── pipeshard_parallelism.py
│   ├── alpaca/
│   │   ├── README.md
│   │   ├── ds_config.json
│   │   ├── ds_config_zero2.json
│   │   ├── ds_config_zero2_ddp.json
│   │   ├── inference.py
│   │   ├── train.py
│   │   └── train_ddp.py
│   ├── alpaca-lora/
│   │   ├── README.md
│   │   ├── export_hf_checkpoint.py
│   │   ├── export_state_dict_checkpoint.py
│   │   ├── finetune.py
│   │   ├── finetune_metrics_epoch.py
│   │   ├── generate.py
│   │   └── inference.py
│   ├── chatglm/
│   │   ├── README.md
│   │   ├── deepspeed.json
│   │   ├── ds_train_finetune.sh
│   │   ├── evaluate.sh
│   │   ├── evaluate_finetune.sh
│   │   ├── inference.py
│   │   ├── main.py
│   │   ├── train.sh
│   │   └── train_ptuningv2_dp.sh
│   ├── chatglm-lora/
│   │   ├── README.md
│   │   ├── finetune.py
│   │   ├── finetune_ddp.py
│   │   └── inference.py
│   ├── chinese-llama-alpaca/
│   │   ├── README.md
│   │   ├── inference_hf.py
│   │   ├── merge_llama_with_chinese_lora.py
│   │   ├── merge_tokenizers.py
│   │   ├── run_clm_pt_with_peft.py
│   │   ├── run_clm_sft_with_peft.py
│   │   ├── run_pt.sh
│   │   └── run_sft.sh
│   ├── deepspeedchat/
│   │   ├── README.md
│   │   ├── llama/
│   │   │   └── README.md
│   │   └── training/
│   │       ├── step1_supervised_finetuning/
│   │       │   └── training_scripts/
│   │       │       └── single_node/
│   │       │           └── run_13b.sh
│   │       ├── step2_reward_model_finetuning/
│   │       │   └── training_scripts/
│   │       │       └── single_node/
│   │       │           └── run_350m.sh
│   │       ├── step3_rlhf_finetuning/
│   │       │   └── training_scripts/
│   │       │       └── single_node/
│   │       │           └── run_13b.sh
│   │       └── utils/
│   │           └── data/
│   │               └── raw_datasets.py
│   ├── firefly/
│   │   ├── README.md
│   │   ├── bootstrap-s3.sh
│   │   ├── bootstrap.sh
│   │   ├── dockerfile.md
│   │   └── test_bash_getopts.sh
│   ├── fp8.md
│   ├── galore/
│   │   └── torchrun_main.py
│   ├── megatron/
│   │   ├── README.md
│   │   ├── codegeex/
│   │   │   └── README.md
│   │   ├── gpt2/
│   │   │   ├── README.md
│   │   │   ├── data/
│   │   │   │   ├── cMinhash.cpp
│   │   │   │   ├── download.py
│   │   │   │   ├── file_utils.py
│   │   │   │   └── merge_data.py
│   │   │   ├── gpt-data-preprocess.md
│   │   │   ├── merge_ck_and_inference/
│   │   │   │   ├── README.md
│   │   │   │   ├── checkpoint_loader_megatron.py
│   │   │   │   ├── checkpoint_saver_megatron.py
│   │   │   │   ├── checkpoint_util.py
│   │   │   │   ├── eval_gpt2_lambada.sh
│   │   │   │   ├── run_text_generation_server.py
│   │   │   │   ├── run_text_generation_server_345M.sh
│   │   │   │   ├── run_text_generation_server_345M_2tp_2dp.sh
│   │   │   │   ├── run_text_generation_server_345M_4_tensor_parallel.sh
│   │   │   │   └── text_generation_cli.py
│   │   │   ├── model_merge_eval_inference.md
│   │   │   ├── model_train.md
│   │   │   ├── requirements.txt
│   │   │   └── train/
│   │   │       ├── pretrain_gpt.sh
│   │   │       ├── pretrain_gpt_distributed.sh
│   │   │       ├── pretrain_gpt_distributed_with_4pp.sh
│   │   │       ├── pretrain_gpt_distributed_with_4tp.sh
│   │   │       └── pretrain_gpt_distributed_with_mp.sh
│   │   ├── megatron.drawio
│   │   ├── pretrain.xmind
│   │   ├── project.md
│   │   └── source-code.md
│   ├── megatron-deepspeed/
│   │   ├── README.md
│   │   ├── bigscience/
│   │   │   └── bloom-note.md
│   │   ├── bloom-megatron-deepspeed.md
│   │   ├── microsoft/
│   │   │   ├── H800多机多卡训练坑点.md
│   │   │   ├── README.md
│   │   │   ├── llama-note.md
│   │   │   ├── pip.conf
│   │   │   ├── pretrain_llama2_13b_distributed_fp16.sh
│   │   │   ├── pretrain_llama2_distributed.sh
│   │   │   ├── pretrain_llama_13b_distributed_fp16.sh
│   │   │   ├── pretrain_llama_7b_distributed_fp16.sh
│   │   │   ├── pretrain_llama_distributed_fp16.sh
│   │   │   ├── slurm/
│   │   │   │   ├── README.md
│   │   │   │   ├── llama-multinode-ib.sh
│   │   │   │   ├── megatron-deepspeed-multinode-ib-part2-30b-fp16.slurm
│   │   │   │   └── megatron-deepspeed-multinode-ib-part2-65b-fp16.slurm
│   │   │   ├── 代码.md
│   │   │   ├── 环境准备.md
│   │   │   ├── 训练日志分析.md
│   │   │   └── 项目结构-202312228.md
│   │   └── source-code.md
│   ├── paddle/
│   │   ├── README.md
│   │   └── paddlenlp/
│   │       ├── README.md
│   │       ├── baichuan2/
│   │       │   └── README.md
│   │       └── bloom/
│   │           ├── README.md
│   │           └── sft_argument.json
│   ├── peft/
│   │   ├── LoRA-QLoRA.md
│   │   ├── PEFT-API.md
│   │   ├── Prefix-Tuning.md
│   │   ├── Prompt-Tuning.md
│   │   ├── README.md
│   │   ├── clm/
│   │   │   ├── accelerate_ds_zero3_cpu_offload_config.yaml
│   │   │   ├── peft_ia3_clm.ipynb
│   │   │   ├── peft_lora_clm.ipynb
│   │   │   ├── peft_lora_clm_accelerate_ds_zero3_offload.py
│   │   │   ├── peft_p_tuning_clm.ipynb
│   │   │   ├── peft_p_tuning_lstm_clm.ipynb
│   │   │   ├── peft_p_tuning_v2_clm.ipynb
│   │   │   ├── peft_prefix_tuning_clm.ipynb
│   │   │   └── peft_prompt_tuning_clm.ipynb
│   │   ├── conditional_generation/
│   │   │   └── README.md
│   │   └── multimodal/
│   │       ├── blip2_lora_inference.py
│   │       ├── blip2_lora_int8_fine_tune.py
│   │       └── finetune_bloom_bnb_peft.ipynb
│   ├── pytorch/
│   │   ├── Pytorch源码解读.md
│   │   ├── README.md
│   │   ├── api.md
│   │   ├── distribution/
│   │   │   ├── README.md
│   │   │   ├── api.md
│   │   │   ├── data-parallel/
│   │   │   │   ├── README.md
│   │   │   │   ├── ddp_launch.py
│   │   │   │   ├── ddp_main.py
│   │   │   │   ├── elastic_ddp.py
│   │   │   │   ├── minGPT-ddp/
│   │   │   │   │   ├── README.md
│   │   │   │   │   ├── multinode.sh
│   │   │   │   │   ├── sbatch_run.sh
│   │   │   │   │   ├── sbatch_run_sig.sh
│   │   │   │   │   └── sbatch_run_sig_opt.sh
│   │   │   │   ├── sbatch_run.sh
│   │   │   │   └── 使用DDP训练真实世界的模型.md
│   │   │   ├── pipeline-parallel/
│   │   │   │   ├── 1-流水线.md
│   │   │   │   ├── 2-使用torchtext训练transformer模型.md
│   │   │   │   ├── 3-使用流水线并行训练Transformer模型.md
│   │   │   │   ├── 4-使用DDP与流水线并行训练Transformer模型.md
│   │   │   │   ├── README.md
│   │   │   │   ├── ddp_pipeline.py
│   │   │   │   ├── pipeline_tutorial.ipynb
│   │   │   │   └── transformer_tutorial.ipynb
│   │   │   ├── rpc/
│   │   │   │   └── README.md
│   │   │   ├── sequence-parallelism/
│   │   │   │   └── README.md
│   │   │   ├── tensor-parallel/
│   │   │   │   ├── 2d_parallel_example.py
│   │   │   │   ├── README.md
│   │   │   │   ├── sequence_parallel_example.py
│   │   │   │   ├── tensor_parallel_example.py
│   │   │   │   └── utils.py
│   │   │   ├── torchrun.md
│   │   │   ├── 分布式通信包.md
│   │   │   ├── 多机多卡.md
│   │   │   └── 多机训练.md
│   │   ├── resource.md
│   │   └── torchrun.md
│   ├── qlora/
│   │   ├── README.md
│   │   ├── accuracy.py
│   │   ├── export_hf_checkpoint.py
│   │   ├── inference.py
│   │   ├── inference_merge.py
│   │   ├── inference_qlora.py
│   │   └── qlora.py
│   ├── slurm/
│   │   ├── README.md
│   │   ├── deepspeed/
│   │   │   ├── pp-multinode-machine.slurm
│   │   │   ├── pp-multinode-singularity.slurm
│   │   │   ├── pp-mutinode-singularity-pmix.slurm
│   │   │   ├── pp-standalone-singularity-v2.slurm
│   │   │   └── pp-standalone-singularity.slurm
│   │   ├── megatron-deepspeed/
│   │   │   └── megatron-deepspeed-multinode-ib-part2-65b-fp16.slurm
│   │   └── pytorch/
│   │       ├── alpaca-docker.slurm
│   │       ├── alpaca-machine.slurm
│   │       ├── alpaca-singularity.slurm
│   │       ├── mingpt-singularity-multinode-2.slurm
│   │       └── mingpt-singularity-multinode.slurm
│   └── vicuna/
│       └── README.md
├── llmops/
│   ├── FAQ.md
│   ├── README.md
│   ├── kubernetes.md
│   ├── tq-llm/
│   │   └── train/
│   │       ├── FAQ.md
│   │       ├── README.md
│   │       ├── bootstrap-llm-zero3-offload.sh
│   │       ├── bootstrap-llm.sh
│   │       ├── bootstrap-llm2.sh
│   │       ├── zero2-offload.json
│   │       └── zero3-offload.json
│   ├── 使用docker进行多机多卡训练.md
│   ├── 千帆大模型平台.md
│   └── 模型推理平台方案.md
├── mkdir-dir-file.sh
├── paper/
│   ├── A Survey on Efficient Training of Transformers.md
│   ├── LESS-选择有影响力的数据进行目标指令精调.md
│   ├── LLM增强LLMS.md
│   ├── PagedAttention.md
│   ├── README.md
│   ├── data/
│   │   ├── LESS 实践：仅用少量的数据完成目标指令微调.md
│   │   ├── LESS-选择有影响力的数据进行目标指令精调.md
│   │   └── LESS.md
│   ├── inference/
│   │   ├── llm-in-a-flash.md
│   │   ├── orca.md
│   │   └── 迈向高效的生成式大语言模型服务综述-从算法到系统.md
│   ├── llm对齐综述.md
│   ├── moe/
│   │   └── README.md
│   ├── parameter-pruning/
│   │   ├── LLM-Pruner.md
│   │   ├── SparseGPT.md
│   │   ├── Wanda.md
│   │   └── 公式.md
│   └── training/
│       ├── A Survey on Efficient Training of Transformers.md
│       ├── GaLore.md
│       └── Reducing Activation Recomputation in Large Transformer Models.md
└── template/
    └── server.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# poetry
#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
#   This is especially recommended for binary packages to ensure reproducibility, and is more
#   commonly ignored for libraries.
#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#   in version control.
#   https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file.  For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/


================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: README.md
================================================
<p align="center">
  <img src="https://github.com/liguodongiot/llm-action/blob/main/pic/llm-action-v4.jpg" >
</p>


<p> 
<a href="https://github.com/liguodongiot/llm-action/stargazers">
<img src="https://img.shields.io/github/stars/liguodongiot/llm-action?style=social" > </a>
<a href="https://github.com/liguodongiot/llm-action/blob/main/pic/wx.jpg"> <img src="https://img.shields.io/badge/吃果冻不吐果冻皮-1AAD19.svg?style=plastic&logo=wechat&logoColor=white" > </a>
<a href="https://www.zhihu.com/people/liguodong-iot"> <img src="https://img.shields.io/badge/吃果冻不吐果冻皮-0079FF.svg?style=plastic&logo=zhihu&logoColor=white"> </a>
<a href="https://juejin.cn/user/3642056016410728"> <img src="https://img.shields.io/badge/掘金-吃果冻不吐果冻皮-000099.svg?style=plastic&logo=juejin"> </a>
<a href="https://liguodong.blog.csdn.net/"> <img src="https://img.shields.io/badge/CSDN-吃果冻不吐果冻皮-6B238E.svg"> </a>
<a href="https://www.lab4ai.cn/register?agentID=user-PqCML6LJZO"> <img src="https://img.shields.io/badge/Lab4AI-大模型实验室-1E90FF.svg"> </a>
</p> 


## <a name="目录"></a>Table of Contents

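- :snail: [LLM Training](#llm训练)
  - 🐫 [LLM Training in Practice](#llm训练实战)
  - 🐼 [Principles of Parameter-Efficient Fine-Tuning for LLMs](#llm微调技术原理)
  - 🐰 [Parameter-Efficient Fine-Tuning for LLMs in Practice](#llm微调实战)
  - 🐘 [Parallelism Techniques for Distributed LLM Training](#llm分布式训练并行技术)
  - 🌋 [Distributed AI Frameworks](#分布式ai框架)
  - 📡 [Network Communication for Distributed Training](#分布式训练网络通信)
  - :herb: [LLM Training Optimization Techniques](#llm训练优化技术)
  - :hourglass: [LLM Alignment Techniques](#llm对齐技术)
- 🐎 [LLM Inference](#llm推理)
  - 🚀 [LLM Inference Frameworks](#llm推理框架)
  - ✈️ [LLM Inference Optimization Techniques](#llm推理优化技术)
- ♻️ [LLM Compression](#llm压缩)
  - 📐 [LLM Quantization](#llm量化)
  - 🔰 [LLM Pruning](#llm剪枝)
  - 💹 [LLM Knowledge Distillation](#llm知识蒸馏)
  - ♑️ [Low-Rank Decomposition](#低秩分解)
- :herb: [LLM Evaluation](#llm测评)
  - 🔯 [LLM Quality Evaluation](#llm效果评测)
  - 🔘 [LLM Inference Performance Stress Testing](#llm推理性能压测)
- :palm_tree: [LLM Data Engineering](#llm数据工程)
  - :dolphin: [Efficient Data Selection for LLM Fine-Tuning](#llm微调高效数据筛选技术)
- :cyclone: [Prompt Engineering](#提示工程)
- ♍️ [LLM Algorithms and Architectures](#llm算法架构)
- :jigsaw: [LLM Application Development](#llm应用开发)
- 🀄️ [Adapting LLMs to Chinese Domestic Hardware](#llm国产化适配)
- 🔯 [AI Compilers](#ai编译器)
- 🔘 [AI Infrastructure](#ai基础设施)
  - :maple_leaf: [AI Accelerators](#ai加速卡)
  - :octocat: [AI Cluster Network Communication](#ai集群网络通信)
- 💟 [LLMOps](#llmops)
- 🍄 [LLM Ecosystem Technologies](#llm生态相关技术)
- 💹 [LLM Performance Analysis](#llm性能分析)
- :dizzy: [LLM Interview Questions](#llm面试题)
- 🔨 [Server Base Environment Setup and Common Tools](#服务器基础环境安装及常用工具)
- 💬 [LLM Study Group](#llm学习交流群)
- 👥 [WeChat Official Account](#微信公众号)
- ⭐️ [Star History](#star-history)
- :link: [Recommended AI Engineering Courses](#ai工程化课程推荐)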


## Lab4AI (大模型实验室): Affordable GPU Compute

**Hands-on projects using Lab4AI GPU compute**


|  Topic      | Hands-on project          | Blog/Video     |
|:------------ |:-----------------------------:|:--------:| 
| Generating anime-style images with the Flux text-to-image model via ComfyUI      | [Link](https://www.lab4ai.cn/project/detail?utm_source=guodong&id=f82ca14acda040ba8a3412feb541ba29&type=project)          | [Link](https://mp.weixin.qq.com/s/OEDQO-IkT4uo_HMjBXGuCA)     |
| Beyond traditional customer service: building a dedicated ticket-booking LLM in three steps with LLaMA-Factory, no code required      | [Link](https://www.lab4ai.cn/project/detail?utm_source=guodong&id=a78043adcef84cd998516e1bcd39562f&type=project)          | [Link](https://mp.weixin.qq.com/s/N_CQEBEjN0E31x4Vg31rEQ)    |
| Building a Su Dongpo digital human with multimodal AI      | [Link](https://www.lab4ai.cn/project/detail?utm_source=guodong&id=1f1097f45ea64abca3359e4c0615720a&type=project)          | -     |
| WeClone: a one-stop solution for creating a digital avatar from chat history      | [Link](https://www.lab4ai.cn/project/detail?utm_source=guodong&id=ab83d14684fa45d197f67eddb3d8316c&type=project)          | [Link](https://mp.weixin.qq.com/s/2pOD8YexWtmuPhV4C7uKJA)     |
| LightX2V 4-step distilled model: high-quality video generation at 20x speed      | [Link](https://www.lab4ai.cn/project/detail?utm_source=guodong&id=d5556b93078d4defbb58c9f722b674df&type=project)          | [Link](https://mp.weixin.qq.com/s/kVz1dwthn3nOLT0jTeiQgg)     |
| An immersive Su Dongpo role-playing LLM based on Qwen3-8B      | [Link](https://www.lab4ai.cn/project/detail?utm_source=guodong&id=315457fba1b3432c935865d1c5aa1ffe&type=project)          | [Link](https://mp.weixin.qq.com/s/bCCHa2RsKieJZizORU19dQ)     |
| A new paradigm for lightweight deployment with LightLLM: building a high-performance legal agent      | [Link](https://www.lab4ai.cn/project/detail?utm_source=guodong&id=b417085ae8cd4dd0bef7161c3d583b15&type=project)          | [Link](https://mp.weixin.qq.com/s/j8rJyoBA02ypPEkxb9XSVg)     |
| RoboMIND: a benchmark for multi-embodiment general robot intelligence      | [Link](https://www.lab4ai.cn/project/detail?utm_source=guodong&id=492a471cd6054a179660c760f0026704&type=project)          | [Link](https://mp.weixin.qq.com/s/i_QPGuqaXfql6cPELxlUVg)     |
| Classic paper reproduction: "Attention Is All You Need"      | [Link](https://www.lab4ai.cn/paper/detail?utm_source=guodong&id=e90aa38fdff9420e8902bc71909fa005&type=paper)          | [Link](https://www.bilibili.com/video/BV1Fvp3zBEAN/?spm_id_from=333.1387.homepage.video_card.click)     |
| Classic paper reproduction: "SELF-INSTRUCT: Aligning Language Models <br> with Self-Generated Instructions" | [Link](https://www.lab4ai.cn/paper/detail?utm_source=guodong&id=2bbf2f4971f74c6e8def26879233f2fe&type=paper)          | -     |



**GPU compute promotions**

- Short on resources? Lab4AI offers H800 GPUs, with H800 pricing advertised as a better deal than a 4090: [Details](https://mp.weixin.qq.com/s/61OtlvP3N4vl0D67eCzSWA)


**Compute credits**

- Lab4AI: [Claim 50 CNY of free GPU compute](https://www.lab4ai.cn/register?agentID=user-PqCML6LJZO)
- Lab4AI group chat: [Join here](https://github.com/liguodongiot/liguodongiot/tree/main/images/lab4ai.png)



**AI bootcamps**

- AI Application Development Engineer Skills & Spring Recruitment Interview Bootcamp: [Join here](https://www.lab4ai.cn/course/detail?utm_source=guodong&id=2b86361ed6a54611850c073defe04327)
- Stanford CS336, Building a Large Language Model from Scratch: [Join here](https://www.lab4ai.cn/course/detail?utm_source=guodong&id=49325466ca58436782b65a887883805f)
- 7-Day AI Agent Full-Stack Development Bootcamp: [Join here](https://www.lab4ai.cn/course/detail?utm_source=guodong&id=f3fba5d60b2542bf8783e59dcc24d836)



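## <a name="llm训练"></a>LLM Training

### <a name="llm训练实战"></a>LLM Training in Practice

The table below collects all of the training tutorials from my hands-on work with large models. They cover models from 6B to 65B parameters and methods ranging from full fine-tuning to parameter-efficient fine-tuning (LoRA, QLoRA, P-Tuning v2) and RLHF (reinforcement learning from human feedback).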

| LLM | Pre-training/SFT/RLHF... | Parameters | Tutorial | Code |
| --------------------------- | ----------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
| Alpaca | full fine-tuning | 7B | [Reproducing Stanford Alpaca 7B from scratch](https://zhuanlan.zhihu.com/p/618321077) | [Code](https://github.com/liguodongiot/llm-action/tree/main/llm-train/alpaca) |
| Alpaca(LLaMA) | LoRA | 7B~65B | 1. [Fine-tuning LLaMA (7B) in twenty minutes with Alpaca-LoRA, with results on par with Stanford Alpaca](https://zhuanlan.zhihu.com/p/619426866)<br>2. [Fine-tuning and inference for LLaMA 65B with LoRA](https://zhuanlan.zhihu.com/p/632492604) | [Code](https://github.com/liguodongiot/llm-action/tree/main/llm-train/alpaca-lora) |
| BELLE(LLaMA/Bloom) | full fine-tuning | 7B | 1. [Reproducing the open-source Chinese dialogue LLM BELLE on LLaMA-7B/Bloomz-7B1-mt, with GPTQ quantization](https://zhuanlan.zhihu.com/p/618876472) <br> 2. [Inference performance testing of BELLE (LLaMA-7B/Bloomz-7B1-mt) after GPTQ quantization](https://zhuanlan.zhihu.com/p/621128368) | N/A |
| ChatGLM | LoRA | 6B | [Parameter-efficient fine-tuning of ChatGLM-6B with LoRA from scratch](https://zhuanlan.zhihu.com/p/621793987) | [Code](https://github.com/liguodongiot/llm-action/tree/main/llm-train/chatglm-lora) |
| ChatGLM | full fine-tuning/P-Tuning v2 | 6B | [Fine-tuning ChatGLM-6B with DeepSpeed/P-Tuning v2](https://zhuanlan.zhihu.com/p/622351059) | [Code](https://github.com/liguodongiot/llm-action/tree/main/llm-train/chatglm) |
| Vicuna(LLaMA) | full fine-tuning | 7B | [Vicuna training and inference guide: the LLM race heats up, with results that outperform Stanford Alpaca](https://zhuanlan.zhihu.com/p/624012908) | N/A |
| OPT | RLHF | 0.1B~66B | 1. [One-click RLHF training with DeepSpeed Chat (1): Theory](https://zhuanlan.zhihu.com/p/626159553) <br> 2. [One-click RLHF training with DeepSpeed Chat (2): Practice](https://zhuanlan.zhihu.com/p/626214655) | [Code](https://github.com/liguodongiot/llm-action/tree/main/llm-train/deepspeedchat) |
| MiniGPT-4(LLaMA) | full fine-tuning | 7B | [Getting started with MiniGPT-4, a powerful multimodal LLM](https://zhuanlan.zhihu.com/p/627671257) | N/A |
| Chinese-LLaMA-Alpaca(LLaMA) | LoRA (pre-training + fine-tuning) | 7B | [Chinese LLaMA &amp; Alpaca LLMs: vocabulary expansion, pre-training, and instruction fine-tuning](https://zhuanlan.zhihu.com/p/631360711) | [Code](https://github.com/liguodongiot/llm-action/tree/main/llm-train/chinese-llama-alpaca) |
| LLaMA | QLoRA | 7B/65B | [QLoRA in practice: fine-tuning LLaMA-65B with only 48 GB of GPU memory](https://zhuanlan.zhihu.com/p/636644164) | [Code](https://github.com/liguodongiot/llm-action/tree/main/llm-train/qlora) |
| LLaMA | GaLore | 60M/7B | [Breaking the memory bottleneck: pre-training LLaMA-7B on a single consumer-grade 4090 GPU with GaLore](https://zhuanlan.zhihu.com/p/686686751) | [Code](https://github.com/liguodongiot/llm-action/blob/main/llm-train/galore/torchrun_main.py) |

**[⬆ Back to Table of Contents](#目录)**

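### <a name="llm微调技术原理"></a>Principles of LLM Fine-Tuning Techniques

For most people, pre-training or fully fine-tuning a large model is out of reach. This has given rise to a range of parameter-efficient fine-tuning techniques that let researchers and ordinary developers try fine-tuning large models.

These techniques are therefore worth analyzing in depth. This series covers them in roughly seven articles.

![PEFT methods](./pic/llm/train/sft/peft方法.jpg)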


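- [Parameter-Efficient Fine-Tuning Principles for Large Models, Part 1: Background and Introduction to Parameter-Efficient Fine-Tuning](https://zhuanlan.zhihu.com/p/635152813)
- [Parameter-Efficient Fine-Tuning Principles for Large Models, Part 2: BitFit, Prefix Tuning, Prompt Tuning](https://zhuanlan.zhihu.com/p/635686756)
- [Parameter-Efficient Fine-Tuning Principles for Large Models, Part 3: P-Tuning, P-Tuning v2](https://zhuanlan.zhihu.com/p/635848732)
- [Parameter-Efficient Fine-Tuning Principles for Large Models, Part 4: Adapter Tuning and Its Variants](https://zhuanlan.zhihu.com/p/636038478)
- [Parameter-Efficient Fine-Tuning Principles for Large Models, Part 5: LoRA, AdaLoRA, QLoRA](https://zhuanlan.zhihu.com/p/636215898)
- [Parameter-Efficient Fine-Tuning Principles for Large Models, Part 6: MAM Adapter, UniPELT](https://zhuanlan.zhihu.com/p/636362246)
- [Parameter-Efficient Fine-Tuning Principles for Large Models, Part 7: Best Practices and Summary](https://zhuanlan.zhihu.com/p/649755252)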

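### <a name="llm微调实战"></a>LLM Fine-Tuning in Practice

Below is the **Parameter-Efficient Fine-Tuning for Large Models in Practice** series, which mainly covers the parameter-efficient fine-tuning techniques supported by the HuggingFace PEFT framework.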

| Tutorial | Code | Framework |
| ------------- | --------------- | --------------- |
| [Parameter-Efficient Fine-Tuning in Practice (1): PEFT Overview and Environment Setup](https://zhuanlan.zhihu.com/p/651744834) | N/A | HuggingFace PEFT |
| [Parameter-Efficient Fine-Tuning in Practice (2): Prompt Tuning](https://zhuanlan.zhihu.com/p/646748939) | [Code](https://github.com/liguodongiot/llm-action/blob/main/llm-train/peft/clm/peft_prompt_tuning_clm.ipynb) | HuggingFace PEFT |
| [Parameter-Efficient Fine-Tuning in Practice (3): P-Tuning](https://zhuanlan.zhihu.com/p/646876256) | [Code](https://github.com/liguodongiot/llm-action/blob/main/llm-train/peft/clm/peft_p_tuning_clm.ipynb) | HuggingFace PEFT |
| [Parameter-Efficient Fine-Tuning in Practice (4): Prefix Tuning / P-Tuning v2](https://zhuanlan.zhihu.com/p/648156780) | [Code](https://github.com/liguodongiot/llm-action/blob/main/llm-train/peft/clm/peft_p_tuning_v2_clm.ipynb) | HuggingFace PEFT |
| [Parameter-Efficient Fine-Tuning in Practice (5): LoRA](https://zhuanlan.zhihu.com/p/649315197) | [Code](https://github.com/liguodongiot/llm-action/blob/main/llm-train/peft/clm/peft_lora_clm.ipynb) | HuggingFace PEFT |
| [Parameter-Efficient Fine-Tuning in Practice (6): IA3](https://zhuanlan.zhihu.com/p/649707359) | [Code](https://github.com/liguodongiot/llm-action/blob/main/llm-train/peft/clm/peft_ia3_clm.ipynb) | HuggingFace PEFT |
| [LLM Fine-Tuning in Practice (7): Fine-Tuning a Multimodal Large Model with LoRA](https://zhuanlan.zhihu.com/p/670048482) | [Code](https://github.com/liguodongiot/llm-action/blob/main/llm-train/peft/multimodal/blip2_lora_int8_fine_tune.py) | HuggingFace PEFT |
| [LLM Fine-Tuning in Practice (8): Fine-Tuning Large Models with INT8/FP4/NF4](https://zhuanlan.zhihu.com/p/670116171) | [Code](https://github.com/liguodongiot/llm-action/blob/main/llm-train/peft/multimodal/finetune_bloom_bnb_peft.ipynb) | PEFT, bitsandbytes |




**[⬆ Back to Table of Contents](#目录)**

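### <a name="llm分布式训练并行技术"></a>[Parallelism Techniques for Distributed LLM Training](https://github.com/liguodongiot/llm-action/tree/main/docs/llm-base/distribution-parallelism)

In recent years, the Transformer and MoE architectures have allowed deep learning models to scale past a trillion parameters, and the traditional single-node, single-GPU setup can no longer meet the demands of training such models. Training therefore has to be distributed across multiple GPUs on one node, or even across multiple nodes.

The primary goal of distributed machine learning is to use an AI cluster to train high-quality large models efficiently from large amounts of data. Achieving this generally means partitioning the computation, the training data, and the model according to how the hardware resources match the scale of the data and the model. The techniques behind distributed training are therefore worth analyzing in depth.

The following series explains the parallelism techniques used for distributed training of large models, in roughly nine articles.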

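- [Parallelism Techniques for Distributed Training of Large Models (1): Overview](https://zhuanlan.zhihu.com/p/598714869)
- [Parallelism Techniques for Distributed Training of Large Models (2): Data Parallelism](https://zhuanlan.zhihu.com/p/650002268)
- [Parallelism Techniques for Distributed Training of Large Models (3): Pipeline Parallelism](https://zhuanlan.zhihu.com/p/653860567)
- [Parallelism Techniques for Distributed Training of Large Models (4): Tensor Parallelism](https://zhuanlan.zhihu.com/p/657921100)
- [Parallelism Techniques for Distributed Training of Large Models (5): Sequence Parallelism](https://zhuanlan.zhihu.com/p/659792351)
- [Parallelism Techniques for Distributed Training of Large Models (6): Multi-Dimensional Hybrid Parallelism](https://zhuanlan.zhihu.com/p/661279318)
- [Parallelism Techniques for Distributed Training of Large Models (7): Automatic Parallelism](https://zhuanlan.zhihu.com/p/662517647)
- [Parallelism Techniques for Distributed Training of Large Models (8): MoE Parallelism](https://zhuanlan.zhihu.com/p/662518387)
- [Parallelism Techniques for Distributed Training of Large Models (9): Summary](https://zhuanlan.zhihu.com/p/667051845)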
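
As a rough illustration of partitioning the training data (data parallelism), the sketch below splits a batch across two hypothetical workers, computes a per-worker gradient for a one-parameter least-squares model, and averages the results. The toy model and the names `shard`, `local_gradient`, and `all_reduce_mean` are assumptions made for this example and are not APIs of any framework covered in this repository; real frameworks perform the averaging with a collective all-reduce.

```python
# Minimal data-parallel sketch (illustrative only, pure Python).
# Toy model: y_hat = w * x, loss = mean((w * x - y) ** 2).

def shard(batch, num_workers):
    """Split a batch into equally sized contiguous shards, one per worker."""
    assert len(batch) % num_workers == 0, "sketch assumes equal shard sizes"
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] for i in range(num_workers)]

def local_gradient(w, samples):
    """Gradient of the mean squared error w.r.t. w on one worker's samples."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the average."""
    return sum(values) / len(values)

batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # made-up (x, y) pairs
w = 0.5

full_grad = local_gradient(w, batch)
worker_grads = [local_gradient(w, s) for s in shard(batch, num_workers=2)]
dp_grad = all_reduce_mean(worker_grads)

# With equal shard sizes, the averaged gradient matches the full-batch gradient.
print(abs(full_grad - dp_grad) < 1e-9)
```

The equality only holds because every shard has the same size; with uneven shards the per-worker gradients would need to be weighted by shard size.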

**[⬆ Back to Table of Contents](#目录)**

### <a name="分布式ai框架"></a>Distributed AI Frameworks

- [PyTorch](https://github.com/liguodongiot/llm-action/tree/main/train/pytorch/)
  - PyTorch single-node multi-GPU training
  - PyTorch multi-node multi-GPU training
- [Megatron-LM](https://github.com/liguodongiot/llm-action/tree/main/train/megatron)
  - Megatron-LM single-node multi-GPU training
  - Megatron-LM multi-node multi-GPU training
  - [GPT-2 pre-training, evaluation, and inference from scratch with Megatron-LM](https://juejin.cn/post/7259682893648724029)
- [DeepSpeed](https://github.com/liguodongiot/llm-action/tree/main/train/deepspeed)
  - DeepSpeed single-node multi-GPU training
  - DeepSpeed multi-node multi-GPU training
- [Megatron-DeepSpeed](https://github.com/liguodongiot/llm-action/tree/main/train/megatron-deepspeed)
  - LLaMA pre-training from scratch with Megatron-DeepSpeed
  - Bloom pre-training from scratch with Megatron-DeepSpeed


### <a name="分布式训练网络通信"></a>Network Communication for Distributed Training

To be updated...


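### <a name="llm训练优化技术"></a>LLM Training Optimization Techniques

- FlashAttention V1 and V2
- Mixed-precision training
- Recomputation (activation checkpointing)
- MQA / GQA
- Gradient accumulation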


### <a name="llm对齐技术"></a>LLM Alignment Techniques


- PPO (Proximal Policy Optimization)
- DPO
- ORPO



**[⬆ Back to Table of Contents](#目录)**

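## <a name="llm推理"></a>[LLM Inference](https://github.com/liguodongiot/llm-action/tree/main/inference)


### Inference Engines

- [Overview of LLM inference frameworks](https://www.zhihu.com/question/625415776/answer/3243562246)
- [A good companion for large models: a brief look at the FasterTransformer inference acceleration engine](https://zhuanlan.zhihu.com/p/626008090)
- [TensorRT-LLM step-by-step tutorial (1): Quick start](https://zhuanlan.zhihu.com/p/666849728)
- [TensorRT-LLM step-by-step tutorial (2): Offline environment setup, model quantization, and inference](https://zhuanlan.zhihu.com/p/667572720)
- [TensorRT-LLM step-by-step tutorial (3): Deploying models with the Triton inference serving framework](https://juejin.cn/post/7398122968200593419)
- [Decoding strategies for LLM text generation explained](https://zhuanlan.zhihu.com/p/1921914053485376792)
- [On penalty parameters in LLM text generation](https://zhuanlan.zhihu.com/p/1965476299419132173)
- [Deterministic LLM inference](https://zhuanlan.zhihu.com/p/1961192621759242664)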


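Mini LLM inference engines (well suited to learning from the source code):

- [Annotated Nano-vLLM source](https://github.com/liguodongiot/nano-vllm): a lightweight vLLM implementation built from scratch.
- [Mini-SGLang](https://github.com/liguodongiot/mini-sglang): a lightweight yet high-performance LLM inference framework; a compact implementation of SGLang.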


Production-grade LLM inference engines:

- [vLLM](https://github.com/vllm-project/vllm)
- [SGLang](https://github.com/sgl-project/sglang)

Other inference engines:

- [LMDeploy](https://github.com/InternLM/lmdeploy)
- [LightLLM](https://github.com/ModelTC/lightllm): an LLM inference and serving framework written in pure Python
- [MNN-LLM](https://github.com/alibaba/MNN): an LLM runtime solution built on the MNN engine
- [Chitu (赤兔)](https://github.com/thu-pacman/chitu)
- [mllm](https://github.com/UbiquitousLearning/mllm): an on-device multimodal LLM inference engine



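### Inference Serving

- [Survey of model inference serving tools](https://zhuanlan.zhihu.com/p/721395381)
- [Triton inference serving framework step-by-step tutorial (1): Quick start](https://zhuanlan.zhihu.com/p/629336492)
- [Triton inference serving framework step-by-step tutorial (2): Architecture](https://zhuanlan.zhihu.com/p/634143650)
- [Triton inference serving framework step-by-step tutorial (3): Development practice](https://zhuanlan.zhihu.com/p/634444666)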


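### <a name="llm推理优化技术"></a>LLM Inference Optimization Techniques

- [LLM inference optimization techniques: Overview]()
- [LLM inference optimization: KV Cache](https://www.zhihu.com/question/653658936/answer/3569365986)
- [LLM inference serving scheduling optimization: Continuous batching](https://zhuanlan.zhihu.com/p/719610083)
- [Low-GPU-memory LLM inference optimization: Offloading](https://juejin.cn/post/7405158045628596224)
- [LLM inference optimization: KV Cache quantization](https://juejin.cn/post/7420231738558627874)
- [LLM inference optimization: Tensor parallelism]()
- [LLM inference serving scheduling optimization: Chunked Prefill]()
- [LLM inference optimization: Survey of KV Cache optimization methods]()
- LLM throughput optimization: Multi-LoRA inference serving
- LLM inference serving scheduling optimization: Fairness scheduling
- LLM memory-access optimization: FlashAttention
- LLM GPU-memory optimization: PagedAttention
- LLM decoding optimization: Speculative Decoding and its variants
- LLM inference optimization: Structured text generation
- Flash Decoding
- FlashDecoding++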


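## <a name="llm压缩"></a>LLM Compression

In recent years, the Transformer and MoE architectures have allowed deep learning models to scale past a trillion parameters, so models keep getting larger. Model compression techniques are therefore needed to lower deployment cost and improve inference performance.
Model compression falls into the following main categories:

-   Pruning
-   Knowledge distillation
-   Quantization
-   Low-rank factorization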

### <a name="llm量化"></a>[LLM Quantization](https://github.com/liguodongiot/llm-action/tree/main/model-compression/quantization)

This series covers several common LLM quantization schemes (GPTQ, LLM.int8(), SmoothQuant, AWQ, and others).

- [Overview of LLM quantization](https://www.zhihu.com/question/627484732/answer/3261671478)
- Quantization-aware training:
    - [Principles of quantization-aware training for LLMs: LLM-QAT](https://zhuanlan.zhihu.com/p/647589650)
    - [Principles of quantization-aware fine-tuning for LLMs: QLoRA]()
    - PEQA
- Post-training quantization:
    - [LLM quantization principles: GPTQ and LLM.int8()](https://zhuanlan.zhihu.com/p/680212402)
    - [LLM quantization principles: SmoothQuant](https://www.zhihu.com/question/576376372/answer/3388402085)
    - [LLM quantization principles: AWQ and AutoAWQ](https://zhuanlan.zhihu.com/p/681578090)
    - [LLM quantization principles: SpQR](https://zhuanlan.zhihu.com/p/682871823)
    - [LLM quantization principles: the ZeroQuant family](https://zhuanlan.zhihu.com/p/683813769)
    - [LLM quantization principles: FP8](https://www.zhihu.com/question/658712811/answer/3596678896)
    - [LLM quantization principles: FP6](https://juejin.cn/post/7412893752090853386)
    - [LLM quantization principles: KIVI, IntactKV, KVQuant](https://juejin.cn/post/7420231738558627874)
    - [LLM quantization principles: Atom and QuaRot](https://juejin.cn/post/7424334647570513972)
    - [LLM quantization principles: QoQ quantization and the QServe inference serving system](https://zhuanlan.zhihu.com/p/8047106486)
    - LLM quantization principles: QuIP, QuIP#, OmniQuant
    - [LLM quantization principles: FP4]()
- [LLM quantization principles: Summary](https://zhuanlan.zhihu.com/p/11886909512)
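
The schemes listed above are considerably more elaborate, but the basic idea of post-training weight quantization can be sketched with plain round-to-nearest absmax quantization to INT8. This is a generic baseline written for illustration, not the algorithm of GPTQ, AWQ, SmoothQuant, or any other method named here; the function names and the sample weights are hypothetical.

```python
def quantize_absmax_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.5, 0.731, -0.004]  # made-up values
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# Round-to-nearest keeps each weight within half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale)
print(max_error <= scale / 2 + 1e-12)
```

A single per-tensor scale is the simplest choice. One large outlier stretches the scale and costs precision for every other weight, which is one of the problems the more advanced schemes try to address.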



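### LLM Sparsification

- [A long-form survey of deep neural network pruning](https://zhuanlan.zhihu.com/p/692858636?)


At present, most compression work on large models focuses on quantization, that is, lowering the precision of the numerical representation of individual weights. Pruning, another compression approach, has received comparatively less study. It removes elements of the network, ranging from individual weights (unstructured pruning) to coarser-grained components such as entire rows or columns of a weight matrix (structured pruning).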
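
To make the two granularities concrete, the sketch below applies magnitude-based pruning to a tiny weight matrix. The unstructured variant zeroes the individual weights with the smallest absolute values, while the structured variant removes whole rows with the smallest L2 norm. Magnitude is used here only as a simple stand-in criterion; it is not the scoring rule of SparseGPT, Wanda, LLM-Pruner, or SliceGPT, and the function names are hypothetical.

```python
import math

def prune_unstructured(matrix, sparsity):
    """Zero the given fraction of individual weights with the smallest |w|."""
    flat = [(abs(w), r, c) for r, row in enumerate(matrix) for c, w in enumerate(row)]
    num_pruned = int(len(flat) * sparsity)
    pruned = [row[:] for row in matrix]      # copy; the shape stays the same
    for _, r, c in sorted(flat)[:num_pruned]:
        pruned[r][c] = 0.0
    return pruned

def prune_structured_rows(matrix, num_rows_to_drop):
    """Drop whole rows with the smallest L2 norm; the matrix gets smaller."""
    norms = [(math.sqrt(sum(w * w for w in row)), r) for r, row in enumerate(matrix)]
    drop = {r for _, r in sorted(norms)[:num_rows_to_drop]}
    return [row[:] for r, row in enumerate(matrix) if r not in drop]

W = [
    [0.9, -0.05, 0.4],
    [0.01, 0.02, -0.03],
    [-0.7, 0.6, 0.08],
]
print(prune_unstructured(W, sparsity=0.5))         # same shape, scattered zeros
print(prune_structured_rows(W, num_rows_to_drop=1))  # one fewer row
```

The unstructured result keeps the original shape and needs sparse kernels to run faster, whereas the structured result is a smaller dense matrix that ordinary dense kernels can use directly.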

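This series covers several common LLM sparsification schemes (LLM-Pruner, SliceGPT, SparseGPT, Wanda, and others).

- [LLM sparsification principles: Overview](https://www.zhihu.com/question/652126515/answer/3457652467)
- [LLM sparsification principles: Double Sparsity](https://zhuanlan.zhihu.com/p/1912877769827783344)
- LLM sparsification principles: LLM-Pruner and SliceGPT
- LLM sparsification principles: SparseGPT and Wanda
- LLM sparsification principles: Summary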


**Structured pruning**:

- LLM-Pruner(LLM-Pruner: On the Structural Pruning of Large Language Models)
- LLM-Shearing(Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning)
- SliceGPT: Compress Large Language Models by Deleting Rows and Columns
- LoSparse


**Unstructured pruning**:

- SparseGPT(SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot)
- LoRAPrune(LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning)
- Wanda(A Simple and Effective Pruning Approach for Large Language Models)
- Flash-LLM(Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity)



### LLM知识蒸馏

- [大模型知识蒸馏概述](https://www.zhihu.com/question/625415893/answer/3243565375)

**Standard KD**:

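The student model learns the common knowledge held by the teacher model (the LLM), such as its output distribution and feature information. This is similar to traditional KD.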

- MINILLM
- GKD
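
As a rough illustration of matching a teacher's output distribution, the sketch below computes the classic temperature-scaled forward-KL distillation loss. This is the traditional KD formulation and not the specific objective used by MINILLM or GKD. The function names and the default temperature are assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge.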

**EA-based KD**:

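EA-based KD goes beyond transferring the LLM's common knowledge to the student model and also distills its distinctive emergent abilities (EA). It is further divided into in-context learning (ICL), chain-of-thought (CoT), and instruction following (IF).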

In-Context Learning：

- In-Context Learning distillation

Chain-of-Thought：

- MT-COT
- Fine-tune-CoT
- DISCO
- SCOTT
- SOCRATIC CoT

Instruction Following：

- Lion

### 低秩分解

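Low-rank decomposition approximates a given weight matrix by factoring it into two or more smaller matrices. The core idea is to find a factorization of a large weight matrix W into two matrices U and V such that W ≈ UV, where U is an m×k matrix, V is a k×n matrix, and k is much smaller than m and n. The product UV approximates the original weight matrix while storing k(m+n) parameters instead of mn, which substantially reduces the parameter count and the compute cost.

In LLM compression research, low-rank decomposition is usually combined with other techniques such as pruning and quantization.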

- ZeroQuant-FP（低秩分解+量化）
- LoRAPrune（低秩分解+剪枝）
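
To make the parameter saving concrete, the following sketch compares the size of a dense m×n matrix with its rank-k factors. The helper name and the example sizes (a 4096×4096 matrix with k = 64) are assumptions chosen for illustration.

```python
def low_rank_params(m, n, k):
    """Return (dense, factored, ratio) parameter counts for W (m x n) vs U (m x k) @ V (k x n)."""
    dense = m * n
    factored = k * (m + n)
    return dense, factored, factored / dense

dense, factored, ratio = low_rank_params(4096, 4096, 64)
print(dense, factored, ratio)  # 16777216 524288 0.03125
```

With these sizes the factored form keeps about 3% of the original parameters; how much accuracy survives depends on how well W is approximated by a rank-k matrix.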



## LLM测评



### LLM效果评测


- [C-Eval](https://github.com/liguodongiot/ceval)：全面的中文基础模型评估套件，涵盖了52个不同学科的13948个多项选择题，分为四个难度级别。
- [CMMLU](https://github.com/liguodongiot/CMMLU)：一个综合性的中文评估基准，专门用于评估语言模型在中文语境下的知识和推理能力。CMMLU涵盖了从基础学科到高级专业水平的67个主题。它包括：需要计算和推理的自然科学，需要知识的人文科学和社会科学,以及需要生活常识的中国驾驶规则等。此外，CMMLU中的许多任务具有中国特定的答案，可能在其他地区或语言中并不普遍适用。因此是一个完全中国化的中文测试基准。
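- [LVEval](https://github.com/liguodongiot/LVEval): a long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) and a maximum test length of 256k. The average text length in LV-Eval is 102,380 words, with minimum and maximum lengths of 11,896 and 387,406 words. LV-Eval covers two main task types, single-hop QA and multi-hop QA, across 11 Chinese and English evaluation subsets. Its design introduces three key techniques: Confusing Facts Insertion (CFI) to raise difficulty, Keyword and Phrase Replacement (KPR) to reduce information leakage, and a keyword-recall-based metric (Answer Keywords, AK, which combines answer keywords with a word blacklist) to make scores more objective.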
- [IFEval: Instruction Following Eval](https://github.com/google-research/google-research/tree/master/instruction_following_eval)/[Paper](https://arxiv.org/abs/2311.07911)：专注评估大模型遵循指令的能力,包含关键词检测、标点控制、输出格式要求等25种任务。
- [SuperCLUE](https://github.com/CLUEbenchmark/SuperCLUE)：一个综合性大模型评测基准，本次评测主要聚焦于大模型的四个能力象限，包括语言理解与生成、专业技能与知识、Agent智能体和安全性，进而细化为12项基础能力。
- [AGIEval](https://github.com/ruixiangcui/AGIEval/)：用于评估基础模型在与人类认知和解决问题相关的任务中的能力。该基准源自 20 项面向普通考生的官方、公开、高标准的入学和资格考试，例如：普通大学入学考试（例如：中国高考（Gaokao）和美国 SAT）、法学院入学考试、数学竞赛、律师资格考试、国家公务员考试。
- [OpenCompass](https://github.com/open-compass/opencompass/blob/main/README_zh-CN.md)：司南 2.0 大模型评测体系。
- [LongBench](https://github.com/THUDM/LongBench)：一个双语（中英文）多任务基准数据集，旨在评估大语言模型的长上下文理解能力。它包含21个任务，涵盖单文档问答、多文档问答、摘要、小样本学习、合成任务和代码补全等。数据集平均任务长度范围为5k到15k，共包含4750个测试数据。LongBench 采用全自动评估方法，旨在以最低的成本衡量和评估模型理解长上下文的能力。
- [EvalScope](https://github.com/modelscope/evalscope)：魔搭社区官方推出的模型评测与性能基准测试框架，专为多样化的模型评估需求而设计。它支持广泛的模型类型，包括但不限于大语言模型、多模态模型、Embedding 模型、Reranker 模型和 CLIP 模型。EvalScope还适用于多种评测场景，如端到端RAG评测、竞技场模式和模型推理性能压测等，其内置多个常用测试基准和评测指标，如MMLU、CMMLU、C-Eval、GSM8K等。



### LLM推理性能压测


- [你真的搞懂了LLM性能压测的各项指标吗？](https://zhuanlan.zhihu.com/p/1989359577871954448)
- [AIPerf](https://github.com/ai-dynamo/aiperf)：英伟达开源的性能测试工具
- [GuideLLM](https://github.com/vllm-project/guidellm)：vLLM开源的性能测试工具
- [EvalScope](https://github.com/modelscope/evalscope)：魔搭社区开源的性能测试工具
- [Inference Perf](https://github.com/kubernetes-sigs/inference-perf)
- [genai-bench](https://github.com/sgl-project/genai-bench)：SGLang开源的性能测试工具
- [GenAI-Perf](https://github.com/liguodongiot/perf_analyzer/tree/main/genai-perf)：英伟达开源的一个命令行工具（**已逐渐被淘汰，建议使用AIPerf**），用于测量通过推理服务提供生成式AI模型的吞吐量和延迟。GenAI-Perf 收集一组不同的指标来捕获推理服务的性能。

| Metric | Description | Aggregations |
| - | - | - |
| <span id="time_to_first_token_metric">Time to First Token</span> | Time between when a request is sent and when its first response is received, one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| <span id="time_to_second_token_metric">Time to Second Token</span> | Time between when the first streaming response is received and when the second streaming response is received, one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| <span id="inter_token_latency_metric">Inter Token Latency</span> | Time between intermediate responses for a single request divided by the number of generated tokens of the latter response, one value per response per request in benchmark | Avg, min, max, p99, p90, p75 |
| Request Latency | Time between when a request is sent and when its final response is received, one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| Output Sequence Length | Total number of output tokens of a request, one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| Input Sequence Length | Total number of input tokens of a request, one value per request in benchmark | Avg, min, max, p99, p90, p75 |
| <span id="output_token_throughput_metric">Output Token Throughput</span> | Total number of output tokens from benchmark divided by benchmark duration | None–one value per benchmark |
| <span id="request_throughput_metric">Request Throughput</span> | Number of final responses from benchmark divided by benchmark duration | None–one value per benchmark |
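
The metric definitions above can be sketched in a few lines. This is not GenAI-Perf's implementation: it assumes each streaming response carries exactly one token, takes timestamps in seconds, and averages the inter-token gaps per request. All function and key names are made up for this example.

```python
def request_metrics(sent_at, token_times):
    """Per-request metrics from a send time and per-token arrival times (seconds)."""
    if not token_times:
        raise ValueError("need at least one token arrival time")
    # Gaps between consecutive streamed tokens of the same request.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "time_to_first_token": token_times[0] - sent_at,
        "time_to_second_token": gaps[0] if gaps else None,
        "inter_token_latency": sum(gaps) / len(gaps) if gaps else None,
        "request_latency": token_times[-1] - sent_at,
        "output_sequence_length": len(token_times),
    }

def benchmark_throughput(all_token_times, duration):
    """Benchmark-level throughput: one value per benchmark run."""
    total_tokens = sum(len(times) for times in all_token_times)
    return {
        "output_token_throughput": total_tokens / duration,
        "request_throughput": len(all_token_times) / duration,
    }
```

The per-request values would then be aggregated (avg, min, max, percentiles) across the benchmark, while the throughput values are computed once per run.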




## LLM数据工程

LLM Data Engineering


### 预训练语料处理技术

![llm-pretrain-pipeline](./pic/llm/train/pretrain/llm-pretrain-pipeline-v2.png)

- 数据收集
- 数据处理
  - 去重
  - 过滤
  - 选择
  - 组合

### LLM微调高效数据筛选技术

- [LLM微调高效数据筛选技术原理-DEITA]()
- [LLM微调高效数据筛选技术原理-MoDS]()
- [LLM微调高效数据筛选技术原理-IFD]()
- [LLM微调高效数据筛选技术原理-CaR]()
- [LESS：仅选择5%有影响力的数据优于全量数据集进行目标指令微调](https://zhuanlan.zhihu.com/p/686007325)
- [LESS 实践：用少量的数据进行目标指令微调](https://zhuanlan.zhihu.com/p/686687923)



## 提示工程

- Zero-Shot Prompting
- Few-Shot Prompting
- Chain-of-Thought (CoT) Prompting
- Automatic Chain-of-Thought (Auto-CoT) Prompting
- Tree-of-Thoughts (ToT) Prompting



## [LLM算法架构](https://github.com/liguodongiot/llm-action/tree/main/docs/llm-base/ai-algo)

![llm-famliy](./pic/llm/model/llm-famliy.jpg)


- [大模型算法演进](https://zhuanlan.zhihu.com/p/600016134)

![llm-famliy](./pic/llm/model/llm-timeline-v2.png)

- [百川智能开源大模型baichuan-7B技术剖析](https://www.zhihu.com/question/606757218/answer/3075464500)
- [百川智能开源大模型baichuan-13B技术剖析](https://www.zhihu.com/question/611507751/answer/3114988669)
- [LLaMA3 技术剖析](https://www.zhihu.com/question/653374932/answer/3470909634)
- [大模型算法架构：DeepSeek技术演进及剖析](https://zhuanlan.zhihu.com/p/1912877300439037789)
- [大模型算法架构：QWen技术演进及剖析]()
- ChatGLM / ChatGLM2 / ChatGLM3 大模型解析
- Bloom 大模型解析
- LLaMA / LLaMA2 大模型解析
- [DeepSeek 视觉语言大模型技术演进（从DeepSeek VL/VL2到DeepSeek OCR）](https://zhuanlan.zhihu.com/p/1976731060562842519)
- Qwen3-Next




## LLM应用开发

大模型是基座，要想让其变成一款产品，我们还需要一些其他相关的技术，比如：向量数据库（Pinecone、Milvus、Vespa、Weaviate），LangChain等。

- [云原生向量数据库Milvus（一）-简述、系统架构及应用场景](https://zhuanlan.zhihu.com/p/476025527)
- [云原生向量数据库Milvus（二）-数据与索引的处理流程、索引类型及Schema](https://zhuanlan.zhihu.com/p/477231485)
- [关于大模型驱动的AI智能体Agent的一些思考](https://zhuanlan.zhihu.com/p/651921120)


### Agent应用



AI Assistant:

- [OpenClaw](https://github.com/openclaw/openclaw)：一款个人 AI 助手


Code Agent:

- [OpenCode](https://github.com/anomalyco/opencode)：一个开源代码智能体，[项目文档](https://opencode.ai/docs/zh-cn/)




## [LLM国产化适配](https://github.com/liguodongiot/llm-action/tree/main/docs/llm_localization)

随着 ChatGPT 的现象级走红，引领了AI大模型时代的变革，从而导致 AI 算力日益紧缺。与此同时，中美贸易战以及美国对华进行AI芯片相关的制裁导致 AI 算力的国产化适配势在必行。本系列将对一些国产化 AI 加速卡进行讲解。

- [大模型国产化适配1-华为昇腾AI全栈软硬件平台总结](https://zhuanlan.zhihu.com/p/637918406)
- [大模型国产化适配2-基于昇腾910使用ChatGLM-6B进行模型推理](https://zhuanlan.zhihu.com/p/650730807)
- [大模型国产化适配3-基于昇腾910使用ChatGLM-6B进行模型训练](https://zhuanlan.zhihu.com/p/651324599)
  - MindRecord数据格式说明、全量微调、LoRA微调
- [大模型国产化适配4-基于昇腾910使用LLaMA-13B进行多机多卡训练](https://zhuanlan.zhihu.com/p/655902796)
- [大模型国产化适配5-百度飞浆PaddleNLP大语言模型工具链总结](https://zhuanlan.zhihu.com/p/665807431)
- [大模型国产化适配6-基于昇腾910B快速验证ChatGLM3-6B/BaiChuan2-7B模型推理](https://zhuanlan.zhihu.com/p/677799157)
- [大模型国产化适配7-华为昇腾LLM落地可选解决方案（MindFormers、ModelLink、MindIE）](https://zhuanlan.zhihu.com/p/692377206)
- [MindIE 1.0.RC1 发布，华为昇腾终于推出了针对LLM的完整部署方案，结束小米加步枪时代](https://www.zhihu.com/question/654472145/answer/3482521709)
- [大模型国产化适配8-基于昇腾MindIE推理工具部署Qwen-72B实战（推理引擎、推理服务化）](https://juejin.cn/post/7365879319598727180)
  - Qwen-72B、Baichuan2-7B、ChatGLM3-6B
- [大模型国产化适配9-LLM推理框架MindIE-Service性能基准测试](https://zhuanlan.zhihu.com/p/704649189)
- [大模型国产化适配10-快速迁移大模型到昇腾910B保姆级教程（Pytorch版）](https://juejin.cn/post/7375351908896866323)
- [大模型国产化适配11-LLM训练性能基准测试（昇腾910B3）](https://juejin.cn/post/7380995631790964772)
- [国产知名AI芯片厂商产品大揭秘-昇腾、海光、天数智芯...](https://f46522gm22.feishu.cn/docx/PfWfdMKo8oXYN6xi7uycuhgFnKg)
- [国内AI芯片厂商的计算平台大揭秘-昇腾、海光、天数智芯...](https://f46522gm22.feishu.cn/docx/XnhcdXVDholUBpxYoMccS11Mnfc)
- [【LLM国产化】量化技术在MindIE推理框架中的应用](https://juejin.cn/post/7416723051377377316)




**[⬆ 一键返回目录](#目录)**


## [AI编译器](https://github.com/liguodongiot/llm-action/tree/main/ai-compiler)

An AI compiler takes a machine learning algorithm from its development form to a deployable form by transforming and optimizing it.

- [AI编译器技术剖析（一）-概述](https://zhuanlan.zhihu.com/p/669347560)
- [AI编译器技术剖析（二）-传统编译器](https://zhuanlan.zhihu.com/p/671477784)
- [AI编译器技术剖析（三）-树模型编译工具 Treelite 详解](https://zhuanlan.zhihu.com/p/676723324)
- [AI编译器技术剖析（四）-编译器前端]()
- [AI编译器技术剖析（五）-编译器后端]()
- [AI编译器技术剖析（六）-主流编译框架]()
- [AI编译器技术剖析（七）-深度学习模型编译优化]()
- [lleaves：使用 LLVM 编译梯度提升决策树将预测速度提升10+倍](https://zhuanlan.zhihu.com/p/672584013)

框架：

- MLIR
- XLA
- TVM


## AI基础设施

- [AI 集群基础设施 NVMe SSD 详解](https://zhuanlan.zhihu.com/p/672098336)
- [AI 集群基础设施 InfiniBand 详解](https://zhuanlan.zhihu.com/p/673903240)
- [大模型训练基础设施：算力篇]()


### AI加速卡

- [AI芯片技术原理剖析（一）：国内外AI芯片概述](https://zhuanlan.zhihu.com/p/667686665)
- AI芯片技术原理剖析（二）：英伟达GPU 
- AI芯片技术原理剖析（三）：谷歌TPU

### AI集群

待更新...


### [AI集群网络通信](https://github.com/liguodongiot/llm-action/tree/main/docs/llm-base/network-communication)

待更新...

- 分布式训练网络通讯原语
- AI 集群通信软硬件


## LLMOps

- [在 Kubernetes 上部署机器学习模型的指南](https://zhuanlan.zhihu.com/p/676389726)
- [使用 Kubernetes 部署机器学习模型的优势](https://juejin.cn/post/7320513026188099619)



## LLM生态相关技术

- [大模型词表扩充必备工具SentencePiece](https://zhuanlan.zhihu.com/p/630696264)
- [大模型实践总结](https://www.zhihu.com/question/601594836/answer/3032763174)
- [ChatGLM 和 ChatGPT 的技术区别在哪里？](https://www.zhihu.com/question/604393963/answer/3061358152)
- [现在为什么那么多人以清华大学的ChatGLM-6B为基座进行试验？](https://www.zhihu.com/question/602504880/answer/3041965998)
- [为什么很多新发布的大模型默认使用BF16而不是FP16？](https://www.zhihu.com/question/616600181/answer/3195333332)
- [大模型训练时ZeRO-2、ZeRO-3能否和Pipeline并行相结合？](https://www.zhihu.com/question/652836990/answer/3468210626)
- [一文详解模型权重存储新格式 Safetensors](https://juejin.cn/post/7386360803039838235)
- [一文搞懂大模型文件存储格式新宠GGUF](https://juejin.cn/post/7408858126042726435)
- [DeepGEMM 技术剖析](https://juejin.cn/post/7520475965081813055)


## LLM性能分析


- PyTorch Profiler
- NVIDIA Nsight Systems 
- NVIDIA Nsight Compute


## [LLM面试题](https://github.com/liguodongiot/llm-action/blob/main/llm-interview/README.md)

正在收集中...

- [大模型基础常见面试题](https://github.com/liguodongiot/llm-action/blob/main/llm-interview/base.md)
- [大模型算法常见面试题](https://github.com/liguodongiot/llm-action/blob/main/llm-interview/llm-algo.md)
- [大模型训练常见面试题](https://github.com/liguodongiot/llm-action/blob/main/llm-interview/llm-train.md)
- [大模型微调常见面试题](https://github.com/liguodongiot/llm-action/blob/main/llm-interview/llm-ft.md)
- [大模型评估常见面试题](https://github.com/liguodongiot/llm-action/blob/main/llm-interview/llm-eval.md)
- [大模型压缩常见面试题](https://github.com/liguodongiot/llm-action/blob/main/llm-interview/llm-compress.md)
- [大模型推理常见面试题](https://github.com/liguodongiot/llm-action/blob/main/llm-interview/llm-inference.md)
- [大模型应用常见面试题](https://github.com/liguodongiot/llm-action/blob/main/llm-interview/llm-app.md)
- [大模型综合性面试题](https://github.com/liguodongiot/llm-action/blob/main/llm-interview/comprehensive.md)




**[⬆ 一键返回目录](#目录)**

## 服务器基础环境安装及常用工具

基础环境安装：

- [英伟达A800加速卡常见软件包安装命令](https://github.com/liguodongiot/llm-action/blob/main/docs/llm-base/a800-env-install.md)
- [英伟达H800加速卡常见软件包安装命令](https://github.com/liguodongiot/llm-action/blob/main/docs/llm-base/h800-env-install.md)
- [昇腾910加速卡常见软件包安装命令](https://github.com/liguodongiot/llm-action/blob/main/llm_localization/ascend910-env-install.md)

常用工具：

- [Linux 常见命令大全](https://juejin.cn/post/6992742028605915150)
- [Conda 常用命令大全](https://juejin.cn/post/7089093437223338015)
- [Poetry 常用命令大全](https://juejin.cn/post/6999405667261874183)
- [Docker 常用命令大全](https://juejin.cn/post/7016238524286861325)
- [Docker Dockerfile 指令大全](https://juejin.cn/post/7016595442062327844)
- [Kubernetes 常用命令大全](https://juejin.cn/post/7031201391553019911)
- [集群环境 GPU 管理和监控工具 DCGM 常用命令大全](https://github.com/liguodongiot/llm-action/blob/main/docs/llm-base/dcgmi.md)

## LLM学习交流群

我创建了大模型相关的学习交流群，供大家一起学习交流大模型相关的最新技术，目前已有5个群，每个群都有上百人的规模，**可加我微信进群**（加微信请备注来意，如：进大模型学习交流群+GitHub，进大模型推理加速交流群+GitHub、进大模型应用开发交流群+GitHub、进大模型校招交流群+GitHub等）。**一定要备注哟，否则不予通过**。

PS：**成都有个本地大模型交流群，想进可以另外单独备注下。**

<p align="center">
  <img src="https://github.com/liguodongiot/llm-action/blob/main/pic/wx.jpg">
</p>

## 微信公众号

微信公众号：**吃果冻不吐果冻皮**，该公众号主要分享AI工程化（大模型、MLOps等）相关实践经验，免费电子书籍、论文等。

<p align="center">
  <img src="https://github.com/liguodongiot/llm-action/blob/main/pic/wx-gzh.png" >
</p>

**[⬆ 一键返回目录](#目录)**

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=liguodongiot/llm-action&type=Date)](https://star-history.com/#liguodongiot/llm-action&Date)


## AI工程化课程推荐

如今人工智能的发展可谓是如火如荼，ChatGPT、Sora、文心一言等AI大模型如雨后春笋般纷纷涌现。AI大模型优势在于它能处理复杂性问题；因此，越来越多的企业需要具备**AI算法设计、AI应用开发、模型推理加速及模型压缩**等AI工程化落地的能力。这就导致行业内的工程师，需要快速提升自身的技术栈，以便于在行业内站稳脚跟。我在[llm-resource](https://github.com/liguodongiot/llm-resource) 和 [ai-system](https://github.com/liguodongiot/ai-system)梳理了一些大模型和AI工程化相关资料。








================================================
FILE: ai-compiler/README.md
================================================

## 树模型编译器

- https://mlsys.org/Conferences/doc/2018/196.pdf
- https://github.com/dmlc/treelite
- https://treelite.readthedocs.io/en/latest/


- https://zhuanlan.zhihu.com/p/347514385
- https://zhuanlan.zhihu.com/p/487539515

Treelite是用于有效部署决策树集合的模型编译器。


- 决策树Ensemble的编译优化：https://zhuanlan.zhihu.com/p/597511551



## 深度学习编译器


### 深度学习编译原理

#### AI 编译器前端优化

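Front-end optimization is one of the main modules of the AI compiler architecture. Its target is the computation graph, which is produced by the AI framework. Note that not every AI framework generates a computation graph. Once a graph is available, it can be optimized using knowledge of how deep learning works.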

- 计算图层（Graph IR）
- 算子融合（OP Fusion）
- 布局转换（Layout Transform）
- 内存分配（Memory Allocation）
- 常量折叠（Constant Fold）
- 公共子表达式消除（CSE）
- 死代码消除（DCE）
- 代数简化（ARM）



#### AI 编译器后端优化

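Back-end optimization is the module that connects the AI compiler to the hardware. It mostly optimizes individual operators or kernels. Before that, the computation graph must be lowered into an IR such as a schedule tree, after which loop, instruction, and memory optimizations are applied to each operator/kernel.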


- 算子循环优化
- 指令和内存优化



### 深度学习编译工具

#### TVM 


#### XLA


#### Glow






## 深度学习编译优化


- 深度学习框架的编译与优化：https://github.com/microsoft/AI-System/tree/main/Textbook/%E7%AC%AC5%E7%AB%A0-%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E6%A1%86%E6%9E%B6%E7%9A%84%E7%BC%96%E8%AF%91%E4%B8%8E%E4%BC%98%E5%8C%96



计算图优化：XLA、TVM、nGraph

算子生成：AutoTVM、TC、Halide


### 图优化

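Goal: simplify the computation graph through equivalence-preserving graph transformations, reducing computational complexity or memory overhead.

As the high-level intermediate representation in a deep learning framework, the dataflow graph allows any equivalence-preserving optimization pass to simplify the graph or improve execution efficiency.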


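Graph optimization (1): arithmetic expression simplification

Simplify the computation graph through equivalent algebraic transformations, for example:

```
a * 0 -> 0
a * broadcast(0) -> broadcast(0)
a * 1 -> a
a * broadcast(1) -> a
a + 0 -> a
a + broadcast(0) -> a
log(exp(x) / y) -> x - log(y)
```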
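
A few of these rewrite rules can be sketched as a recursive pass over a toy expression tree. The tuple encoding `(op, lhs, rhs)` and the function name are assumptions for illustration and do not correspond to any particular compiler's IR. Real compilers also have to account for NaN and Inf before rewriting `a * 0` to `0`.

```python
def simplify(expr):
    """Apply a * 0 -> 0, a * 1 -> a and a + 0 -> a bottom-up on (op, lhs, rhs) tuples."""
    if not isinstance(expr, tuple):
        return expr  # a leaf: a variable name or a numeric constant
    op, lhs, rhs = expr
    lhs, rhs = simplify(lhs), simplify(rhs)
    if op == "mul":
        if lhs == 0 or rhs == 0:
            return 0
        if lhs == 1:
            return rhs
        if rhs == 1:
            return lhs
    if op == "add":
        if lhs == 0:
            return rhs
        if rhs == 0:
            return lhs
    return (op, lhs, rhs)

# (a * 1) + (b * 0) collapses to just "a".
print(simplify(("add", ("mul", "a", 1), ("mul", "b", 0))))  # a
```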


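Graph optimization (2): common subexpression elimination

The goal is to eliminate expressions that have identical inputs, replacing them with a single node whose result is reused.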




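Graph optimization (3): constant propagation

If all input tensors of an operator are constants, the operator's result is also a constant tensor.

It can be computed at compile time and the graph simplified accordingly.

Example: assuming parameter 0 and parameter 1 are constant tensors, what does the final graph simplify to?

Note: constant propagation may increase memory usage, for example with Broadcast.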
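
A minimal sketch of the idea, using a toy `(op, lhs, rhs)` tuple encoding that is an assumption of this example: any node whose inputs are all numeric constants is evaluated once at "compile time" and replaced by its value.

```python
import operator

# Hypothetical operator table for the toy expression encoding.
OPS = {"add": operator.add, "mul": operator.mul}

def fold_constants(expr):
    """Replace every sub-expression whose inputs are all constants by its value."""
    if not isinstance(expr, tuple):
        return expr  # a leaf: a variable name or a numeric constant
    op, lhs, rhs = expr
    lhs, rhs = fold_constants(lhs), fold_constants(rhs)
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return OPS[op](lhs, rhs)
    return (op, lhs, rhs)

# x + (2 * 3) becomes x + 6; the multiplication no longer runs at inference time.
print(fold_constants(("add", "x", ("mul", 2, 3))))  # ('add', 'x', 6)
```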




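Graph optimization (4): automatic GEMM fusion

Batch same-type operators to leverage the GPU's massive parallelism.

Identical operators are merged into one larger operator by concatenating their input tensors into a single large tensor, which makes better use of hardware parallelism.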
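
The merge can be illustrated with plain Python lists. The sketch assumes the common case where several GEMMs share the same left-hand input `x` (for example, parallel projections of one activation), so their weight matrices can be concatenated column-wise and computed in one call. The helper names are hypothetical.

```python
def matmul(a, b):
    """Naive matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def concat_columns(*mats):
    """Concatenate matrices with the same row count side by side."""
    return [[v for m in mats for v in m[i]] for i in range(len(mats[0]))]

def fused_gemm(x, weights):
    """Compute x @ w for every w with a single large GEMM, then split the result."""
    out = matmul(x, concat_columns(*weights))
    results, start = [], 0
    for w in weights:
        width = len(w[0])
        results.append([row[start:start + width] for row in out])
        start += width
    return results

x = [[1, 2], [3, 4]]
w1 = [[1, 0], [0, 1]]
w2 = [[2], [3]]
print(fused_gemm(x, [w1, w2]))  # [[[1, 2], [3, 4]], [[8], [18]]]
```

One large GEMM replaces two small ones, which on a GPU means fewer kernel launches and better occupancy.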



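Graph optimization (5): operator fusion

Several vectorized operators can be merged into a single vectorized operation.

This reduces kernel launch overhead.

It also reduces memory reads and increases compute density.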



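Graph optimization (6): subgraph substitution

Use subgraph matching to identify complex subgraphs that can be replaced, and substitute more efficient fused operators for them.


Graph optimization (6): randomized subgraph substitution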

TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions


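Summary: graph optimization

The computation graph is the first-level intermediate representation of a deep learning compiler framework.

```
Graph-based optimization algorithms:
   Arithmetic expression simplification
   Common subexpression elimination
   Constant propagation
   Matrix (GEMM) merging
   Operator fusion
   Subgraph substitution / randomized subgraph substitution

Questions to consider:
   What other optimizations on the computation graph can you think of?
   What other benefits does a computation graph offer?
```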



### 内存优化

Goal: reduce total memory usage by transforming the computation graph and allocating tensors sensibly.



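Memory optimization (1): minimum memory allocation based on topological order

Sort the computation graph in some topological order, such as BFS or reverse DFS.

Allocate the output tensors of each node in that order.

When no later operator uses a tensor, return it to the memory pool.

Once all tensors have been allocated, the peak size of the memory pool is the minimum memory the graph needs under that order.

The choice of topological order affects both the model's compute time and its peak memory footprint.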
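
The procedure can be simulated on a toy graph. In this sketch each node produces one output tensor of a given size, and a tensor is released once its last consumer has run. The graph, the sizes, and the function name are assumptions for illustration.

```python
def peak_memory(order, sizes, consumers):
    """Peak live-tensor size when executing nodes in the given topological order."""
    remaining = {node: set(users) for node, users in consumers.items()}
    live, peak = {}, 0
    for node in order:
        live[node] = sizes[node]  # allocate this node's output tensor
        peak = max(peak, sum(live.values()))
        for producer in list(live):
            remaining[producer].discard(node)
            # Release a tensor once every consumer of it has executed.
            if producer != node and not remaining[producer]:
                del live[producer]
    return peak

# Two branches (a -> b and c -> d) that join in e; a and c produce large tensors.
sizes = {"a": 8, "b": 1, "c": 8, "d": 1, "e": 1}
consumers = {"a": {"b"}, "b": {"e"}, "c": {"d"}, "d": {"e"}, "e": set()}

print(peak_memory(["a", "b", "c", "d", "e"], sizes, consumers))  # 10
print(peak_memory(["a", "c", "b", "d", "e"], sizes, consumers))  # 17
```

Finishing one branch before starting the other frees the large tensor `a` early, so the same graph peaks at 10 units instead of 17.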



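Memory optimization (2): optimal memory placement via integer linear programming

```
Objective: minimize execution time for an arbitrary computation graph
Constraint: limited fast memory, such as GPU memory
Variables: whether each tensor is placed in fast memory or in slower external memory, such as CPU memory
Process: optimize tensor movement
Requirement: measured execution time of each kernel
Method: model the optimization problem above as an integer linear program
```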




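Memory optimization (3): tensor swapping and tensor recomputation

```
Method:
DRAM capacity is large relative to GPU memory, so data can be moved between the GPU and DRAM, or simply recomputed

Observation:
Tensor access patterns during training are fairly regular

Core idea:
Manage memory dynamically based on runtime tensor accesses
Swap-out + prefetch
Recomputation
```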



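Summary: memory optimization

- Graph-based memory optimization algorithms:
  - Minimum memory allocation based on topological order
  - Optimal memory placement via integer linear programming
  - Tensor swapping and tensor recomputation
- Question to consider:
  - What other memory optimization methods can you think of?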



### 内核优化

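Problem: every backend platform needs at least one separately implemented kernel for every operator.

Considerations: programming model, data layout, threading model, cache size, and so on.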





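Tensor expression compilation

```
Core idea: separate the computation logic from the scheduling logic
    Tensor expressions describe each operator's generic computation logic
    A schedule language describes the operator's scheduling space when it is mapped onto specific hardware

Related work:
TVM, Halide, TACO, Tensor Comprehensions, FlexTensor, etc.

Tensor expression, for example in TVM IR:

C = A * B
C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] * B[k, j], axis=k))
```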
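
The separation of computation from schedule can be illustrated without TVM. In the sketch below both functions compute the same matrix product, so the "what" is identical, but they traverse the iteration space differently: the second applies loop tiling, one of the transformations a schedule language typically exposes. The function names and tile size are assumptions for this example.

```python
def matmul_naive(a, b):
    """Reference schedule: plain i, j, k loop nest."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for kk in range(k):
                c[i][j] += a[i][kk] * b[kk][j]
    return c

def matmul_tiled(a, b, tile=2):
    """Same computation with every loop split into tiles for better cache reuse."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        for kk in range(k0, min(k0 + tile, k)):
                            c[i][j] += a[i][kk] * b[kk][j]
    return c
```

Because the result does not depend on the schedule, a compiler or auto-tuner is free to search over tile sizes and loop orders for the fastest variant on a given device.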


Other tensor expressions

```
Affine Transformation

out = tvm.compute((n, m), lambda i, j: tvm.sum(data[i, k] * w[j, k], k))
out = tvm.compute((n, m), lambda i, j: out[i, j] + bias[i])

Convolution
out = tvm.compute((c, h, w), lambda i, x, y: tvm.sum(data[kc, x+kx, y+ky] * w[i, kx, ky], [kx, ky, kc]))


ReLU
out = tvm.compute(shape, lambda *i: tvm.max(0, out(*i)))

```



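Other operator scheduling optimizations

Each optimization may yield several different kernel implementations.

An auto-tuner based on automated machine learning searches for the most efficient implementation within a given time budget.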




小结： 内核优化的总结

内核优化与内核生成

算子表达式
算子表示与调度逻辑的分离
自动调度搜索与代码生成



调度优化



NNFusion：全局计算调度优化

目标：通过将多个算子进行协同调度以及精确映射每一个算子到硬件计算单元来充分利用硬件并行度                                               

中间表示：数据流图+细粒度算子并行单元

结果：将每个子图编译成一个硬件计算内核

充分减少上层调度的开销

高效利用硬件并行度




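New scheduling primitives are introduced to support task-level scheduling:

APPEND: schedules one task of an operator onto one compute unit of the hardware

GROUP_SYNC: maintains the dependencies between tasks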



```
挑战：调度算子的任务到GPU上的挑战

简单的任务级调度可能引起正确性问题
- 错误依赖
- 死锁



依赖关系的映射

通过将硬件计算单元抽象到软件可控计算单元，并引入细粒度任务级同步支持来保证计算正确性。



并行度的映射

任务级调度可以支持任意算子之间的并行调度，从而最大化硬件利用率。



```

```
本次课程总结

深度神经网络编译器的概念与架构
	中间表达、前端、后端、优化过程

计算图优化
   算术表达式化简、 公共子表达式消除、 常数传播、矩阵合并、 算子融合、 子图替换/随机子图替换
内存优化
   基于拓扑序的最小内存分配、 根据整数线性规划求解最优内存放置、 张量换入换出与张量重计算
内核优化
   算子表达式、  算子表示与调度逻辑的分离、  自动调度搜索与代码生成
调度优化

```







================================================
FILE: ai-compiler/Treebeard/README.md
================================================

Treebeard: An Optimizing Compiler for Decision Tree Based ML Inference


## 流程

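Given a decision-tree data structure as input, the compiler applies a series of IR transformations that turn it into a more CPU-friendly data structure, which speeds up inference over the trees. Because the compiler carries domain knowledge about decision trees, it can also choose better strategies during codegen, for example for loop unrolling and loop interchange. The whole project is built on MLIR, so simple optimizations such as OpenMP parallelization can be taken directly from MLIR.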




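## Figures


TREEBEARD IR lowering and optimization details

The figure shows the three abstraction levels in the TREEBEARD IR. The high-level IR is a tree-based IR used for model-level optimizations, the mid-level IR is used for loop optimizations that are independent of memory layout, and the low-level IR enables vectorization and other memory-layout-dependent optimizations.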





Figure 6: Sparse representation with tile size nt = 3. Leaves l4, l5, l6 and l7 are moved into the leaves array. Extra hops are added for l1, l2 and l3 as T2 is a non-leaf tile. The new leaves added as children of l1, l2 and l3 are moved to the leaves array.




Figure 9: Geomean speedup (over all benchmarks) of TREEBEARD over XGBoost and Treelite on single-core over several batch sizes. 



Table I: List of benchmark datasets and their parameters. The column Leaf-biased reports the number of leaf-biased trees per benchmark with 〈α = 0.075,β = 0.9〉 . 


Table II: Space of optimizations explored. 


================================================
FILE: ai-compiler/treelit/README.md
================================================



```
conda create -n model-inference-venv python=3.9 -y


conda activate model-inference-venv
```







- 机器学习：软件工程方法与实现：https://github.com/chansonZ/book-ml-sem/







================================================
FILE: ai-compiler/treelit/xgb.md
================================================




```
conda create -n model-server-venv python=3.9 -y
```

================================================
FILE: ai-compiler/triton-lang/README.md
================================================





================================================
FILE: ai-framework/README.md
================================================





## 国外


### PyTorch





## 国内


### Oneflow




### PaddlePaddle




### MindSpore






自动混合精度

- https://github.com/Azure/MS-AMP
- FP8-LM





================================================
FILE: ai-framework/TensorRT-Model-Optimizer.md
================================================




- 代码：https://github.com/NVIDIA/TensorRT-Model-Optimizer
- 文档：https://nvidia.github.io/TensorRT-Model-Optimizer/

- 量化方法最佳实践：https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_choosing_quant_methods.html

================================================
FILE: ai-framework/cuda/README.md
================================================





================================================
FILE: ai-framework/deepspeed/1.DeepSpeed入门.md
================================================



## DeepSpeed 

Three simple steps convert PyTorch DDP training into DeepSpeed data-parallel training.

Step 1: **Initialize the DeepSpeed engine**:
```
model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=model,
                                                     model_parameters=params)
```
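deepspeed.initialize ensures that all the setup needed for distributed data-parallel or mixed-precision training is done properly under the hood.



Step 2: **Initialize the distributed environment**: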
```
deepspeed.init_distributed()
```

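DeepSpeed initializes the distributed environment automatically during its own initialization, so calling this function is optional.


Step 3: **Train the model**

Training uses three simple APIs: the forward pass (calling the engine as a callable object), the backward pass (backward), and the weight update (step).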

```
for step, batch in enumerate(data_loader):
    #forward() method
    loss = model_engine(batch)

    #runs backpropagation
    model_engine.backward(loss)

    #weight update
    model_engine.step()
```

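- Gradient Averaging: in distributed data-parallel training, backward ensures that gradients are averaged across the data-parallel processes after training on a train_batch_size.
- Loss Scaling: in FP16/mixed-precision training, the DeepSpeed engine automatically handles loss scaling to avoid precision loss in the gradients.
- Learning Rate Scheduler: when a DeepSpeed learning rate scheduler is used (specified in ds_config.json), DeepSpeed calls the scheduler's step() method at every training step (when model_engine.step() is executed). When a DeepSpeed learning rate scheduler is not used:
  -  If the schedule should run at every training step, the user can pass the scheduler to deepspeed.initialize when initializing the DeepSpeed engine and let DeepSpeed manage it for updates and for save/restore.
  -  If the schedule should run at any other interval (for example, per training epoch), the user should not pass the scheduler to DeepSpeed during initialization and must manage it explicitly.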





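## Multi-node environment variables

When training across multiple nodes, we find it useful to support propagating user-defined environment variables.

By default, DeepSpeed propagates all NCCL- and PYTHON-related environment variables that are set.

To propagate additional variables, specify them in a file named .deepspeed_env, which contains a newline-delimited list of VAR=VAL entries.

The DeepSpeed launcher looks for this file in the local path you launch from and in your home directory (~/).

As a concrete example, some clusters require special NCCL variables to be set before training.

The user can simply add these variables to a `.deepspeed_env` file in their home directory, which looks like this: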
```
NCCL_IB_DISABLE=1
NCCL_SOCKET_IFNAME=eth0
```
DeepSpeed then makes sure these environment variables are set when launching each process on every node of the training job.
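
For readers who want to inspect such a file themselves, the sketch below parses newline-delimited VAR=VAL entries into a dictionary. It is a hypothetical helper written for illustration and is not DeepSpeed's own parsing code; in particular, skipping blank or malformed lines is an assumption of this example.

```python
def parse_env_file(text):
    """Parse newline-delimited VAR=VAL entries into a dict (hypothetical helper)."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # assumption: ignore blank or malformed lines
        key, value = line.split("=", 1)  # split on the first '=' only
        env[key.strip()] = value.strip()
    return env

print(parse_env_file("NCCL_IB_DISABLE=1\nNCCL_SOCKET_IFNAME=eth0\n"))
# {'NCCL_IB_DISABLE': '1', 'NCCL_SOCKET_IFNAME': 'eth0'}
```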


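## MPI compatibility

As described above, DeepSpeed provides its own parallel launcher to help start multi-node/multi-GPU training jobs. If you prefer to launch training jobs with MPI (for example, mpirun), that is supported as well.

Note that DeepSpeed still uses the torch distributed NCCL backend, not the MPI backend.

To launch a training job with mpirun + DeepSpeed (with mpirun as the launcher backend), you only need to install the mpi4py Python package. DeepSpeed uses it to discover the MPI environment and pass the necessary state (world size, rank, and so on) to the torch distributed backend.

If you use model parallelism or pipeline parallelism, or otherwise need torch.distributed calls before calling deepspeed.initialize(..), DeepSpeed provides an additional API call that supports the same MPI setup. Replace your initial torch.distributed.init_process_group(..) call with deepspeed.init_distributed().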


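## Resource configuration (single node)

When running on a single node (with one or more GPUs), DeepSpeed does not need a hostfile as described above. If no hostfile is detected or passed, DeepSpeed queries the number of GPUs on the local machine to discover the number of available slots. The --include and --exclude arguments work as usual, but the user should specify "localhost" as the hostname.

Also note that CUDA_VISIBLE_DEVICES cannot be used with DeepSpeed to control which devices are used.

For example, to use only gpu1 of the current node: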
```
deepspeed --include localhost:1 ...
```







================================================
FILE: ai-framework/deepspeed/2.安装DeepSpeed.md
================================================

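## Installing DeepSpeed
Installing with pip is the quickest way to get started with DeepSpeed. It installs the latest DeepSpeed release, which is not tied to a specific PyTorch or CUDA version. DeepSpeed includes several C++/CUDA extensions, commonly referred to as "ops". By default, all of these extensions/ops are built just in time (JIT) using torch's JIT C++ extension loader (https://pytorch.org/docs/stable/cpp_extension.html), which relies on ninja to build and dynamically link them at runtime.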

```
pip install deepspeed
```

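After installing DeepSpeed, you can view the DeepSpeed environment report with ds_report or python -m deepspeed.env_report to validate the installation and see which ops your machine is compatible with. This report is useful when debugging DeepSpeed installation or compatibility issues.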


```
ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/guodong.li/virtual-venv/llama-venv-py310-cu117/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/guodong.li/virtual-venv/llama-venv-py310-cu117/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
```


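## Pre-installing DeepSpeed ops

> Note: PyTorch must be installed before pre-compiling any DeepSpeed C++/CUDA ops. Pre-compilation is not required if you use the default JIT compilation mode for ops.


Sometimes it is useful to pre-install some or all DeepSpeed C++/CUDA ops instead of using the JIT compilation path. To support pre-installation, build environment flags turn the building of specific ops on or off.

You can tell the installer (install.sh or pip install) to attempt to install all ops by setting the DS_BUILD_OPS environment variable to 1, for example:

`DS_BUILD_OPS=1 pip install deepspeed`

DeepSpeed only installs ops that are compatible with your machine. For more detail on system compatibility, try the ds_report tool described above.

If you want to install only a specific op (for example, FusedLamb), you can toggle it at install time with the DS_BUILD environment variables. For example, to install DeepSpeed with only the FusedLamb op, use: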
```
DS_BUILD_FUSED_LAMB=1 pip install deepspeed
```

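The available DS_BUILD options include:
```
DS_BUILD_OPS toggles all ops
DS_BUILD_CPU_ADAM builds the CPUAdam op
DS_BUILD_FUSED_ADAM builds the FusedAdam op (from apex)
DS_BUILD_FUSED_LAMB builds the FusedLamb op
DS_BUILD_SPARSE_ATTN builds the sparse attention op
DS_BUILD_TRANSFORMER builds the transformer op
DS_BUILD_TRANSFORMER_INFERENCE builds the transformer-inference op
DS_BUILD_STOCHASTIC_TRANSFORMER builds the stochastic transformer op
DS_BUILD_UTILS builds various optimized utilities
DS_BUILD_AIO builds the asynchronous (NVMe) I/O op
```
To speed up the build-all process, you can compile in parallel:

`DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext" --global-option="-j8"`

This should make the full build 2-3 times faster. Adjust -j to set how many CPU cores are used during the build; in this example it is set to 8 cores.

You can also build a binary wheel and install it on multiple machines that have the same type of GPU and the same software environment (CUDA toolkit, PyTorch, Python, and so on).

`DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel`

This creates a binary wheel under the dist directory, for example dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl, which you can then install directly on multiple machines. In our example: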

```
pip install dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl
```

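## Installing DeepSpeed from source
After cloning the DeepSpeed repository from GitHub, you can install DeepSpeed in JIT mode via pip (see below). Because no C++/CUDA sources are compiled, this installation should finish quickly.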


```
pip install .
```

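For installs spanning multiple nodes, it is useful to install DeepSpeed with the install.sh script from the GitHub repository (https://github.com/microsoft/DeepSpeed/blob/master/install.sh). It builds a Python wheel locally and copies it to all nodes listed in your hostfile (given via --hostfile, or /job/hostfile by default).

The first time code that uses DeepSpeed runs, it automatically builds only the CUDA extensions that the run requires and places them under ~/.cache/torch_extensions/ by default. The next time the same program runs, these precompiled extensions are loaded from that directory.

This can cause problems if you use multiple virtual environments: by default there is only one torch_extensions directory, but different virtual environments may use different setups (for example, different Python or CUDA versions), and loading a CUDA extension built by another environment then fails. If needed, you can override the default location with the TORCH_EXTENSIONS_DIR environment variable. Pointing it at a unique directory in each virtual environment makes DeepSpeed save and load CUDA extensions there.

You can also change it for a specific run: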

```
TORCH_EXTENSIONS_DIR=./torch-extensions deepspeed ...
```



## Building for the correct architectures
If you hit the following error when running DeepSpeed:
```
RuntimeError: CUDA error: no kernel image is available for execution on the device
```
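it means the CUDA extensions were not built for the card you are trying to use.

When building from source, DeepSpeed tries to support a wide range of architectures, but in JIT mode it only supports the architectures visible at build time.

You can build specifically for a desired set of architectures by setting the TORCH_CUDA_ARCH_LIST environment variable: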
```
TORCH_CUDA_ARCH_LIST="6.1;7.5;8.6" pip install ...
```

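Building for fewer architectures also makes the build faster.

This is also recommended to ensure your exact architecture is used. For various technical reasons, a distributed PyTorch binary is not built to fully support every architecture and skips binary-compatible ones, at the potential cost of underutilizing your card's full compute capability. To see which architectures are included in a DeepSpeed source build, save the build log and search for the -gencode arguments.

The full list of NVIDIA GPUs and their compute capabilities is available here (https://developer.nvidia.com/cuda-gpus).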



## CUDA version mismatch
If you hit the following error at runtime:

```
Exception: >- DeepSpeed Op Builder: Installed CUDA version {VERSION} does not match the version torch was compiled with {VERSION}, unable to compile cuda/cpp extensions without a matching cuda version.
```

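The installed CUDA version does not match the CUDA version torch was compiled with. Only the major version needs to match (for example, 11.1 and 11.8 are fine). A major-version mismatch, however, can lead to unexpected behavior and errors.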
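
The major-version rule can be expressed as a one-line comparison. The helper below is a hypothetical illustration of that rule, not the check DeepSpeed's op builder actually runs, and it assumes both versions are given as dotted strings such as "11.8".

```python
def cuda_major_matches(installed_cuda, torch_cuda):
    """True when two dotted CUDA version strings share the same major version."""
    return installed_cuda.split(".")[0] == torch_cuda.split(".")[0]

print(cuda_major_matches("11.1", "11.8"))  # True
print(cuda_major_matches("11.8", "12.1"))  # False
```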

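The simplest fix is to change the installed CUDA version (check with nvcc --version) or to update the torch version to match the installed CUDA version (check with python3 -c "import torch; print(torch.__version__)").

To skip this check and continue with mismatched CUDA versions, use the following environment variable: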

```
DS_SKIP_CUDA_CHECK=1
```

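## Feature-specific dependencies

Some DeepSpeed features require specific dependencies beyond DeepSpeed's general dependencies.

For the Python package dependencies of each feature/op, see the requirements directory (https://github.com/microsoft/DeepSpeed/tree/master/requirements).

System-level dependencies are kept to a minimum, but some features require special system-level packages. Check the ds_report output to see whether you are missing a system-level package for a given feature.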











================================================
FILE: ai-framework/deepspeed/3.基于CIFAR-10使用DeepSpeed进行分布式训练 .md
================================================
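In this tutorial we add DeepSpeed to the CIFAR-10 model, a small image classification model.

First we show how to run the original CIFAR-10 model. Then we enable it to run with DeepSpeed step by step.

## Running the original CIFAR-10
The original model code for the CIFAR-10 tutorial is at (https://github.com/pytorch/tutorials/blob/main/beginner_source/blitz/cifar10_tutorial.py). It has been copied to DeepSpeedExamples/training/cifar/ (https://github.com/microsoft/DeepSpeedExamples/tree/master/training/cifar) and is available as a submodule. To download it, run: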

```
git clone git@github.com:microsoft/DeepSpeedExamples.git
```

Install the requirements for the CIFAR-10 model:
```
cd DeepSpeedExamples/training/cifar
pip install -r requirements.txt
```

Run python cifar10_tutorial.py, which downloads the training dataset on the first run.
```
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
170500096it [00:02, 61124868.24it/s]
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
  cat  frog  frog  frog
[1,  2000] loss: 2.170
[1,  4000] loss: 1.879
[1,  6000] loss: 1.690
[1,  8000] loss: 1.591
[1, 10000] loss: 1.545
[1, 12000] loss: 1.467
[2,  2000] loss: 1.377
[2,  4000] loss: 1.374
[2,  6000] loss: 1.363
[2,  8000] loss: 1.322
[2, 10000] loss: 1.295
[2, 12000] loss: 1.287
Finished Training
GroundTruth:    cat  ship  ship plane
Predicted:    cat  ship plane plane
Accuracy of the network on the 10000 test images: 53 %
Accuracy of plane : 69 %
Accuracy of   car : 59 %
Accuracy of  bird : 56 %
Accuracy of   cat : 36 %
Accuracy of  deer : 37 %
Accuracy of   dog : 26 %
Accuracy of  frog : 70 %
Accuracy of horse : 61 %
Accuracy of  ship : 51 %
Accuracy of truck : 63 %
cuda:0
```



## Enabling DeepSpeed

### Argument parsing

The first step in enabling DeepSpeed is to add DeepSpeed arguments to the CIFAR-10 model, using the deepspeed.add_config_arguments() function as follows:
```
import argparse
import deepspeed

def add_argument():

     parser=argparse.ArgumentParser(description='CIFAR')

     # Data.
     # Cuda.
     parser.add_argument('--with_cuda', default=False, action='store_true',
                         help='use CPU in case there\'s no GPU support')
     parser.add_argument('--use_ema', default=False, action='store_true',
                         help='whether use exponential moving average')

     # Train.
     parser.add_argument('-b', '--batch_size', default=32, type=int,
                         help='mini-batch size (default: 32)')
     parser.add_argument('-e', '--epochs', default=30, type=int,
                         help='number of total epochs (default: 30)')
     parser.add_argument('--local_rank', type=int, default=-1,
                        help='local rank passed from distributed launcher')

     # Include DeepSpeed configuration arguments.
     parser = deepspeed.add_config_arguments(parser)

     args=parser.parse_args()

     return args
```

### Initialization

We use deepspeed.initialize to create model_engine, optimizer, and trainloader. deepspeed.initialize is defined as follows:
```
def initialize(args,
               model,
               optimizer=None,
               model_parameters=None,
               training_data=None,
               lr_scheduler=None,
               mpu=None,
               dist_init_required=True,
               collate_fn=None):
```

Here we initialize DeepSpeed with the CIFAR-10 model (`net`), `args`, `parameters`, and `trainset`:

```
parameters = filter(lambda p: p.requires_grad, net.parameters())
args=add_argument()

# Initialize DeepSpeed to use the following features
# 1) Distributed model.
# 2) Distributed data loader.
# 3) DeepSpeed optimizer.
model_engine, optimizer, trainloader, _ = deepspeed.initialize(args=args, model=net, model_parameters=parameters, training_data=trainset)
```

After initializing DeepSpeed, remove the original device and optimizer setup:

```
#from deepspeed.accelerator import get_accelerator
#device = torch.device(get_accelerator().device_name(0) if get_accelerator().is_available() else "cpu")
#net.to(device)

#optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

```


### Training API

The model returned by `deepspeed.initialize` is the DeepSpeed model engine, which we use to train the model through its forward, backward, and step APIs.
```
for i, data in enumerate(trainloader):
         # Get the inputs; data is a list of [inputs, labels].
         inputs = data[0].to(model_engine.device)
         labels = data[1].to(model_engine.device)

         outputs = model_engine(inputs)
         loss = criterion(outputs, labels)

         model_engine.backward(loss)
         model_engine.step()
```
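After the weights are updated with a mini-batch, DeepSpeed zeroes the gradients automatically.

### Configuration

The next step in using DeepSpeed is to create a configuration JSON file (ds_config.json). This file provides user-defined, DeepSpeed-specific parameters such as the batch size, optimizer, scheduler, and other settings.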
```
{
   "train_batch_size": 4,
   "steps_per_print": 2000,
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.001,
       "betas": [
         0.8,
         0.999
       ],
       "eps": 1e-8,
       "weight_decay": 3e-7
     }
   },
   "scheduler": {
     "type": "WarmupLR",
     "params": {
       "warmup_min_lr": 0,
       "warmup_max_lr": 0.001,
       "warmup_num_steps": 1000
     }
   },
   "wall_clock_breakdown": false
 }
```
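The same configuration can also be generated from a Python dictionary, which is convenient when values are computed by a script. The following is a minimal sketch using only the standard library; `write_config` is a hypothetical helper, and the temporary directory is used only so the example does not overwrite an existing file.

```python
import json
import os
import tempfile

# Mirror of the JSON file shown above, expressed as a Python dict.
ds_config = {
    "train_batch_size": 4,
    "steps_per_print": 2000,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7,
        },
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 1000,
        },
    },
    "wall_clock_breakdown": False,
}


def write_config(config: dict, path: str) -> str:
    """Serialize the config to `path` as indented JSON and return the path."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return path


config_path = write_config(ds_config, os.path.join(tempfile.mkdtemp(), "ds_config.json"))

# Read the file back to confirm it round-trips unchanged.
with open(config_path) as f:
    loaded = json.load(f)
print(loaded["optimizer"]["type"], loaded["scheduler"]["type"])
```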

### Running the DeepSpeed-Enabled CIFAR-10 Model

To start training the DeepSpeed-enabled CIFAR-10 model, run the following command. By default it uses all detected GPUs.
```
deepspeed cifar10_deepspeed.py --deepspeed_config ds_config.json
```

DeepSpeed usually prints more training details for the user to monitor, including training settings, performance statistics, and loss trends.
```
deepspeed.pt cifar10_deepspeed.py --deepspeed_config ds_config.json
Warning: Permanently added '[192.168.0.22]:42227' (ECDSA) to the list of known hosts.
cmd=['pdsh', '-w', 'worker-0', 'export NCCL_VERSION=2.4.2; ', 'cd /data/users/deepscale/test/ds_v2/examples/cifar;', '/usr/bin/python', '-u', '-m', 'deepspeed.pt.deepspeed_launch', '--world_info=eyJ3b3JrZXItMCI6IFswXX0=', '--node_rank=%n', '--master_addr=192.168.0.22', '--master_port=29500', 'cifar10_deepspeed.py', '--deepspeed', '--deepspeed_config', 'ds_config.json']
worker-0: Warning: Permanently added '[192.168.0.22]:42227' (ECDSA) to the list of known hosts.
worker-0: 0 NCCL_VERSION 2.4.2
worker-0: WORLD INFO DICT: {'worker-0': [0]}
worker-0: nnodes=1, num_local_procs=1, node_rank=0
worker-0: global_rank_mapping=defaultdict(<class 'list'>, {'worker-0': [0]})
worker-0: dist_world_size=1
worker-0: Setting CUDA_VISIBLE_DEVICES=0
worker-0: Files already downloaded and verified
worker-0: Files already downloaded and verified
worker-0:  bird   car horse  ship
worker-0: DeepSpeed info: version=2.1, git-hash=fa937e7, git-branch=master
worker-0: [INFO 2020-02-06 19:53:49] Set device to local rank 0 within node.
worker-0: 1 1
worker-0: [INFO 2020-02-06 19:53:56] Using DeepSpeed Optimizer param name adam as basic optimizer
worker-0: [INFO 2020-02-06 19:53:56] DeepSpeed Basic Optimizer = FusedAdam (
worker-0: Parameter Group 0
worker-0:     betas: [0.8, 0.999]
worker-0:     bias_correction: True
worker-0:     eps: 1e-08
worker-0:     lr: 0.001
worker-0:     max_grad_norm: 0.0
worker-0:     weight_decay: 3e-07
worker-0: )
worker-0: [INFO 2020-02-06 19:53:56] DeepSpeed using configured LR scheduler = WarmupLR
worker-0: [INFO 2020-02-06 19:53:56] DeepSpeed LR Scheduler = <deepspeed.pt.deepspeed_lr_schedules.WarmupLR object at 0x7f64c4c09c18>
worker-0: [INFO 2020-02-06 19:53:56] rank:0 step=0, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
worker-0: DeepSpeedLight configuration:
worker-0:   allgather_size ............... 500000000
worker-0:   allreduce_always_fp32 ........ False
worker-0:   disable_allgather ............ False
worker-0:   dump_state ................... False
worker-0:   dynamic_loss_scale_args ...... None
worker-0:   fp16_enabled ................. False
worker-0:   global_rank .................. 0
worker-0:   gradient_accumulation_steps .. 1
worker-0:   gradient_clipping ............ 0.0
worker-0:   initial_dynamic_scale ........ 4294967296
worker-0:   loss_scale ................... 0
worker-0:   optimizer_name ............... adam
worker-0:   optimizer_params ............. {'lr': 0.001, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
worker-0:   prescale_gradients ........... False
worker-0:   scheduler_name ............... WarmupLR
worker-0:   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 0.001, 'warmup_num_steps': 1000}
worker-0:   sparse_gradients_enabled ..... False
worker-0:   steps_per_print .............. 2000
worker-0:   tensorboard_enabled .......... False
worker-0:   tensorboard_job_name ......... DeepSpeedJobName
worker-0:   tensorboard_output_path ......
worker-0:   train_batch_size ............. 4
worker-0:   train_micro_batch_size_per_gpu  4
worker-0:   wall_clock_breakdown ......... False
worker-0:   world_size ................... 1
worker-0:   zero_enabled ................. False
worker-0:   json = {
worker-0:     "optimizer":{
worker-0:         "params":{
worker-0:             "betas":[
worker-0:                 0.8,
worker-0:                 0.999
worker-0:             ],
worker-0:             "eps":1e-08,
worker-0:             "lr":0.001,
worker-0:             "weight_decay":3e-07
worker-0:         },
worker-0:         "type":"Adam"
worker-0:     },
worker-0:     "scheduler":{
worker-0:         "params":{
worker-0:             "warmup_max_lr":0.001,
worker-0:             "warmup_min_lr":0,
worker-0:             "warmup_num_steps":1000
worker-0:         },
worker-0:         "type":"WarmupLR"
worker-0:     },
worker-0:     "steps_per_print":2000,
worker-0:     "train_batch_size":4,
worker-0:     "wall_clock_breakdown":false
worker-0: }
worker-0: [INFO 2020-02-06 19:53:56] 0/50, SamplesPerSec=1292.6411179579866
worker-0: [INFO 2020-02-06 19:53:56] 0/100, SamplesPerSec=1303.6726433398537
worker-0: [INFO 2020-02-06 19:53:56] 0/150, SamplesPerSec=1304.4251022567403

......

worker-0: [2, 12000] loss: 1.247
worker-0: [INFO 2020-02-06 20:35:23] 0/24550, SamplesPerSec=1284.4954513975558
worker-0: [INFO 2020-02-06 20:35:23] 0/24600, SamplesPerSec=1284.384033658866
worker-0: [INFO 2020-02-06 20:35:23] 0/24650, SamplesPerSec=1284.4433482972925
worker-0: [INFO 2020-02-06 20:35:23] 0/24700, SamplesPerSec=1284.4664449792422
worker-0: [INFO 2020-02-06 20:35:23] 0/24750, SamplesPerSec=1284.4950124403447
worker-0: [INFO 2020-02-06 20:35:23] 0/24800, SamplesPerSec=1284.4756105952233
worker-0: [INFO 2020-02-06 20:35:24] 0/24850, SamplesPerSec=1284.5251526215386
worker-0: [INFO 2020-02-06 20:35:24] 0/24900, SamplesPerSec=1284.531217073863
worker-0: [INFO 2020-02-06 20:35:24] 0/24950, SamplesPerSec=1284.5125323220368
worker-0: [INFO 2020-02-06 20:35:24] 0/25000, SamplesPerSec=1284.5698818883018
worker-0: Finished Training
worker-0: GroundTruth:    cat  ship  ship plane
worker-0: Predicted:    cat   car   car plane
worker-0: Accuracy of the network on the 10000 test images: 57 %
worker-0: Accuracy of plane : 61 %
worker-0: Accuracy of   car : 74 %
worker-0: Accuracy of  bird : 49 %
worker-0: Accuracy of   cat : 36 %
worker-0: Accuracy of  deer : 44 %
worker-0: Accuracy of   dog : 52 %
worker-0: Accuracy of  frog : 67 %
worker-0: Accuracy of horse : 58 %
worker-0: Accuracy of  ship : 70 %
worker-0: Accuracy of truck : 59 %
```

> Note: you can run the model on a single GPU with an option such as `--include localhost:1`. In addition, `--num_gpus` specifies how many GPUs to use.


================================================
FILE: ai-framework/deepspeed/DeepSpeed配置JSON文件.md
================================================
## DeepSpeed Configuration JSON

Link: https://www.deepspeed.ai/docs/config-json/



### ZeRO Optimizations for FP16 Training

Enables and configures ZeRO memory optimizations.



- stage3_gather_16bit_weights_on_model_save: [boolean]

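> Consolidates the weights before the model is saved by `save_16bit_model()`. Because the weights are partitioned across GPUs, they are not part of `state_dict`, so when this option is enabled the function automatically gathers the weights and then saves the fp16 model weights.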





================================================
FILE: ai-framework/deepspeed/README.md
================================================



- https://github.com/microsoft/DeepSpeedExamples
- https://github.com/microsoft/DeepSpeedExamples.git








================================================
FILE: ai-framework/deepspeed/config-json/README.md
================================================
- https://www.deepspeed.ai/docs/config-json/


## Batch Size Related Parameters


`train_batch_size` must equal `train_micro_batch_size_per_gpu` * `gradient_accumulation_steps` * the number of GPUs.
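The relationship between the three batch-size parameters can be checked before launching a job. Below is a minimal sketch; the helper names `expected_train_batch_size` and `check_batch_config` are hypothetical and not part of DeepSpeed.

```python
def expected_train_batch_size(micro_batch_per_gpu: int,
                              gradient_accumulation_steps: int,
                              num_gpus: int) -> int:
    """Global batch size implied by the per-GPU settings."""
    return micro_batch_per_gpu * gradient_accumulation_steps * num_gpus


def check_batch_config(config: dict, num_gpus: int) -> None:
    """Raise ValueError if the three batch-size settings are inconsistent."""
    expected = expected_train_batch_size(
        config["train_micro_batch_size_per_gpu"],
        config["gradient_accumulation_steps"],
        num_gpus,
    )
    if config["train_batch_size"] != expected:
        raise ValueError(
            f"train_batch_size={config['train_batch_size']} but "
            f"micro_batch * accumulation_steps * gpus = {expected}"
        )


# Single-GPU case: 4 = 4 * 1 * 1, so no error is raised.
check_batch_config(
    {
        "train_batch_size": 4,
        "train_micro_batch_size_per_gpu": 4,
        "gradient_accumulation_steps": 1,
    },
    num_gpus=1,
)
```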



### train_batch_size


### train_micro_batch_size_per_gpu


### gradient_accumulation_steps

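The number of training steps over which gradients are accumulated before they are averaged and applied.

This feature is sometimes useful for improving scalability, because it reduces the frequency of gradient communication between steps.

Another effect of this feature is the ability to train with a larger batch size per GPU.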
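To illustrate why accumulating gradients is equivalent to using a larger batch, here is a pure-Python sketch on a one-parameter model `y = w * x` with mean squared error. It assumes the batch divides evenly into micro-batches; the function names are hypothetical and no DeepSpeed API is involved.

```python
def mse_grad(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list, ys: list, micro_batch_size: int) -> float:
    """Average the gradients of equally sized micro-batches."""
    steps = len(xs) // micro_batch_size  # gradient accumulation steps
    total = 0.0
    for s in range(steps):
        lo = s * micro_batch_size
        hi = lo + micro_batch_size
        total += mse_grad(w, xs[lo:hi], ys[lo:hi])
    return total / steps


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = mse_grad(w, xs, ys)                 # one batch of 4 samples
accum = accumulated_grad(w, xs, ys, 2)     # two micro-batches of 2 samples
print(full, accum)
```

Both values agree, so the weight update after two accumulated micro-batches matches the update from a single batch twice as large, while only one micro-batch has to fit in memory at a time.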



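## Optimizer Parameters

- type: the optimizer name. DeepSpeed natively supports the Adam, AdamW, OneBitAdam, Lamb, and OneBitLamb optimizers, and other optimizers can also be imported from torch.
   - https://deepspeed.readthedocs.io/en/latest/optimizers.html#optimizers
   - https://pytorch.org/docs/stable/optim.html
- params: a dictionary of parameters used to instantiate the optimizer. The parameter names must match the optimizer constructor signature (for example, that of Adam).
   - https://pytorch.org/docs/stable/optim.html#algorithms
   - https://pytorch.org/docs/stable/generated/torch.optim.Adam.html

Adam optimizer example: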

```
"optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "betas": [
        0.8,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  }
```

Parameters:

- torch_adam: use torch's implementation of Adam instead of DeepSpeed's fused Adam implementation. Defaults to false.
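Because the names under `params` have to match the optimizer constructor, a misspelled key is a common source of errors. The sketch below shows one way to detect such keys with `inspect.signature`. `ToyAdam` is a hypothetical class defined only for this example so that it runs without torch; with torch installed, the same check could be pointed at a real optimizer class.

```python
import inspect


class ToyAdam:
    """Hypothetical optimizer, used only to demonstrate the check."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
        self.params = params
        self.lr = lr
        self.betas = betas
        self.eps = eps
        self.weight_decay = weight_decay


def unknown_params(optimizer_cls, config_params: dict) -> list:
    """Return the config keys that the constructor does not accept."""
    accepted = set(inspect.signature(optimizer_cls.__init__).parameters) - {"self"}
    return sorted(name for name in config_params if name not in accepted)


good = {"lr": 0.001, "betas": [0.8, 0.999], "eps": 1e-8, "weight_decay": 3e-7}
bad = {"learning_rate": 0.001, "eps": 1e-8}

print(unknown_params(ToyAdam, good))
print(unknown_params(ToyAdam, bad))
```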
 




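## Scheduler Parameters

When `model_engine.step()` executes, DeepSpeed calls the scheduler's `step()` method at every training step.

- type: the learning rate scheduler name. DeepSpeed provides implementations of the LRRangeTest, OneCycle, WarmupLR, and WarmupDecayLR learning rate schedulers.
   - https://deepspeed.readthedocs.io/en/latest/schedulers.html
- params: a dictionary of parameters used to instantiate the scheduler. The parameter names should match the scheduler constructor signature.

Scheduler example: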

```
 "scheduler": {
      "type": "WarmupLR",
      "params": {
          "warmup_min_lr": 0,
          "warmup_max_lr": 0.001,
          "warmup_num_steps": 1000
      }
  }
```
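As a rough illustration of what the three warmup parameters mean, the sketch below ramps the learning rate from `warmup_min_lr` to `warmup_max_lr` over `warmup_num_steps` steps and then holds it constant. The linear ramp is an assumption made for simplicity; DeepSpeed's own WarmupLR may follow a different curve, and `linear_warmup_lr` is a hypothetical function name.

```python
def linear_warmup_lr(step: int,
                     warmup_min_lr: float,
                     warmup_max_lr: float,
                     warmup_num_steps: int) -> float:
    """Learning rate at `step` under a linear warmup (illustrative assumption)."""
    if step >= warmup_num_steps:
        # Warmup finished: stay at the maximum learning rate.
        return warmup_max_lr
    fraction = step / warmup_num_steps
    return warmup_min_lr + (warmup_max_lr - warmup_min_lr) * fraction


# Parameters taken from the scheduler example above.
for step in (0, 250, 500, 1000, 5000):
    print(step, linear_warmup_lr(step, 0.0, 0.001, 1000))
```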

## Communication Options


### communication_data_type


### prescale_gradients


### gradient_predivide_factor


### sparse_gradients


## FP16 Training Options

- Note: this mode cannot be combined with the amp mode described below.






```
"fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "consecutive_hysteresis": false,
    "min_loss_scale": 1
}
```
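The interplay of `initial_scale_power`, `loss_scale_window`, and `min_loss_scale` can be pictured with a simplified dynamic loss scaler. This is a sketch under stated assumptions, not DeepSpeed's implementation: it assumes that the scale starts at 2 to the power of `initial_scale_power`, is halved on overflow (never dropping below `min_loss_scale`), and is doubled after `loss_scale_window` consecutive overflow-free steps. The `hysteresis` settings are deliberately ignored, and the class name is hypothetical.

```python
class DynamicLossScalerSketch:
    """Simplified dynamic loss scaling (hysteresis is not modeled)."""

    def __init__(self, initial_scale_power: int = 16,
                 loss_scale_window: int = 1000,
                 min_loss_scale: float = 1.0) -> None:
        self.scale = 2.0 ** initial_scale_power
        self.window = loss_scale_window
        self.min_scale = min_loss_scale
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, overflow: bool) -> bool:
        """Update the scale; return True if the optimizer step should be applied."""
        if overflow:
            # Gradients overflowed: shrink the scale and skip this step.
            self.scale = max(self.scale / 2.0, self.min_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps == self.window:
            # A full window without overflow: try a larger scale.
            self.scale *= 2.0
            self.good_steps = 0
        return True


# A short window keeps the demonstration small.
scaler = DynamicLossScalerSketch(initial_scale_power=16, loss_scale_window=2)
for overflow in (True, False, False):
    applied = scaler.update(overflow)
    print(overflow, applied, scaler.scale)
```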





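## BFLOAT16 Training Options

- Note: this mode cannot be combined with the amp mode described below.
- Note: this mode cannot be combined with the fp16 mode described above.

Uses the bfloat16 floating-point format as an alternative to FP16.

BFLOAT16 requires hardware support (for example, NVIDIA A100).

Training with bfloat16 does not require loss scaling.

An example is shown below.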

```
"bf16": {
   "enabled": true
 }
```



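## Automatic Mixed Precision (AMP) Training Options

Note: this mode cannot be combined with the fp16 mode described above. In addition, this mode is currently not compatible with ZeRO.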

```
"amp": {
    "enabled": true,
    ...
    "opt_level": "O1",
    ...
}
```


## Gradient Clipping

- gradient_clipping





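## ZeRO Optimizations for FP16 Training


## Parameter Offloading


Enables and configures ZeRO optimization that offloads parameters to CPU/NVMe. Available only with ZeRO stage 3.

- Note: if the value of "device" is not specified or not supported, an assertion is triggered.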

```
 "offload_param": {
    "device": "[cpu|nvme]",
    "nvme_path": "/local_nvme",
    "pin_memory": [true|false],
    "buffer_count": 5,
    "buffer_size": 1e8,
    "max_in_cpu": 1e9
  }
```

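## Optimizer Offloading

Enables and configures ZeRO optimization that offloads optimizer computation to CPU and optimizer state to CPU/NVMe.

CPU offloading is available with ZeRO stages 1, 2, and 3. NVMe offloading is available only with ZeRO stage 3.

- Note: if the value of "device" is not specified or not supported, an assertion is triggered.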


```
 "offload_optimizer": {
    "device": "[cpu|nvme]",
    "nvme_path": "/local_nvme",
    "pin_memory": [true|false],
    "buffer_count": 4,
    "fast_init": false
  }
```
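The stage restrictions stated for parameter and optimizer offloading can be captured in a small pre-flight check. The sketch below is a hypothetical helper (`validate_offload` is not a DeepSpeed function); it treats `None` as "this offload section is absent" and returns a list of problems instead of asserting.

```python
from typing import List, Optional

SUPPORTED_DEVICES = {"cpu", "nvme"}


def validate_offload(zero_stage: int,
                     offload_param_device: Optional[str] = None,
                     offload_optimizer_device: Optional[str] = None) -> List[str]:
    """Return human-readable problems; an empty list means the combination is allowed."""
    errors = []
    if offload_param_device is not None:
        if offload_param_device not in SUPPORTED_DEVICES:
            errors.append(f"unsupported offload_param device: {offload_param_device}")
        elif zero_stage != 3:
            errors.append("parameter offloading requires ZeRO stage 3")
    if offload_optimizer_device is not None:
        if offload_optimizer_device not in SUPPORTED_DEVICES:
            errors.append(f"unsupported offload_optimizer device: {offload_optimizer_device}")
        elif offload_optimizer_device == "cpu" and zero_stage not in (1, 2, 3):
            errors.append("CPU optimizer offloading requires ZeRO stage 1, 2, or 3")
        elif offload_optimizer_device == "nvme" and zero_stage != 3:
            errors.append("NVMe optimizer offloading requires ZeRO stage 3")
    return errors


print(validate_offload(3, "nvme", "nvme"))   # allowed combination
print(validate_offload(2, None, "nvme"))     # NVMe needs stage 3
```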

## Activation Checkpointing

```
"activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
    }
```


## Sparse Attention

```
"sparse_attention": {
 "mode": "fixed",
 "block": 16,
 "different_layout_per_head": true,
 "num_local_blocks": 4,
 "num_global_blocks": 1,
 "attention": "bidirectional",
 "horizontal_global_attention": false,
 "num_different_global_patterns": 4,
 "num_random_blocks": 0,
 "local_window_blocks": [4],
 "global_block_indices": [0],
 "global_block_end_indices": null,
 "num_sliding_window_blocks": 3
}
```


## Logging

### steps_per_print


### wall_clock_breakdown


### dump_state





## Flops Profiler

- detailed: whether to print the detailed model profile.
- output_file: path to the output file. If not set, the profiler prints to stdout.


```
{
  "flops_profiler": {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
}
```


## Monitoring Module (TensorBoard, WandB, CSV)


### tensorboard

TensorBoard configuration example:
```
"tensorboard": {
    "enabled": true,
    "output_path": "output/ds_logs/",
    "job_name": "train_bert"
}
```





## Compression

### Layer Reduction

### Weight Quantization


### Activation Quantization

### Sparse Pruning

### Head Pruning



### Channel Pruning


## Checkpoint Options

```
"checkpoint": {
    "tag_validation": "Warn",
    "load_universal": false,
    "use_node_local_storage": false,
    "parallel_write":{
        "pipeline_stage": false
    }
}
```



## Data Type Options

```
"data_types": {
    "grad_accum_dtype": ["fp32"|"fp16"|"bf16"]
}
```



## Data Efficiency









================================================
FILE: ai-framework/deepspeed/config-json/deepspeed-nvme.md
================================================






- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

```

```




================================================
FILE: ai-framework/deepspeed/deepspeed-slurm.md
================================================






## Supported Launchers

PDSH_LAUNCHER = 'pdsh'
PDSH_MAX_FAN_OUT = 1024

OPENMPI_LAUNCHER = 'openmpi'
MPICH_LAUNCHER = 'mpich'
IMPI_LAUNCHER = 'impi'
SLURM_LAUNCHER = 'slurm'
MVAPICH_LAUNCHER = 'mvapich'






## Slurm

- https://hpclib.com/Scheduler/Slurm/mpi_guide.html
- https://slurm.schedmd.com/mpi_guide.html







================================================
FILE: ai-framework/deepspeed/hello_bert/README.md
================================================

# HelloDeepSpeed


- Source: https://github.com/microsoft/DeepSpeedExamples/tree/master/training/HelloDeepSpeed


## HF

```
model = create_model(
        num_layers=num_layers,
        num_heads=num_heads,
        ff_dim=ff_dim,
        h_dim=h_dim,
        dropout=dropout,
    )
model.train()

for step, batch in enumerate(data_iterator, start=start_step):
    optimizer.zero_grad()
    # Forward pass
    loss = model(**batch)
    # Backward pass
    loss.backward()
    # Optimizer Step
    optimizer.step()
```


Run command:

```
python train_bert.py --checkpoint_dir ./experiments --local_rank 0
```
Checkpoint files written by the model:
```
tree experiments/
experiments/
└── bert_pretrain.2023.6.13.5.34.39.addjtvxg
    ├── checkpoint.iter_1000.pt
    ├── checkpoint.iter_2000.pt
    ├── checkpoint.iter_3000.pt
    ├── checkpoint.iter_4000.pt
    ├── checkpoint.iter_5000.pt
    ├── checkpoint.iter_6000.pt
    ├── checkpoint.iter_7000.pt
    ├── checkpoint.iter_8000.pt
    ├── checkpoint.iter_9000.pt
    ├── gitdiff.log
    ├── githash.log
    ├── hparams.json
    └── tb_dir
        └── events.out.tfevents.1686659679.ai-app-2-46.54673.0

```


## Deepspeed+HF


```
model = create_model(
        num_layers=num_layers,
        num_heads=num_heads,
        ff_dim=ff_dim,
        h_dim=h_dim,
        dropout=dropout,
    )
model, _, _, _ = deepspeed.initialize(model=model,
                                          model_parameters=model.parameters(),
                                          config=ds_config)
model.train()
for step, batch in enumerate(data_iterator, start=start_step):
    # Forward pass
    loss = model(**batch)
    # Backward pass
    model.backward(loss)
    # Optimizer Step
    model.step()
```



Run commands and the checkpoint files written by the model:

```
# Uses all GPUs on the current server by default
deepspeed train_bert_ds.py --checkpoint_dir ./experiments_ds

tree experiments_ds/
experiments_ds/
└── bert_pretrain.2023.6.13.18.58.44.addjtvxg
    ├── gitdiff.log
    ├── githash.log
    ├── global_step1000
    │   ├── mp_rank_00_model_states.pt
    │   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
    ...
    │   └── zero_pp_rank_7_mp_rank_00_optim_states.pt
    ├── global_step9000
    │   ├── mp_rank_00_model_states.pt
    │   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
    ...
    │   └── zero_pp_rank_7_mp_rank_00_optim_states.pt
    ├── hparams.json
    ├── latest
    ├── tb_dir
    │   └── events.out.tfevents.1686707924.ai-app-2-46.599.0
    └── zero_to_fp32.py



deepspeed --include localhost:2,3,4,5 train_bert_ds.py --checkpoint_dir ./experiments_multigpu --num_iterations=500 --checkpoint_every=250

tree -h ./experiments_multigpu
./experiments_multigpu
├── [  36]  bert_pretrain.2023.6.13.19.37.59.addjtvxg
│   └── [  63]  global_step250
│       └── [ 47M]  zero_pp_rank_3_mp_rank_00_optim_states.pt
└── [ 169]  bert_pretrain.2023.6.13.19.38.0.addjtvxg
    ├── [ 45K]  gitdiff.log
    ├── [  41]  githash.log
    ├── [ 207]  global_step250
    │   ├── [ 31M]  mp_rank_00_model_states.pt
    │   ├── [ 47M]  zero_pp_rank_0_mp_rank_00_optim_states.pt
    │   ├── [ 47M]  zero_pp_rank_1_mp_rank_00_optim_states.pt
    │   └── [ 47M]  zero_pp_rank_2_mp_rank_00_optim_states.pt
    ├── [ 298]  hparams.json
    ├── [  14]  latest
    ├── [  77]  tb_dir
    │   └── [2.4K]  events.out.tfevents.1686710280.ai-app-2-46.14672.0
    └── [ 18K]  zero_to_fp32.py
```
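The directory listings above suggest how a checkpoint can be located programmatically: each experiment directory holds a `latest` file next to the `global_step*` folders. The sketch below assumes that `latest` contains the name of the most recent checkpoint folder (its 14-byte size in the listing is consistent with the text `global_step250`, but this is an inference). The helper names are hypothetical, and the fake layout is built in a temporary directory purely to exercise them.

```python
import tempfile
from pathlib import Path
from typing import List


def latest_checkpoint_dir(experiment_dir: Path) -> Path:
    """Resolve the checkpoint folder named in the `latest` file (assumed format)."""
    tag = (experiment_dir / "latest").read_text().strip()
    return experiment_dir / tag


def optimizer_shards(checkpoint_dir: Path) -> List[str]:
    """List the per-rank optimizer state files in a checkpoint folder."""
    pattern = "zero_pp_rank_*_mp_rank_00_optim_states.pt"
    return sorted(p.name for p in checkpoint_dir.glob(pattern))


# Build a tiny fake layout that mirrors the tree output above.
root = Path(tempfile.mkdtemp())
step_dir = root / "global_step250"
step_dir.mkdir()
(root / "latest").write_text("global_step250")
(step_dir / "mp_rank_00_model_states.pt").touch()
for rank in range(3):
    (step_dir / f"zero_pp_rank_{rank}_mp_rank_00_optim_states.pt").touch()

ckpt = latest_checkpoint_dir(root)
print(ckpt.name, optimizer_shards(ckpt))
```

Counting the shards this way is also a quick check that every rank wrote its optimizer state into the same experiment directory.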













================================================
FILE: ai-framework/deepspeed/hello_bert/train_bert.py
================================================
import datetime
import json
import pathlib
import re
import string
from functools import partial
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, TypeVar, Union

import random
import datasets
import fire
import loguru
import numpy as np
import pytz
import sh
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torch.utils.tensorboard import SummaryWriter
from transformers import AutoTokenizer, PreTrainedTokenizer, PreTrainedTokenizerFast
from transformers.models.roberta import RobertaConfig, RobertaModel
from transformers.models.roberta.modeling_roberta import (
    RobertaLMHead,
    RobertaPreTrainedModel,
)

logger = loguru.logger

######################################################################
############### Dataset Creation Related Functions ###################
######################################################################

TokenizerType = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]


def collate_function(batch: List[Tuple[List[int], List[int]]],
                     pad_token_id: int) -> Dict[str, torch.Tensor]:
    """Collect a list of masked token indices, and labels, and
    batch them, padding to max length in the batch.
    """
    max_length = max(len(token_ids) for token_ids, _ in batch)
    padded_token_ids = [
        token_ids +
        [pad_token_id for _ in range(0, max_length - len(token_ids))]
        for token_ids, _ in batch
    ]
    padded_labels = [
        labels + [pad_token_id for _ in range(0, max_length - len(labels))]
        for _, labels in batch
    ]
    src_tokens = torch.LongTensor(padded_token_ids)
    tgt_tokens = torch.LongTensor(padded_labels)
    attention_mask = src_tokens.ne(pad_token_id).type_as(src_tokens)
    return {
        "src_tokens": src_tokens,
        "tgt_tokens": tgt_tokens,
        "attention_mask": attention_mask,
    }


def masking_function(
        text: str,
        tokenizer: TokenizerType,
        mask_prob: float,
        random_replace_prob: float,
        unmask_replace_prob: float,
        max_length: int,
) -> Tuple[List[int], List[int]]:
    """Given a text string, randomly mask wordpieces for Bert MLM
    training.

    Args:
        text (str):
            The input text
        tokenizer (TokenizerType):
            The tokenizer for tokenization
        mask_prob (float):
            What fraction of tokens to mask
        random_replace_prob (float):
            Of the masked tokens, how many should be replaced with
            random tokens (improves performance)
        unmask_replace_prob (float):
            Of the masked tokens, how many should be replaced with
            the original token (improves performance)
        max_length (int):
            The maximum sequence length to consider. Note that for
            Bert style models, this is a function of the number of
            positional embeddings you learn

    Returns:
        Tuple[List[int], List[int]]:
            The masked token ids (based on the tokenizer passed),
            and the output labels (padded with `tokenizer.pad_token_id`)
    """
    # Note: By default, encode does add the BOS and EOS token
    # Disabling that behaviour to make this more clear
    tokenized_ids = ([tokenizer.bos_token_id] +
                     tokenizer.encode(text,
                                      add_special_tokens=False,
                                      truncation=True,
                                      max_length=max_length - 2) +
                     [tokenizer.eos_token_id])
    seq_len = len(tokenized_ids)
    tokenized_ids = np.array(tokenized_ids)
    subword_mask = np.full(len(tokenized_ids), False)

    # Masking the BOS and EOS token leads to slightly worse performance
    low = 1
    high = len(subword_mask) - 1
    mask_choices = np.arange(low, high)
    num_subwords_to_mask = max(
        int((mask_prob * (high - low)) + np.random.rand()), 1)
    subword_mask[np.random.choice(mask_choices,
                                  num_subwords_to_mask,
                                  replace=False)] = True

    # Create the labels first
    labels = np.full(seq_len, tokenizer.pad_token_id)
    labels[subword_mask] = tokenized_ids[subword_mask]

    tokenized_ids[subword_mask] = tokenizer.mask_token_id

    # Now of the masked tokens, choose how many to replace with random and how many to unmask
    rand_or_unmask_prob = random_replace_prob + unmask_replace_prob
    if rand_or_unmask_prob > 0:
        rand_or_unmask = subword_mask & (np.random.rand(len(tokenized_ids)) <
                                         rand_or_unmask_prob)
        if random_replace_prob == 0:
            unmask = rand_or_unmask
            rand_mask = None
        elif unmask_replace_prob == 0:
            unmask = None
            rand_mask = rand_or_unmask
        else:
            unmask_prob = unmask_replace_prob / rand_or_unmask_prob
            decision = np.random.rand(len(tokenized_ids)) < unmask_prob
            unmask = rand_or_unmask & decision
            rand_mask = rand_or_unmask & (~decision)
        if unmask is not None:
            tokenized_ids[unmask] = labels[unmask]
        if rand_mask is not None:
            weights = np.ones(tokenizer.vocab_size)
            weights[tokenizer.all_special_ids] = 0
            probs = weights / weights.sum()
            num_rand = rand_mask.sum()
            tokenized_ids[rand_mask] = np.random.choice(tokenizer.vocab_size,
                                                        num_rand,
                                                        p=probs)
    return tokenized_ids.tolist(), labels.tolist()


class WikiTextMLMDataset(Dataset):
    """A [Map style dataset](https://pytorch.org/docs/stable/data.html)
    for iterating over the wikitext dataset. Note that this assumes
    the dataset can fit in memory. For larger datasets
    you'd want to shard them and use an iterable dataset (eg: see
    [Infinibatch](https://github.com/microsoft/infinibatch))

    Args:
        Dataset (datasets.arrow_dataset.Dataset):
            The wikitext dataset
        masking_function (Callable[[str], Tuple[List[int], List[int]]])
            The masking function. To generate one training instance,
            the masking function is applied to the `text` of a dataset
            record

    """
    def __init__(
        self,
        dataset: datasets.arrow_dataset.Dataset,
        masking_function: Callable[[str], Tuple[List[int], List[int]]],
    ) -> None:
        self.dataset = dataset
        self.masking_function = masking_function

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, idx: int) -> Tuple[List[int], List[int]]:
        tokens, labels = self.masking_function(self.dataset[idx]["text"])
        return (tokens, labels)

# Declare the type variable T with TypeVar
T = TypeVar("T")


# Wrap the iterable in an iterator so that items are fetched only when needed
class InfiniteIterator(object):
    def __init__(self, iterable: Iterable[T]) -> None:
        self._iterable = iterable
        self._iterator = iter(self._iterable)

    # Return the iterator object itself. An iterator implements __next__() and normally signals the end of iteration by raising StopIteration.
    def __iter__(self):
        return self

    # Return the next item. When the underlying iterator is exhausted (StopIteration),
    # restart it from the beginning so that iteration never ends.
    def __next__(self) -> T:
        next_item = None
        try:
            next_item = next(self._iterator)
        except StopIteration:
            self._iterator = iter(self._iterable)
            next_item = next(self._iterator)
        return next_item


def create_data_iterator(
        mask_prob: float,
        random_replace_prob: float,
        unmask_replace_prob: float,
        batch_size: int,
        max_seq_length: int = 512,
        tokenizer: str = "roberta-base",
) -> InfiniteIterator:
    """Create the dataloader.

    Args:
        mask_prob (float):
            Fraction of tokens to mask
        random_replace_prob (float):
            Fraction of masked tokens to replace with random token
        unmask_replace_prob (float):
            Fraction of masked tokens to replace with the actual token
        batch_size (int):
            The batch size of the generated tensors
        max_seq_length (int, optional):
            The maximum sequence length for the MLM task. Defaults to 512.
        tokenizer (str, optional):
            The tokenizer to use. Defaults to "roberta-base".

    Returns:
        InfiniteIterator:
            The torch DataLoader, wrapped in an InfiniteIterator class, to
            be able to continuously generate samples

    """
    #wikitext_dataset = datasets.load_dataset("wikitext",
    wikitext_dataset = datasets.load_dataset("/home/guodong.li/code/wikitext.py",
                                             "wikitext-2-v1",
                                             split="train")
    wikitext_dataset = wikitext_dataset.filter(
        lambda record: record["text"] != "").map(
            lambda record: {"text": record["text"].rstrip("\n")})
    #tokenizer = AutoTokenizer.from_pretrained(tokenizer)
    tokenizer = AutoTokenizer.from_pretrained("/home/guodong.li/model/roberta-base")

    masking_function_partial = partial(
        masking_function,
        tokenizer=tokenizer,
        mask_prob=mask_prob,
        random_replace_prob=random_replace_prob,
        unmask_replace_prob=unmask_replace_prob,
        max_length=max_seq_length,
    )

    dataset = WikiTextMLMDataset(wikitext_dataset, masking_function_partial)
    collate_fn_partial = partial(collate_function,
                                 pad_token_id=tokenizer.pad_token_id)


    # Load the data
    dataloader = DataLoader(dataset,
                            batch_size=batch_size,
                            shuffle=True,
                            collate_fn=collate_fn_partial)


    return InfiniteIterator(dataloader)


######################################################################
############### Model Creation Related Functions #####################
######################################################################


class RobertaLMHeadWithMaskedPredict(RobertaLMHead):
    def __init__(self,
                 config: RobertaConfig,
                 embedding_weight: Optional[torch.Tensor] = None) -> None:
        super(RobertaLMHeadWithMaskedPredict, self).__init__(config)
        if embedding_weight is not None:
            self.decoder.weight = embedding_weight

    def forward(  # pylint: disable=arguments-differ
        self,
        features: torch.Tensor,
        masked_token_indices: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> torch.Tensor:
        """The current `transformers` library does not provide support
        for masked_token_indices. This function provides the support, by
        running the final forward pass only for the masked indices. This saves
        memory

        Args:
            features (torch.Tensor):
                The features to select from. Shape (batch, seq_len, h_dim)
            masked_token_indices (torch.Tensor, optional):
                The indices of masked tokens for index select. Defaults to None.
                Shape: (num_masked_tokens,)

        Returns:
            torch.Tensor:
                The index selected features. Shape (num_masked_tokens, h_dim)

        """
        if masked_token_indices is not None:
            features = torch.index_select(
                features.view(-1, features.shape[-1]), 0, masked_token_indices)
        return super().forward(features)


class RobertaMLMModel(RobertaPreTrainedModel):
    def __init__(self, config: RobertaConfig, encoder: RobertaModel) -> None:
        super().__init__(config)
        self.encoder = encoder
        self.lm_head = RobertaLMHeadWithMaskedPredict(
            config, self.encoder.embeddings.word_embeddings.weight)
        self.lm_head.apply(self._init_weights)

    def forward(
            self,
            src_tokens: torch.Tensor,
            attention_mask: torch.Tensor,
            tgt_tokens: torch.Tensor,
    ) -> torch.Tensor:
        """The forward pass for the MLM task

        Args:
            src_tokens (torch.Tensor):
                The masked token indices. Shape: (batch, seq_len)
            attention_mask (torch.Tensor):
                The attention mask, since the batches are padded
                to the largest sequence. Shape: (batch, seq_len)
            tgt_tokens (torch.Tensor):
                The output tokens (padded with `config.pad_token_id`)

        Returns:
            torch.Tensor:
                The MLM loss
        """
        # shape: (batch, seq_len, h_dim)
        sequence_output, *_ = self.encoder(input_ids=src_tokens,
                                           attention_mask=attention_mask,
                                           return_dict=False)

        pad_token_id = self.config.pad_token_id
        # (labels have also been padded with pad_token_id)
        # filter out all masked labels
        # shape: (num_masked_tokens,)
        masked_token_indexes = torch.nonzero(
            (tgt_tokens != pad_token_id).view(-1)).view(-1)
        # shape: (num_masked_tokens, vocab_size)
        prediction_scores = self.lm_head(sequence_output, masked_token_indexes)
        # shape: (num_masked_tokens,)
        target = torch.index_select(tgt_tokens.view(-1), 0,
                                    masked_token_indexes)

        loss_fct = nn.CrossEntropyLoss(ignore_index=-1)

        masked_lm_loss = loss_fct(
            prediction_scores.view(-1, self.config.vocab_size), target)
        return masked_lm_loss


def create_model(num_layers: int, num_heads: int, ff_dim: int, h_dim: int,
                 dropout: float) -> RobertaMLMModel:
    """Create a Bert model with the specified `num_heads`, `ff_dim`,
    `h_dim` and `dropout`

    Creates a BERT model.

    Args:
        num_layers (int):
            The number of layers
        num_heads (int):
            The number of attention heads
        ff_dim (int):
            The intermediate hidden size of
            the feed forward block of the
            transformer
        h_dim (int):
            The hidden dim of the intermediate
            representations of the transformer
        dropout (float):
            The value of dropout to be used.
            Note that we apply the same dropout
            to both the attention layers and the
            FF layers

    Returns:
        RobertaMLMModel:
            A Roberta model for MLM task

    """
    roberta_config_dict = {
        "attention_probs_dropout_prob": dropout,
        "bos_token_id": 0,
        "eos_token_id": 2,
        "hidden_act": "gelu",
        "hidden_dropout_prob": dropout,
        "hidden_size": h_dim,
        "initializer_range": 0.02,
        "intermediate_size": ff_dim,
        "layer_norm_eps": 1e-05,
        "max_position_embeddings": 514,
        "model_type": "roberta",
        "num_attention_heads": num_heads,
        "num_hidden_layers": num_layers,
        "pad_token_id": 1,
        "type_vocab_size": 1,
        "vocab_size": 50265,
    }
    roberta_config = RobertaConfig.from_dict(roberta_config_dict)
    roberta_encoder = RobertaModel(roberta_config)
    roberta_model = RobertaMLMModel(roberta_config, roberta_encoder)
    return roberta_model


######################################################################
########### Experiment Management Related Functions ##################
######################################################################


def get_unique_identifier(length: int = 8) -> str:
    """Create a unique identifier by choosing `length` random
    characters from the lowercase ASCII letters and digits
    """
    alphabet = string.ascii_lowercase + string.digits
    uuid = "".join(alphabet[ix]
                   for ix in np.random.choice(len(alphabet), length))
    return uuid


def create_experiment_dir(checkpoint_dir: pathlib.Path,
                          all_arguments: Dict[str, Any]) -> pathlib.Path:
    """Create an experiment directory and save all arguments in it.
    Additionally, also store the githash and gitdiff. Finally create
    a directory for `Tensorboard` logs. The structure would look something
    like
        checkpoint_dir
            `-experiment-name
                |- hparams.json
                |- githash.log
                |- gitdiff.log
                `- tb_dir/

    Args:
        checkpoint_dir (pathlib.Path):
            The base checkpoint directory
        all_arguments (Dict[str, Any]):
            The arguments to save

    Returns:
        pathlib.Path: The experiment directory
    """
    # experiment name follows the following convention
    # {exp_type}.{YYYY}.{MM}.{DD}.{HH}.{MM}.{SS}.{uuid}
    current_time = datetime.datetime.now(pytz.timezone("US/Pacific"))
    expname = "bert_pretrain.{0}.{1}.{2}.{3}.{4}.{5}.{6}".format(
        current_time.year,
        current_time.month,
        current_time.day,
        current_time.hour,
        current_time.minute,
        current_time.second,
        get_unique_identifier(),
    )
    exp_dir = checkpoint_dir / expname
    exp_dir.mkdir(exist_ok=False)
    hparams_file = exp_dir / "hparams.json"
    with hparams_file.open("w") as handle:
        json.dump(obj=all_arguments, fp=handle, indent=2)
    # Save the git hash
    try:
        gitlog = sh.git.log("-1", format="%H", _tty_out=False, _fg=False)
        with (exp_dir / "githash.log").open("w") as handle:
            handle.write(gitlog.stdout.decode("utf-8"))
    except sh.ErrorReturnCode_128:
        logger.info("Seems like the code is not running from"
                    " within a git repo, so hash will"
                    " not be stored. However, it"
                    " is strongly advised to use"
                    " version control.")
    # And the git diff
    try:
        gitdiff = sh.git.diff(_fg=False, _tty_out=False)
        with (exp_dir / "gitdiff.log").open("w") as handle:
            handle.write(gitdiff.stdout.decode("utf-8"))
    except sh.ErrorReturnCode_129:
        logger.info("Seems like the code is not running from"
                    " within a git repo, so diff will"
                    " not be stored. However, it"
                    " is strongly advised to use"
                    " version control.")
    # Finally create the Tensorboard Dir
    tb_dir = exp_dir / "tb_dir"
    tb_dir.mkdir()
    return exp_dir


######################################################################
################ Checkpoint Related Functions ########################
######################################################################


def load_model_checkpoint(
    load_checkpoint_dir: pathlib.Path,
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
) -> Tuple[int, torch.nn.Module, torch.optim.Optimizer]:
    """Loads the optimizer state dict and model state dict from the load_checkpoint_dir
    into the passed model and optimizer. Searches for the most recent checkpoint to
    load from

    Args:
        load_checkpoint_dir (pathlib.Path):
            The base checkpoint directory to load from
        model (torch.nn.Module):
            The model to load the checkpoint weights into
        optimizer (torch.optim.Optimizer):
            The optimizer to load the checkpoint state into

    Returns:
        Tuple[int, torch.nn.Module, torch.optim.Optimizer]:
            The checkpoint step, model with state_dict loaded and
            optimizer with state_dict loaded

    """
    logger.info(
        f"Loading model and optimizer checkpoint from {load_checkpoint_dir}")
    checkpoint_files = list(
        filter(
            lambda path: re.search(r"iter_(?P<iter_no>\d+)\.pt", path.name) is
            not None,
            load_checkpoint_dir.glob("*.pt"),
        ))
    assert len(checkpoint_files) > 0, "No checkpoints found in directory"
    checkpoint_files = sorted(
        checkpoint_files,
        key=lambda path: int(
            re.search(r"iter_(?P<iter_no>\d+)\.pt", path.name).group("iter_no")
        ),
    )
    latest_checkpoint_path = checkpoint_files[-1]
    checkpoint_step = int(
        re.search(r"iter_(?P<iter_no>\d+)\.pt",
                  latest_checkpoint_path.name).group("iter_no"))

    state_dict = torch.load(latest_checkpoint_path)
    model.load_state_dict(state_dict["model"], strict=True)
    optimizer.load_state_dict(state_dict["optimizer"])
    logger.info(
        f"Loading model and optimizer checkpoints done. Loaded from {latest_checkpoint_path}"
    )
    return checkpoint_step, model, optimizer
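

# Illustrative sketch (not part of the original script): the "most recent
# checkpoint" rule used by `load_model_checkpoint`, applied to plain file
# names. The helper name `_sketch_latest_checkpoint` is hypothetical. Names
# are ranked by the integer after `iter_`, so `iter_10` ranks above `iter_9`
# (a plain string sort would order them the other way round).
def _sketch_latest_checkpoint(file_names):
    """Return (name, step) of the highest-numbered checkpoint, or None."""
    import re  # imported locally to keep the sketch self-contained

    pattern = re.compile(r"iter_(?P<iter_no>\d+)\.pt")
    steps = {}
    for name in file_names:
        match = pattern.search(name)
        if match is not None:  # ignore files that are not checkpoints
            steps[name] = int(match.group("iter_no"))
    if not steps:
        return None
    latest = max(steps, key=steps.get)
    return latest, steps[latest]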


######################################################################
######################## Driver Functions ############################
######################################################################


def train(
        checkpoint_dir: str = None,
        load_checkpoint_dir: str = None,
        # Dataset Parameters
        mask_prob: float = 0.15,
        random_replace_prob: float = 0.1,
        unmask_replace_prob: float = 0.1,
        max_seq_length: int = 512,
        tokenizer: str = "roberta-base",
        # Model Parameters
        num_layers: int = 6,
        num_heads: int = 8,
        ff_dim: int = 512,
        h_dim: int = 256,
        dropout: float = 0.1,
        # Training Parameters
        batch_size: int = 8,
        num_iterations: int = 10000,
        checkpoint_every: int = 1000,
        log_every: int = 10,
        local_rank: int = -1,
) -> pathlib.Path:
    """Trains a [Bert style](https://arxiv.org/pdf/1810.04805.pdf)
    (transformer encoder only) model for MLM Task

    Args:
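        checkpoint_dir (str):
            The base experiment directory to save experiments to.
            Exactly one of `checkpoint_dir` and `load_checkpoint_dir`
            must be given.
        load_checkpoint_dir (str):
            An existing experiment directory to resume training from.
            Its `hparams.json` overrides the corresponding arguments
            (for `num_iterations` the larger value is used).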
        mask_prob (float, optional):
            The fraction of tokens to mask. Defaults to 0.15.
        random_replace_prob (float, optional):
            The fraction of masked tokens to replace with random token.
            Defaults to 0.1.
        unmask_replace_prob (float, optional):
            The fraction of masked tokens to leave unchanged.
            Defaults to 0.1.
        max_seq_length (int, optional):
            The maximum sequence length of the examples. Defaults to 512.
        tokenizer (str, optional):
            The tokenizer to use. Defaults to "roberta-base".
        num_layers (int, optional):
            The number of layers in the Bert model. Defaults to 6.
        num_heads (int, optional):
            Number of attention heads to use. Defaults to 8.
        ff_dim (int, optional):
            Size of the intermediate dimension in the FF layer.
            Defaults to 512.
        h_dim (int, optional):
            Size of intermediate representations.
            Defaults to 256.
        dropout (float, optional):
            Amount of dropout to use. Defaults to 0.1.
        batch_size (int, optional):
            The minibatch size. Defaults to 8.
        num_iterations (int, optional):
            Total number of iterations to run the model for.
            Defaults to 10000.
        checkpoint_every (int, optional):
            Save checkpoint after these many steps.

            .. note::

                You want this to be frequent enough that you can
                resume training in case it crashes, but not so much
                that you fill up your entire storage!

            Defaults to 1000.
        log_every (int, optional):
            Print logs after these many steps. Defaults to 10.
        local_rank (int, optional):
            Which GPU to run on (-1 for CPU). Defaults to -1.

    Returns:
        pathlib.Path: The final experiment directory

    """
    device = (torch.device("cuda", local_rank) if (local_rank > -1)
              and torch.cuda.is_available() else torch.device("cpu"))
    ################################
    ###### Create Exp. Dir #########
    ################################
    if checkpoint_dir is None and load_checkpoint_dir is None:
        logger.error("Need to specify one of checkpoint_dir"
                     " or load_checkpoint_dir")
        return
    if checkpoint_dir is not None and load_checkpoint_dir is not None:
        logger.error("Cannot specify both checkpoint_dir"
                     " and load_checkpoint_dir")
        return
    if checkpoint_dir:
        logger.info("Creating Experiment Directory")
        checkpoint_dir = pathlib.Path(checkpoint_dir)
        checkpoint_dir.mkdir(exist_ok=True)
        all_arguments = {
            # Dataset Params
            "mask_prob": mask_prob,
            "random_replace_prob": random_replace_prob,
            "unmask_replace_prob": unmask_replace_prob,
            "max_seq_length": max_seq_length,
            "tokenizer": tokenizer,
            # Model Params
            "num_layers": num_layers,
            "num_heads": num_heads,
            "ff_dim": ff_dim,
            "h_dim": h_dim,
            "dropout": dropout,
            # Training Params
            "batch_size": batch_size,
            "num_iterations": num_iterations,
            "checkpoint_every": checkpoint_every,
        }
        exp_dir = create_experiment_dir(checkpoint_dir, all_arguments)
        logger.info(f"Experiment Directory created at {exp_dir}")
    else:
        logger.info("Loading from Experiment Directory")
        load_checkpoint_dir = pathlib.Path(load_checkpoint_dir)
        assert load_checkpoint_dir.exists()
        with (load_checkpoint_dir / "hparams.json").open("r") as handle:
            hparams = json.load(handle)
        # Set the hparams
        # Dataset Params
        mask_prob = hparams.get("mask_prob", mask_prob)
        tokenizer = hparams.get("tokenizer", tokenizer)
        random_replace_prob = hparams.get("random_replace_prob",
                                          random_replace_prob)
        unmask_replace_prob = hparams.get("unmask_replace_prob",
                                          unmask_replace_prob)
        max_seq_length = hparams.get("max_seq_length", max_seq_length)
        # Model Params
        ff_dim = hparams.get("ff_dim", ff_dim)
        h_dim = hparams.get("h_dim", h_dim)
        dropout = hparams.get("dropout", dropout)
        num_layers = hparams.get("num_layers", num_layers)
        num_heads = hparams.get("num_heads", num_heads)
        # Training Params
        batch_size = hparams.get("batch_size", batch_size)
        _num_iterations = hparams.get("num_iterations", num_iterations)
        num_iterations = max(num_iterations, _num_iterations)
        checkpoint_every = hparams.get("checkpoint_every", checkpoint_every)
        exp_dir = load_checkpoint_dir
    # Tensorboard writer
    tb_dir = exp_dir / "tb_dir"
    assert tb_dir.exists()
    summary_writer = SummaryWriter(log_dir=tb_dir)


    ################################
    ###### Create Datasets #########
    ################################
    logger.info("Creating Datasets")
    data_iterator = create_data_iterator(
        mask_prob=mask_prob,
        random_replace_prob=random_replace_prob,
        unmask_replace_prob=unmask_replace_prob,
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        batch_size=batch_size,
    )
    logger.info("Dataset Creation Done")
    ################################
    ###### Create Model ############
    ################################
    logger.info("Creating Model")
    model = create_model(
        num_layers=num_layers,
        num_heads=num_heads,
        ff_dim=ff_dim,
        h_dim=h_dim,
        dropout=dropout,
    )
    model = model.to(device)
    logger.info("Model Creation Done")


    ################################
    ###### Create Optimizer ########
    ################################
    logger.info("Creating Optimizer")
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    logger.info("Optimizer Creation Done")



    ################################
    #### Load Model checkpoint #####
    ################################
    start_step = 1
    if load_checkpoint_dir is not None:
        checkpoint_step, model, optimizer = load_model_checkpoint(
            load_checkpoint_dir, model, optimizer)
        start_step = checkpoint_step + 1

    ################################
    ####### The Training Loop ######
    ################################
    logger.info(
        f"Total number of model parameters: {sum([p.numel() for p in model.parameters()]):,d}"
    )
    model.train()
    losses = []
    for step, batch in enumerate(data_iterator, start=start_step):
        if step >= num_iterations:
            break
        optimizer.zero_grad()
        # Move the tensors to device
        for key, value in batch.items():
            batch[key] = value.to(device)
        # Forward pass
        loss = model(**batch)
        # Backward pass
        loss.backward()

        # Optimizer Step
        optimizer.step()
        
        losses.append(loss.item())
        if step % log_every == 0:
            logger.info("Loss: {0:.4f}".format(np.mean(losses)))
            summary_writer.add_scalar("Train/loss", np.mean(losses), step)
        if step % checkpoint_every == 0:
            state_dict = {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            }
            # Save the model and optimizer state
            torch.save(obj=state_dict,
                       f=str(exp_dir / f"checkpoint.iter_{step}.pt"))
            logger.info("Saved model to {0}".format(
                (exp_dir / f"checkpoint.iter_{step}.pt")))
    # Save the last checkpoint if not saved yet
    if step % checkpoint_every != 0:
        state_dict = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }
        torch.save(obj=state_dict,
                   f=str(exp_dir / f"checkpoint.iter_{step}.pt"))
        logger.info("Saved model to {0}".format(
            (exp_dir / f"checkpoint.iter_{step}.pt")))

    return exp_dir


if __name__ == "__main__":
    torch.manual_seed(42)
    np.random.seed(0)
    random.seed(0)
    fire.Fire(train)


================================================
FILE: ai-framework/deepspeed/hello_bert/train_bert_ds.py
================================================
import datetime
import json
import pathlib
import re
import string
from functools import partial
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, TypeVar, Union

import random
import datasets
import fire
import loguru
import numpy as np
import pytz
import sh
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torch.utils.tensorboard import SummaryWriter
from transformers import AutoTokenizer, PreTrainedTokenizer, PreTrainedTokenizerFast
from transformers.models.roberta import RobertaConfig, RobertaModel
from transformers.models.roberta.modeling_roberta import (
    RobertaLMHead,
    RobertaPreTrainedModel,
)

logger = loguru.logger

######################################################################
############### Dataset Creation Related Functions ###################
######################################################################

TokenizerType = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]


def collate_function(batch: List[Tuple[List[int], List[int]]],
                     pad_token_id: int) -> Dict[str, torch.Tensor]:
    """Collect a list of masked token indices, and labels, and
    batch them, padding to max length in the batch.
    """
    max_length = max(len(token_ids) for token_ids, _ in batch)
    padded_token_ids = [
        token_ids +
        [pad_token_id for _ in range(0, max_length - len(token_ids))]
        for token_ids, _ in batch
    ]
    padded_labels = [
        labels + [pad_token_id for _ in range(0, max_length - len(labels))]
        for _, labels in batch
    ]
    src_tokens = torch.LongTensor(padded_token_ids)
    tgt_tokens = torch.LongTensor(padded_labels)
    attention_mask = src_tokens.ne(pad_token_id).type_as(src_tokens)
    return {
        "src_tokens": src_tokens,
        "tgt_tokens": tgt_tokens,
        "attention_mask": attention_mask,
    }
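

# Illustrative sketch (not part of the original script): what
# `collate_function` produces, using plain lists instead of tensors. The
# helper name `_sketch_pad_batch` is hypothetical. Every sequence is
# right-padded to the longest one in the batch, and the attention mask is 1
# wherever the padded input differs from `pad_token_id`.
def _sketch_pad_batch(batch, pad_token_id):
    """Pad (token_ids, labels) pairs and derive the attention mask."""
    max_length = max(len(token_ids) for token_ids, _ in batch)
    src = [ids + [pad_token_id] * (max_length - len(ids)) for ids, _ in batch]
    tgt = [lab + [pad_token_id] * (max_length - len(lab)) for _, lab in batch]
    mask = [[int(tok != pad_token_id) for tok in row] for row in src]
    return {"src_tokens": src, "tgt_tokens": tgt, "attention_mask": mask}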


def masking_function(
        text: str,
        tokenizer: TokenizerType,
        mask_prob: float,
        random_replace_prob: float,
        unmask_replace_prob: float,
        max_length: int,
) -> Tuple[List[int], List[int]]:
    """Given a text string, randomly mask wordpieces for Bert MLM
    training.

    Args:
        text (str):
            The input text
        tokenizer (TokenizerType):
            The tokenizer for tokenization
        mask_prob (float):
            What fraction of tokens to mask
        random_replace_prob (float):
            Of the masked tokens, how many should be replaced with
            random tokens (improves performance)
        unmask_replace_prob (float):
            Of the masked tokens, how many should be replaced with
            the original token (improves performance)
        max_length (int):
            The maximum sequence length to consider. Note that for
            Bert style models, this is a function of the number of
            positional embeddings you learn

    Returns:
        Tuple[List[int], List[int]]:
            The masked token ids (based on the tokenizer passed),
            and the output labels (padded with `tokenizer.pad_token_id`)
    """
    # Note: by default, `encode` adds the BOS and EOS tokens itself.
    # That is disabled here and they are added explicitly for clarity
    tokenized_ids = ([tokenizer.bos_token_id] +
                     tokenizer.encode(text,
                                      add_special_tokens=False,
                                      truncation=True,
                                      max_length=max_length - 2) +
                     [tokenizer.eos_token_id])
    seq_len = len(tokenized_ids)
    tokenized_ids = np.array(tokenized_ids)
    subword_mask = np.full(len(tokenized_ids), False)

    # Masking the BOS and EOS token leads to slightly worse performance
    low = 1
    high = len(subword_mask) - 1
    mask_choices = np.arange(low, high)
    num_subwords_to_mask = max(
        int((mask_prob * (high - low)) + np.random.rand()), 1)
    subword_mask[np.random.choice(mask_choices,
                                  num_subwords_to_mask,
                                  replace=False)] = True

    # Create the labels first
    labels = np.full(seq_len, tokenizer.pad_token_id)
    labels[subword_mask] = tokenized_ids[subword_mask]

    tokenized_ids[subword_mask] = tokenizer.mask_token_id

    # Now of the masked tokens, choose how many to replace with random and how many to unmask
    rand_or_unmask_prob = random_replace_prob + unmask_replace_prob
    if rand_or_unmask_prob > 0:
        rand_or_unmask = subword_mask & (np.random.rand(len(tokenized_ids)) <
                                         rand_or_unmask_prob)
        if random_replace_prob == 0:
            unmask = rand_or_unmask
            rand_mask = None
        elif unmask_replace_prob == 0:
            unmask = None
            rand_mask = rand_or_unmask
        else:
            unmask_prob = unmask_replace_prob / rand_or_unmask_prob
            decision = np.random.rand(len(tokenized_ids)) < unmask_prob
            unmask = rand_or_unmask & decision
            rand_mask = rand_or_unmask & (~decision)
        if unmask is not None:
            tokenized_ids[unmask] = labels[unmask]
        if rand_mask is not None:
            weights = np.ones(tokenizer.vocab_size)
            weights[tokenizer.all_special_ids] = 0
            probs = weights / weights.sum()
            num_rand = rand_mask.sum()
            tokenized_ids[rand_mask] = np.random.choice(tokenizer.vocab_size,
                                                        num_rand,
                                                        p=probs)
    return tokenized_ids.tolist(), labels.tolist()
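

# Illustrative sketch (not part of the original script): how
# `masking_function` picks the number of positions to mask. Adding a uniform
# draw from [0, 1) before truncating rounds `mask_prob * num_candidates` up or
# down at random, so the average count stays close to that product, and the
# `max(..., 1)` guarantees at least one masked token. The helper name and the
# explicit `uniform_draw` argument are hypothetical (the script calls
# `np.random.rand()` inline).
def _sketch_num_to_mask(num_candidates, mask_prob, uniform_draw):
    """Stochastically round the expected number of masked tokens."""
    return max(int(mask_prob * num_candidates + uniform_draw), 1)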


class WikiTextMLMDataset(Dataset):
    """A [Map style dataset](https://pytorch.org/docs/stable/data.html)
    for iterating over the wikitext dataset. Note that this assumes
    the dataset can fit in memory. For larger datasets
    you'd want to shard them and use an iterable dataset (eg: see
    [Infinibatch](https://github.com/microsoft/infinibatch))

    Args:
        dataset (datasets.arrow_dataset.Dataset):
            The wikitext dataset
        masking_function (Callable[[str], Tuple[List[int], List[int]]]):
            The masking function. To generate one training instance,
            the masking function is applied to the `text` of a dataset
            record

    """
    def __init__(
        self,
        dataset: datasets.arrow_dataset.Dataset,
        masking_function: Callable[[str], Tuple[List[int], List[int]]],
    ) -> None:
        self.dataset = dataset
        self.masking_function = masking_function

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, idx: int) -> Tuple[List[int], List[int]]:
        tokens, labels = self.masking_function(self.dataset[idx]["text"])
        return (tokens, labels)


T = TypeVar("T")


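class InfiniteIterator(object):
    """Wrap an iterable so that iteration never stops: once the underlying
    iterator is exhausted, a fresh one is created from the iterable.
    """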
    def __init__(self, iterable: Iterable[T]) -> None:
        self._iterable = iterable
        self._iterator = iter(self._iterable)

    def __iter__(self):
        return self

    def __next__(self) -> T:
        next_item = None
        try:
            next_item = next(self._iterator)
        except StopIteration:
            self._iterator = iter(self._iterable)
            next_item = next(self._iterator)
        return next_item
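

# Illustrative sketch (not part of the original script): the
# restart-on-exhaustion behaviour of `InfiniteIterator`, written as a
# generator. The helper name `_sketch_restarting` is hypothetical. Unlike
# `itertools.cycle`, which replays a saved copy of the first pass, this calls
# `iter()` on the iterable again for every pass, so a DataLoader created with
# `shuffle=True` would be reshuffled each time.
def _sketch_restarting(iterable):
    """Yield items from `iterable` forever, restarting when it runs out."""
    while True:
        yielded = False
        for item in iterable:
            yielded = True
            yield item
        if not yielded:  # avoid spinning forever on an empty iterable
            return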


def create_data_iterator(
        mask_prob: float,
        random_replace_prob: float,
        unmask_replace_prob: float,
        batch_size: int,
        max_seq_length: int = 512,
        tokenizer: str = "roberta-base",
) -> InfiniteIterator:
    """Create the dataloader.

    Args:
        mask_prob (float):
            Fraction of tokens to mask
        random_replace_prob (float):
            Fraction of masked tokens to replace with random token
        unmask_replace_prob (float):
            Fraction of masked tokens to replace with the actual token
        batch_size (int):
            The batch size of the generated tensors
        max_seq_length (int, optional):
            The maximum sequence length for the MLM task. Defaults to 512.
        tokenizer (str, optional):
            The tokenizer to use. Defaults to "roberta-base".

    Returns:
        InfiniteIterator:
            The torch DataLoader, wrapped in an InfiniteIterator class, to
            be able to continuously generate samples

    """
    # NOTE: the dataset is loaded through a local copy of the wikitext loading
    # script; pass "wikitext" instead to download it from the Hugging Face Hub.
    wikitext_dataset = datasets.load_dataset("/home/guodong.li/code/wikitext.py",
                                             "wikitext-2-v1",
                                             split="train")
    wikitext_dataset = wikitext_dataset.filter(
        lambda record: record["text"] != "").map(
            lambda record: {"text": record["text"].rstrip("\n")})
    # NOTE: the `tokenizer` argument is ignored here in favour of a local copy
    # of roberta-base; use AutoTokenizer.from_pretrained(tokenizer) to honour it.
    tokenizer = AutoTokenizer.from_pretrained("/home/guodong.li/model/roberta-base")

    masking_function_partial = partial(
        masking_function,
        tokenizer=tokenizer,
        mask_prob=mask_prob,
        random_replace_prob=random_replace_prob,
        unmask_replace_prob=unmask_replace_prob,
        max_length=max_seq_length,
    )
    dataset = WikiTextMLMDataset(wikitext_dataset, masking_function_partial)
    collate_fn_partial = partial(collate_function,
                                 pad_token_id=tokenizer.pad_token_id)
    dataloader = DataLoader(dataset,
                            batch_size=batch_size,
                            shuffle=True,
                            collate_fn=collate_fn_partial)

    return InfiniteIterator(dataloader)


######################################################################
############### Model Creation Related Functions #####################
######################################################################


class RobertaLMHeadWithMaskedPredict(RobertaLMHead):
    def __init__(self,
                 config: RobertaConfig,
                 embedding_weight: Optional[torch.Tensor] = None) -> None:
        super(RobertaLMHeadWithMaskedPredict, self).__init__(config)
        if embedding_weight is not None:
            self.decoder.weight = embedding_weight

    def forward(  # pylint: disable=arguments-differ
        self,
        features: torch.Tensor,
        masked_token_indices: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> torch.Tensor:
        """The current `transformers` library does not provide support
        for masked_token_indices. This function provides the support, by
        running the final forward pass only for the masked indices. This saves
        memory

        Args:
            features (torch.Tensor):
                The features to select from. Shape (batch, seq_len, h_dim)
            masked_token_indices (torch.Tensor, optional):
                The indices of masked tokens for index select. Defaults to None.
                Shape: (num_masked_tokens,)

        Returns:
            torch.Tensor:
                The prediction scores for the selected positions.
                Shape: (num_masked_tokens, vocab_size)

        """
        if masked_token_indices is not None:
            features = torch.index_select(
                features.view(-1, features.shape[-1]), 0, masked_token_indices)
        return super().forward(features)


class RobertaMLMModel(RobertaPreTrainedModel):
    def __init__(self, config: RobertaConfig, encoder: RobertaModel) -> None:
        super().__init__(config)
        self.encoder = encoder
        self.lm_head = RobertaLMHeadWithMaskedPredict(
            config, self.encoder.embeddings.word_embeddings.weight)
        self.lm_head.apply(self._init_weights)

    def forward(
            self,
            src_tokens: torch.Tensor,
            attention_mask: torch.Tensor,
            tgt_tokens: torch.Tensor,
    ) -> torch.Tensor:
        """The forward pass for the MLM task

        Args:
            src_tokens (torch.Tensor):
                The masked token indices. Shape: (batch, seq_len)
            attention_mask (torch.Tensor):
                The attention mask, since the batches are padded
                to the largest sequence. Shape: (batch, seq_len)
            tgt_tokens (torch.Tensor):
                The output tokens (padded with `config.pad_token_id`)

        Returns:
            torch.Tensor:
                The MLM loss
        """
        # shape: (batch, seq_len, h_dim)
        sequence_output, *_ = self.encoder(input_ids=src_tokens,
                                           attention_mask=attention_mask,
                                           return_dict=False)

        pad_token_id = self.config.pad_token_id
        # Labels are padded with pad_token_id at every unmasked position,
        # so keep only the positions whose label is not pad_token_id
        # shape: (num_masked_tokens,)
        masked_token_indexes = torch.nonzero(
            (tgt_tokens != pad_token_id).view(-1)).view(-1)
        # shape: (num_masked_tokens, vocab_size)
        prediction_scores = self.lm_head(sequence_output, masked_token_indexes)
        # shape: (num_masked_tokens,)
        target = torch.index_select(tgt_tokens.view(-1), 0,
                                    masked_token_indexes)

        loss_fct = nn.CrossEntropyLoss(ignore_index=-1)

        masked_lm_loss = loss_fct(
            prediction_scores.view(-1, self.config.vocab_size), target)
        return masked_lm_loss


def create_model(num_layers: int, num_heads: int, ff_dim: int, h_dim: int,
                 dropout: float) -> RobertaMLMModel:
    """Create a BERT-style (RoBERTa) model with the specified `num_layers`,
    `num_heads`, `ff_dim`, `h_dim` and `dropout`

    Args:
        num_layers (int):
            The number of layers
        num_heads (int):
            The number of attention heads
        ff_dim (int):
            The intermediate hidden size of
            the feed forward block of the
            transformer
        h_dim (int):
            The hidden dim of the intermediate
            representations of the transformer
        dropout (float):
            The value of dropout to be used.
            Note that we apply the same dropout
            to both the attention layers and the
            FF layers

    Returns:
        RobertaMLMModel:
            A Roberta model for MLM task

    """
    roberta_config_dict = {
        "attention_probs_dropout_prob": dropout,
        "bos_token_id": 0,
        "eos_token_id": 2,
        "hidden_act": "gelu",
        "hidden_dropout_prob": dropout,
        "hidden_size": h_dim,
        "initializer_range": 0.02,
        "intermediate_size": ff_dim,
        "layer_norm_eps": 1e-05,
        "max_position_embeddings": 514,
        "model_type": "roberta",
        "num_attention_heads": num_heads,
        "num_hidden_layers": num_layers,
        "pad_token_id": 1,
        "type_vocab_size": 1,
        "vocab_size": 50265,
    }
    roberta_config = RobertaConfig.from_dict(roberta_config_dict)
    roberta_encoder = RobertaModel(roberta_config)
    roberta_model = RobertaMLMModel(roberta_config, roberta_encoder)
    return roberta_model


######################################################################
########### Experiment Management Related Functions ##################
######################################################################


def get_unique_identifier(length: int = 8) -> str:
    """Create a unique identifier by choosing `length` random
    characters from the lowercase ASCII letters and digits
    """
    alphabet = string.ascii_lowercase + string.digits
    uuid = "".join(alphabet[ix]
                   for ix in np.random.choice(len(alphabet), length))
    return uuid


def create_experiment_dir(checkpoint_dir: pathlib.Path,
                          all_arguments: Dict[str, Any]) -> pathlib.Path:
    """Create an experiment directory and save all arguments in it.
    Additionally, also store the githash and gitdiff. Finally create
    a directory for `Tensorboard` logs. The structure would look something
    like
        checkpoint_dir
            `-experiment-name
                |- hparams.json
                |- githash.log
                |- gitdiff.log
                `- tb_dir/

    Args:
        checkpoint_dir (pathlib.Path):
            The base checkpoint directory
        all_arguments (Dict[str, Any]):
            The arguments to save

    Returns:
        pathlib.Path: The experiment directory
    """
    # experiment name follows the following convention
    # {exp_type}.{YYYY}.{MM}.{DD}.{HH}.{MM}.{SS}.{uuid}
    current_time = datetime.datetime.now(pytz.timezone("US/Pacific"))
    expname = "bert_pretrain.{0}.{1}.{2}.{3}.{4}.{5}.{6}".format(
        current_time.year,
        current_time.month,
        current_time.day,
        current_time.hour,
        current_time.minute,
        current_time.second,
        get_unique_identifier(),
    )
    exp_dir = checkpoint_dir / expname
    exp_dir.mkdir(exist_ok=False)
    hparams_file = exp_dir / "hparams.json"
    with hparams_file.open("w") as handle:
        json.dump(obj=all_arguments, fp=handle, indent=2)
    # Save the git hash
    try:
        gitlog = sh.git.log("-1", format="%H", _tty_out=False, _fg=False)
        with (exp_dir / "githash.log").open("w") as handle:
            handle.write(gitlog.stdout.decode("utf-8"))
    except sh.ErrorReturnCode_128:
        logger.info("Seems like the code is not running from"
                    " within a git repo, so hash will"
                    " not be stored. However, it"
                    " is strongly advised to use"
                    " version control.")
    # And the git diff
    try:
        gitdiff = sh.git.diff(_fg=False, _tty_out=False)
        with (exp_dir / "gitdiff.log").open("w") as handle:
            handle.write(gitdiff.stdout.decode("utf-8"))
    except sh.ErrorReturnCode_129:
        logger.info("Seems like the code is not running from"
                    " within a git repo, so diff will"
                    " not be stored. However, it"
                    " is strongly advised to use"
                    " version control.")
    # Finally create the Tensorboard Dir
    tb_dir = exp_dir / "tb_dir"
    tb_dir.mkdir()
    return exp_dir


######################################################################
################ Checkpoint Related Functions ########################
######################################################################


def load_model_checkpoint(
    load_checkpoint_dir: pathlib.Path,
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
) -> Tuple[int, torch.nn.Module, torch.optim.Optimizer]:
    """Loads the optimizer state dict and model state dict from the load_checkpoint_dir
    into the passed model and optimizer. Searches for the most recent checkpoint to
    load from

    Args:
        load_checkpoint_dir (pathlib.Path):
            The base checkpoint directory to load from
        model (torch.nn.Module):
            The model to load the checkpoint weights into
        optimizer (torch.optim.Optimizer):
            The optimizer to load the checkpoint state into

    Returns:
        Tuple[int, torch.nn.Module, torch.optim.Optimizer]:
            The checkpoint step, model with state_dict loaded and
            optimizer with state_dict loaded

    """
    logger.info(
        f"Loading model and optimizer checkpoint from {load_checkpoint_dir}")
    checkpoint_files = list(
        filter(
            lambda path: re.search(r"iter_(?P<iter_no>\d+)\.pt", path.name) is
            not None,
            load_checkpoint_dir.glob("*.pt"),
        ))
    assert len(checkpoint_files) > 0, "No checkpoints found in directory"
    checkpoint_files = sorted(
        checkpoint_files,
        key=lambda path: int(
            re.search(r"iter_(?P<iter_no>\d+)\.pt", path.name).group("iter_no")
        ),
    )
    latest_checkpoint_path = checkpoint_files[-1]
    checkpoint_step = int(
        re.search(r"iter_(?P<iter_no>\d+)\.pt",
                  latest_checkpoint_path.name).group("iter_no"))

    state_dict = torch.load(latest_checkpoint_path)
    model.load_state_dict(state_dict["model"], strict=True)
    optimizer.load_state_dict(state_dict["optimizer"])
    logger.info(
        f"Loading model and optimizer checkpoints done. Loaded from {latest_checkpoint_path}"
    )
    return checkpoint_step, model, optimizer
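

# Illustrative sketch (not part of the original script): how the "most recent
# checkpoint" lookup in `load_model_checkpoint` picks a file. The helper name
# `_example_latest_checkpoint` is hypothetical. It reuses the same regex and
# compares iteration numbers numerically; as an assumption for this sketch it
# returns None instead of asserting when no checkpoint file matches.
def _example_latest_checkpoint(directory):
    # Imports are kept local so the sketch is self-contained
    import pathlib
    import re

    pattern = re.compile(r"iter_(?P<iter_no>\d+)\.pt")
    candidates = []
    for path in pathlib.Path(directory).glob("*.pt"):
        match = pattern.search(path.name)
        if match is not None:
            candidates.append((int(match.group("iter_no")), path))
    if not candidates:
        return None
    # Compare on the integer step so that iter_10000 ranks above iter_9000
    # (a plain string comparison would order them the other way round)
    return max(candidates, key=lambda item: item[0])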


######################################################################
######################## Driver Functions ############################
######################################################################


def train(
        checkpoint_dir: str = None,
        load_checkpoint_dir: str = None,
        # Dataset Parameters
        mask_prob: float = 0.15,
        random_replace_prob: float = 0.1,
        unmask_replace_prob: float = 0.1,
        max_seq_length: int = 512,
        tokenizer: str = "roberta-base",
        # Model Parameters
        num_layers: int = 6,
        num_heads: int = 8,
        ff_dim: int = 512,
        h_dim: int = 256,
        dropout: float = 0.1,
        # Training Parameters
        batch_size: int = 8,
        num_iterations: int = 10000,
        checkpoint_every: int = 1000,
        log_every: int = 10,
        local_rank: int = -1,
) -> pathlib.Path:
    """Trains a [Bert style](https://arxiv.org/pdf/1810.04805.pdf)
    (transformer encoder only) model for MLM Task

    Args:
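        checkpoint_dir (str):
            The base experiment directory to save experiments to.
            Exactly one of `checkpoint_dir` and `load_checkpoint_dir`
            must be given.
        load_checkpoint_dir (str):
            An existing experiment directory to resume training from.
            Its `hparams.json` overrides the arguments passed here.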
        mask_prob (float, optional):
            The fraction of tokens to mask. Defaults to 0.15.
        random_replace_prob (float, optional):
            The fraction of masked tokens to replace with random token.
            Defaults to 0.1.
        unmask_replace_prob (float, optional):
            The fraction of masked tokens to leave unchanged.
            Defaults to 0.1.
        max_seq_length (int, optional):
            The maximum sequence length of the examples. Defaults to 512.
        tokenizer (str, optional):
            The tokenizer to use. Defaults to "roberta-base".
        num_layers (int, optional):
            The number of layers in the Bert model. Defaults to 6.
        num_heads (int, optional):
            Number of attention heads to use. Defaults to 8.
        ff_dim (int, optional):
            Size of the intermediate dimension in the FF layer.
            Defaults to 512.
        h_dim (int, optional):
            Size of intermediate representations.
            Defaults to 256.
        dropout (float, optional):
            Amount of dropout to use. Defaults to 0.1.
        batch_size (int, optional):
            The minibatch size. Defaults to 8.
        num_iterations (int, optional):
            Total number of iterations to run the model for.
            Defaults to 10000.
        checkpoint_every (int, optional):
            Save checkpoint after these many steps.

            .. note::

                You want this to be frequent enough that you can
                resume training in case it crashes, but not so much
                that you fill up your entire storage !

            Defaults to 1000.
        log_every (int, optional):
            Print logs after these many steps. Defaults to 10.
        local_rank (int, optional):
            Which GPU to run on (-1 for CPU). Defaults to -1.

    Returns:
        pathlib.Path: The final experiment directory

    """
    device = (torch.device("cuda", local_rank) if (local_rank > -1)
              and torch.cuda.is_available() else torch.device("cpu"))
    ################################
    ###### Create Exp. Dir #########
    ################################
    if checkpoint_dir is None and load_checkpoint_dir is None:
        logger.error("Need to specify one of checkpoint_dir"
                     " or load_checkpoint_dir")
        return
    if checkpoint_dir is not None and load_checkpoint_dir is not None:
        logger.error("Cannot specify both checkpoint_dir"
                     " and load_checkpoint_dir")
        return
    if checkpoint_dir:
        logger.info("Creating Experiment Directory")
        checkpoint_dir = pathlib.Path(checkpoint_dir)
        checkpoint_dir.mkdir(exist_ok=True)
        all_arguments = {
            # Dataset Params
            "mask_prob": mask_prob,
            "random_replace_prob": random_replace_prob,
            "unmask_replace_prob": unmask_replace_prob,
            "max_seq_length": max_seq_length,
            "tokenizer": tokenizer,
            # Model Params
            "num_layers": num_layers,
            "num_heads": num_heads,
            "ff_dim": ff_dim,
            "h_dim": h_dim,
            "dropout": dropout,
            # Training Params
            "batch_size": batch_size,
            "num_iterations": num_iterations,
            "checkpoint_every": checkpoint_every,
        }
        exp_dir = create_experiment_dir(checkpoint_dir, all_arguments)
        logger.info(f"Experiment Directory created at {exp_dir}")
    else:
        logger.info("Loading from Experiment Directory")
        load_checkpoint_dir = pathlib.Path(load_checkpoint_dir)
        assert load_checkpoint_dir.exists()
        with (load_checkpoint_dir / "hparams.json").open("r") as handle:
            hparams = json.load(handle)
        # Set the hparams
        # Dataset Params
        mask_prob = hparams.get("mask_prob", mask_prob)
        tokenizer = hparams.get("tokenizer", tokenizer)
        random_replace_prob = hparams.get("random_replace_prob",
                                          random_replace_prob)
        unmask_replace_prob = hparams.get("unmask_replace_prob",
                                          unmask_replace_prob)
        max_seq_length = hparams.get("max_seq_length", max_seq_length)
        # Model Params
        ff_dim = hparams.get("ff_dim", ff_dim)
        h_dim = hparams.get("h_dim", h_dim)
        dropout = hparams.get("dropout", dropout)
        num_layers = hparams.get("num_layers", num_layers)
        num_heads = hparams.get("num_heads", num_heads)
        # Training Params
        batch_size = hparams.get("batch_size", batch_size)
        _num_iterations = hparams.get("num_iterations", num_iterations)
        num_iterations = max(num_iterations, _num_iterations)
        checkpoint_every = hparams.get("checkpoint_every", checkpoint_every)
        exp_dir = load_checkpoint_dir
    # Tensorboard writer
    tb_dir = exp_dir / "tb_dir"
    assert tb_dir.exists()
    summary_writer = SummaryWriter(log_dir=tb_dir)
    ################################
    ###### Create Datasets #########
    ################################
    logger.info("Creating Datasets")
    data_iterator = create_data_iterator(
        mask_prob=mask_prob,
        random_replace_prob=random_replace_prob,
        unmask_replace_prob=unmask_replace_prob,
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        batch_size=batch_size,
    )
    logger.info("Dataset Creation Done")
    ################################
    ###### Create Model ############
    ################################
    logger.info("Creating Model")
    model = create_model(
        num_layers=num_layers,
        num_heads=num_heads,
        ff_dim=ff_dim,
        h_dim=h_dim,
        dropout=dropout,
    )
    model = model.to(device)
    logger.info("Model Creation Done")
    ################################
    ###### Create Optimizer #######
    ################################
    logger.info("Creating Optimizer")
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    logger.info("Optimizer Creation Done")
    ################################
    #### Load Model checkpoint #####
    ################################
    start_step = 1
    if load_checkpoint_dir is not None:
        checkpoint_step, model, optimizer = load_model_checkpoint(
            load_checkpoint_dir, model, optimizer)
        start_step = checkpoint_step + 1

    ################################
    ####### The Training Loop ######
    ################################
    logger.info(
        f"Total number of model parameters: {sum([p.numel() for p in model.parameters()]):,d}"
    )
    model.train()
    losses = []
    for step, batch in enumerate(data_iterator, start=start_step):
        if step >= num_iterations:
            break
        optimizer.zero_grad()
        # Move the tensors to device
        for key, value in batch.items():
            batch[key] = value.to(device)
        # Forward pass
        loss = model(**batch)
        # Backward pass
        loss.backward()
        # Optimizer Step
        optimizer.step()
        losses.append(loss.item())
        if step % log_every == 0:
            logger.info("Loss: {0:.4f}".format(np.mean(losses)))
            summary_writer.add_scalar("Train/loss", np.mean(losses), step)
        if step % checkpoint_every == 0:
            state_dict = {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            }
            torch.save(obj=state_dict,
                       f=str(exp_dir / f"checkpoint.iter_{step}.pt"))
            logger.info("Saved model to {0}".format(
                (exp_dir / f"checkpoint.iter_{step}.pt")))
    # Save the last checkpoint if not saved yet
    if step % checkpoint_every != 0:
        state_dict = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }
        torch.save(obj=state_dict,
                   f=str(exp_dir / f"checkpoint.iter_{step}.pt"))
        logger.info("Saved model to {0}".format(
            (exp_dir / f"checkpoint.iter_{step}.pt")))

    return exp_dir


if __name__ == "__main__":
    torch.manual_seed(42)
    np.random.seed(0)
    random.seed(0)
    fire.Fire(train)
(llama-venv-py310-cu117) [guodong.li@ai-app-2-46 HelloDeepSpeed]$ cat train_bert_ds.py
"""
Modified version of train_bert.py that adds DeepSpeed
"""

import os
import datetime
import json
import pathlib
import re
import string
from functools import partial
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, TypeVar, Union

import random
import datasets
import fire
import logging
import loguru
import numpy as np
import pytz
import sh
import torch
import torch.nn as nn
import deepspeed
from torch.utils.data import DataLoader, Dataset
from torch.utils.tensorboard import SummaryWriter
from transformers import AutoTokenizer, PreTrainedTokenizer, PreTrainedTokenizerFast
from transformers.models.roberta import RobertaConfig, RobertaModel
from transformers.models.roberta.modeling_roberta import (
    RobertaLMHead,
    RobertaPreTrainedModel,
)


def is_rank_0() -> bool:
    return int(os.environ.get("RANK", "0")) == 0


######################################################################
####################### Logging Functions ############################
######################################################################

logger = loguru.logger


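def log_dist(message: str,
             ranks: Optional[List[int]] = None,
             level: int = logging.INFO) -> None:
    """Log messages for specified ranks only"""
    my_rank = int(os.environ.get("RANK", "0"))
    # `ranks=None` (the default) logs on no rank, avoiding a mutable default
    if ranks is not None and my_rank in ranks: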
        if level == logging.INFO:
            logger.info(f'[Rank {my_rank}] {message}')
        if level == logging.ERROR:
            logger.error(f'[Rank {my_rank}] {message}')
        if level == logging.DEBUG:
            logger.debug(f'[Rank {my_rank}] {message}')


######################################################################
############### Dataset Creation Related Functions ###################
######################################################################

TokenizerType = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]


def collate_function(batch: List[Tuple[List[int], List[int]]],
                     pad_token_id: int) -> Dict[str, torch.Tensor]:
    """Collect a list of masked token indices, and labels, and
    batch them, padding to max length in the batch.
    """
    max_length = max(len(token_ids) for token_ids, _ in batch)
    padded_token_ids = [
        token_ids +
        [pad_token_id for _ in range(0, max_length - len(token_ids))]
        for token_ids, _ in batch
    ]
    padded_labels = [
        labels + [pad_token_id for _ in range(0, max_length - len(labels))]
        for _, labels in batch
    ]
    src_tokens = torch.LongTensor(padded_token_ids)
    tgt_tokens = torch.LongTensor(padded_labels)
    attention_mask = src_tokens.ne(pad_token_id).type_as(src_tokens)
    return {
        "src_tokens": src_tokens,
        "tgt_tokens": tgt_tokens,
        "attention_mask": attention_mask,
    }
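

# Illustrative sketch (not part of the original script): the padding logic of
# `collate_function` on plain Python lists, without torch. The helper name
# `_example_pad_batch` is hypothetical. Every sequence is right-padded to the
# longest one in the batch, and the attention mask is 1 wherever the padded
# input differs from `pad_token_id`.
def _example_pad_batch(batch, pad_token_id):
    max_length = max(len(token_ids) for token_ids, _ in batch)
    padded_ids = [
        token_ids + [pad_token_id] * (max_length - len(token_ids))
        for token_ids, _ in batch
    ]
    padded_labels = [
        labels + [pad_token_id] * (max_length - len(labels))
        for _, labels in batch
    ]
    attention_mask = [[int(token_id != pad_token_id) for token_id in row]
                      for row in padded_ids]
    return padded_ids, padded_labels, attention_mask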


def masking_function(
        text: str,
        tokenizer: TokenizerType,
        mask_prob: float,
        random_replace_prob: float,
        unmask_replace_prob: float,
        max_length: int,
) -> Tuple[List[int], List[int]]:
    """Given a text string, randomly mask wordpieces for Bert MLM
    training.

    Args:
        text (str):
            The input text
        tokenizer (TokenizerType):
            The tokenizer for tokenization
        mask_prob (float):
            What fraction of tokens to mask
        random_replace_prob (float):
            Of the masked tokens, how many should be replaced with
            random tokens (improves performance)
        unmask_replace_prob (float):
            Of the masked tokens, how many should be replaced with
            the original token (improves performance)
        max_length (int):
            The maximum sequence length to consider. Note that for
            Bert style models, this is a function of the number of
            positional embeddings you learn

    Returns:
        Tuple[List[int], List[int]]:
            The masked token ids (based on the tokenizer passed),
            and the output labels (padded with `tokenizer.pad_token_id`)
    """
    # Note: by default, `encode` adds the BOS and EOS tokens itself.
    # That is disabled here and the two tokens are added explicitly,
    # which makes the sequence layout easier to follow.
    tokenized_ids = ([tokenizer.bos_token_id] +
                     tokenizer.encode(text,
                                      add_special_tokens=False,
                                      truncation=True,
                                      max_length=max_length - 2) +
                     [tokenizer.eos_token_id])
    seq_len = len(tokenized_ids)
    tokenized_ids = np.array(tokenized_ids)
    subword_mask = np.full(len(tokenized_ids), False)

    # Masking the BOS and EOS token leads to slightly worse performance
    low = 1
    high = len(subword_mask) - 1
    mask_choices = np.arange(low, high)
    num_subwords_to_mask = max(
        int((mask_prob * (high - low)) + np.random.rand()), 1)
    subword_mask[np.random.choice(mask_choices,
                                  num_subwords_to_mask,
                                  replace=False)] = True

    # Create the labels first
    labels = np.full(seq_len, tokenizer.pad_token_id)
    labels[subword_mask] = tokenized_ids[subword_mask]

    tokenized_ids[subword_mask] = tokenizer.mask_token_id

    # Now of the masked tokens, choose how many to replace with random and how many to unmask
    rand_or_unmask_prob = random_replace_prob + unmask_replace_prob
    if rand_or_unmask_prob > 0:
        rand_or_unmask = subword_mask & (np.random.rand(len(tokenized_ids)) <
                                         rand_or_unmask_prob)
        if random_replace_prob == 0:
            unmask = rand_or_unmask
            rand_mask = None
        elif unmask_replace_prob == 0:
            unmask = None
            rand_mask = rand_or_unmask
        else:
            unmask_prob = unmask_replace_prob / rand_or_unmask_prob
            decision = np.random.rand(len(tokenized_ids)) < unmask_prob
            unmask = rand_or_unmask & decision
            rand_mask = rand_or_unmask & (~decision)
        if unmask is not None:
            tokenized_ids[unmask] = labels[unmask]
        if rand_mask is not None:
            weights = np.ones(tokenizer.vocab_size)
            weights[tokenizer.all_special_ids] = 0
            probs = weights / weights.sum()
            num_rand = rand_mask.sum()
            tokenized_ids[rand_mask] = np.random.choice(tokenizer.vocab_size,
                                                        num_rand,
                                                        p=probs)
    return tokenized_ids.tolist(), labels.tolist()
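

# Illustrative sketch (not part of the original script): how `masking_function`
# splits the selected positions into three groups: keep the mask token, restore
# the original token, or substitute a random token. The helper name
# `_example_split_masked_positions` is hypothetical, and as an assumption it
# takes an explicit `numpy.random.Generator` instead of using the global
# `np.random` state. With both probabilities at 0.1, roughly 80% of the
# selected positions keep the mask token.
def _example_split_masked_positions(subword_mask, random_replace_prob,
                                    unmask_replace_prob, rng):
    # Import is kept local so the sketch is self-contained
    import numpy as np

    subword_mask = np.asarray(subword_mask, dtype=bool)
    num_positions = len(subword_mask)
    unmask = np.zeros(num_positions, dtype=bool)
    rand_mask = np.zeros(num_positions, dtype=bool)
    total_prob = random_replace_prob + unmask_replace_prob
    if total_prob > 0:
        # First draw: which selected positions leave the mask-token group
        rand_or_unmask = subword_mask & (rng.random(num_positions) < total_prob)
        # Second draw: split that group between "restore" and "random"
        decision = rng.random(num_positions) < (unmask_replace_prob / total_prob)
        unmask = rand_or_unmask & decision
        rand_mask = rand_or_unmask & ~decision
    keep_mask_token = subword_mask & ~unmask & ~rand_mask
    return keep_mask_token, unmask, rand_mask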


class WikiTextMLMDataset(Dataset):
    """A [Map style dataset](https://pytorch.org/docs/stable/data.html)
    for iterating over the wikitext dataset. Note that this assumes
    the dataset can fit in memory. For larger datasets
    you'd want to shard them and use an iterable dataset (eg: see
    [Infinibatch](https://github.com/microsoft/infinibatch))

    Args:
        dataset (datasets.arrow_dataset.Dataset):
            The wikitext dataset
        masking_function (Callable[[str], Tuple[List[int], List[int]]]):
            The masking function. To generate one training instance,
            the masking function is applied to the `text` of a dataset
            record

    """
    def __init__(
        self,
        dataset: datasets.arrow_dataset.Dataset,
        masking_function: Callable[[str], Tuple[List[int], List[int]]],
    ) -> None:
        self.dataset = dataset
        self.masking_function = masking_function

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, idx: int) -> Tuple[List[int], List[int]]:
        tokens, labels = self.masking_function(self.dataset[idx]["text"])
        return (tokens, labels)


T = TypeVar("T")


class InfiniteIterator(object):
    def __init__(self, iterable: Iterable[T]) -> None:
        self._iterable = iterable
        self._iterator = iter(self._iterable)

    def __iter__(self):
        return self

    def __next__(self) -> T:
        next_item = None
        try:
            next_item = next(self._iterator)
        except StopIteration:
            self._iterator = iter(self._iterable)
            next_item = next(self._iterator)
        return next_item
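

# Illustrative sketch (not part of the original script): the restart behaviour
# of `InfiniteIterator` written as a generator. The name `_example_cycle` is
# hypothetical. Unlike `itertools.cycle`, which caches the items of the first
# pass, this calls `iter()` on the iterable again for every pass, so a
# shuffling `DataLoader` would reshuffle each epoch. As an assumption for this
# sketch, an empty iterable simply ends the generator.
def _example_cycle(iterable):
    while True:
        yielded_any = False
        for item in iterable:
            yielded_any = True
            yield item
        if not yielded_any:
            # Nothing to repeat; stop rather than loop forever
            return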


def create_data_iterator(
        mask_prob: float,
        random_replace_prob: float,
        unmask_replace_prob: float,
        batch_size: int,
        max_seq_length: int = 512,
        tokenizer: str = "roberta-base",
) -> InfiniteIterator:
    """Create the dataloader.

    Args:
        mask_prob (float):
            Fraction of tokens to mask
        random_replace_prob (float):
            Fraction of masked tokens to replace with random token
        unmask_replace_prob (float):
            Fraction of masked tokens to replace with the actual token
        batch_size (int):
            The batch size of the generated tensors
        max_seq_length (int, optional):
            The maximum sequence length for the MLM task. Defaults to 512.
        tokenizer (str, optional):
            The tokenizer to use. Defaults to "roberta-base".

    Returns:
        InfiniteIterator:
            The torch DataLoader, wrapped in an InfiniteIterator class, to
            be able to continuously generate samples

    """
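    # Load wikitext-2 through a local dataset script instead of the Hub.
    # Original call:
    #   datasets.load_dataset("wikitext", "wikitext-2-v1", split="train")
    wikitext_dataset = datasets.load_dataset("/home/guodong.li/code/wikitext.py",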
                                             "wikitext-2-v1",
                                             split="train")
    wikitext_dataset = wikitext_dataset.filter(
        lambda record: record["text"] != "").map(
            lambda record: {"text": record["text"].rstrip("\n")})
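    # Load the tokenizer from a local directory. Note that this ignores the
    # `tokenizer` argument. Original call:
    #   AutoTokenizer.from_pretrained(tokenizer)
    tokenizer = AutoTokenizer.from_pretrained("/home/guodong.li/model/roberta-base")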

    masking_function_partial = partial(
        masking_function,
        tokenizer=tokenizer,
        mask_prob=mask_prob,
        random_replace_prob=random_replace_prob,
        unmask_replace_prob=unmask_replace_prob,
        max_length=max_seq_length,
    )
    dataset = WikiTextMLMDataset(wikitext_dataset, masking_function_partial)
    collate_fn_partial = partial(collate_function,
                                 pad_token_id=tokenizer.pad_token_id)
    dataloader = DataLoader(dataset,
                            batch_size=batch_size,
                            shuffle=True,
                            collate_fn=collate_fn_partial)

    return InfiniteIterator(dataloader)


######################################################################
############### Model Creation Related Functions #####################
######################################################################


class RobertaLMHeadWithMaskedPredict(RobertaLMHead):
    def __init__(self,
                 config: RobertaConfig,
                 embedding_weight: Optional[torch.Tensor] = None) -> None:
        super(RobertaLMHeadWithMaskedPredict, self).__init__(config)
        if embedding_weight is not None:
            self.decoder.weight = embedding_weight

    def forward(  # pylint: disable=arguments-differ
        self,
        features: torch.Tensor,
        masked_token_indices: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> torch.Tensor:
        """The current `transformers` library does not provide support
        for masked_token_indices. This function provides the support, by
        running the final forward pass only for the masked indices. This saves
        memory

        Args:
            features (torch.Tensor):
                The features to select from. Shape (batch, seq_len, h_dim)
            masked_token_indices (torch.Tensor, optional):
                The indices of masked tokens for index select. Defaults to None.
                Shape: (num_masked_tokens,)

        Returns:
            torch.Tensor:
                The index selected features. Shape (num_masked_tokens, h_dim)

        """
        if masked_token_indices is not None:
            features = torch.index_select(
                features.view(-1, features.shape[-1]), 0, masked_token_indices)
        return super().forward(features)


class RobertaMLMModel(RobertaPreTrainedModel):
    def __init__(self, config: RobertaConfig, encoder: RobertaModel) -> None:
        super().__init__(config)
        self.encoder = encoder
        self.lm_head = RobertaLMHeadWithMaskedPredict(
            config, self.encoder.embeddings.word_embeddings.weight)
        self.lm_head.apply(self._init_weights)

    def forward(
            self,
            src_tokens: torch.Tensor,
            attention_mask: torch.Tensor,
            tgt_tokens: torch.Tensor,
    ) -> torch.Tensor:
        """The forward pass for the MLM task

        Args:
            src_tokens (torch.Tensor):
                The masked token indices. Shape: (batch, seq_len)
            attention_mask (torch.Tensor):
                The attention mask, since the batches are padded
                to the largest sequence. Shape: (batch, seq_len)
            tgt_tokens (torch.Tensor):
                The output tokens (padded with `config.pad_token_id`)

        Returns:
            torch.Tensor:
                The MLM loss
        """
        # shape: (batch, seq_len, h_dim)
        sequence_output, *_ = self.encoder(input_ids=src_tokens,
                                           attention_mask=attention_mask,
                                           return_dict=False)

        pad_token_id = self.config.pad_token_id
        # (labels have also been padded with pad_token_id)
        # filter out all masked labels
        # shape: (num_masked_tokens,)
        masked_token_indexes = torch.nonzero(
            (tgt_tokens != pad_token_id).view(-1)).view(-1)
        # shape: (num_masked_tokens, vocab_size)
        prediction_scores = self.lm_head(sequence_output, masked_token_indexes)
        # shape: (num_masked_tokens,)
        target = torch.index_select(tgt_tokens.view(-1), 0,
                                    masked_token_indexes)

        loss_fct = nn.CrossEntropyLoss(ignore_index=-1)

        masked_lm_loss = loss_fct(
            prediction_scores.view(-1, self.config.vocab_size), target)
        return masked_lm_loss


def create_model(num_layers: int, num_heads: int, ff_dim: int, h_dim: int,
                 dropout: float) -> RobertaMLMModel:
    """Create a Bert model with the specified `num_heads`, `ff_dim`,
    `h_dim` and `dropout`

    Args:
        num_layers (int):
            The number of layers
        num_heads (int):
            The number of attention heads
        ff_dim (int):
            The intermediate hidden size of
            the feed forward block of the
            transformer
        h_dim (int):
            The hidden dim of the intermediate
            representations of the transformer
        dropout (float):
            The value of dropout to be used.
            Note that we apply the same dropout
            to both the attention layers and the
            FF layers

    Returns:
        RobertaMLMModel:
            A Roberta model for MLM task

    """
    roberta_config_dict = {
        "attention_probs_dropout_prob": dropout,
        "bos_token_id": 0,
        "eos_token_id": 2,
        "hidden_act": "gelu",
        "hidden_dropout_prob": dropout,
        "hidden_size": h_dim,
        "initializer_range": 0.02,
        "intermediate_size": ff_dim,
        "layer_norm_eps": 1e-05,
        "max_position_embeddings": 514,
        "model_type": "roberta",
        "num_attention_heads": num_heads,
        "num_hidden_layers": num_layers,
        "pad_token_id": 1,
        "type_vocab_size": 1,
        "vocab_size": 50265,
    }
    roberta_config = RobertaConfig.from_dict(roberta_config_dict)
    roberta_encoder = RobertaModel(roberta_config)
    roberta_model = RobertaMLMModel(roberta_config, roberta_encoder)
    return roberta_model


######################################################################
########### Experiment Management Related Functions ##################
######################################################################


def get_unique_identifier(length: int = 8) -> str:
    """Create a unique identifier by choosing `length`
    random characters from list of ascii characters and numbers
    """
    alphabet = string.ascii_lowercase + string.digits
    uuid = "".join(alphabet[ix]
                   for ix in np.random.choice(len(alphabet), length))
    return uuid


def create_experiment_dir(checkpoint_dir: pathlib.Path,
                          all_arguments: Dict[str, Any]) -> pathlib.Path:
    """Create an experiment directory and save all arguments in it.
    Additionally, also store the githash and gitdiff. Finally create
    a directory for `Tensorboard` logs. The structure would look something
    like
        checkpoint_dir
            `-experiment-name
                |- hparams.json
                |- githash.log
                |- gitdiff.log
                `- tb_dir/

    Args:
        checkpoint_dir (pathlib.Path):
            The base checkpoint directory
        all_arguments (Dict[str, Any]):
            The arguments to save

    Returns:
        pathlib.Path: The experiment directory
    """
    # experiment name follows the following convention
    # {exp_type}.{YYYY}.{MM}.{DD}.{HH}.{MM}.{SS}.{uuid}
    current_time = datetime.datetime.now(pytz.timezone("US/Pacific"))
    expname = "bert_pretrain.{0}.{1}.{2}.{3}.{4}.{5}.{6}".format(
        current_time.year,
        current_time.month,
        current_time.day,
        current_time.hour,
        current_time.minute,
        current_time.second,
        get_unique_identifier(),
    )
    exp_dir = checkpoint_dir / expname
    if not is_rank_0():
        return exp_dir
    exp_dir.mkdir(exist_ok=False)
    hparams_file = exp_dir / "hparams.json"
    with hparams_file.open("w") as handle:
        json.dump(obj=all_arguments, fp=handle, indent=2)
    # Save the git hash
    try:
        gitlog = sh.git.log("-1", format="%H", _tty_out=False, _fg=False)
        with (exp_dir / "githash.log").open("w") as handle:
            handle.write(gitlog.stdout.decode("utf-8"))
    except sh.ErrorReturnCode_128:
        log_dist(
            "Seems like the code is not running from"
            " within a git repo, so hash will"
            " not be stored. However, it"
            " is strongly advised to use"
            " version control.",
            ranks=[0],
            level=logging.INFO)
    # And the git diff
    try:
        gitdiff = sh.git.diff(_fg=False, _tty_out=False)
        with (exp_dir / "gitdiff.log").open("w") as handle:
            handle.write(gitdiff.stdout.decode("utf-8"))
    except sh.ErrorReturnCode_129:
        log_dist(
            "Seems like the code is not running from"
            " within a git repo, so diff will"
            " not be stored. However, it"
            " is strongly advised to use"
            " version control.",
            ranks=[0],
            level=logging.INFO)
    # Finally create the Tensorboard Dir
    tb_dir = exp_dir / "tb_dir"
    tb_dir.mkdir(exist_ok=False)
    return exp_dir


######################################################################
################ Checkpoint Related Functions ########################
######################################################################


def load_model_checkpoint(
    load_checkpoint_dir: pathlib.Path,
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
) -> Tuple[int, torch.nn.Module, torch.optim.Optimizer]:
    """Loads the optimizer state dict and model state dict from the load_checkpoint_dir
    into the passed model and optimizer. Searches for the most recent checkpoint to
    load from

    Args:
        load_checkpoint_dir (pathlib.Path):
            The base checkpoint directory to load from
        model (torch.nn.Module):
            The model to load the checkpoint weights into
        optimizer (torch.optim.Optimizer):
            The optimizer to load the checkpoint weights into

    Returns:
        Tuple[int, torch.nn.Module, torch.optim.Optimizer]:
            The checkpoint step, model with state_dict loaded and
            optimizer with state_dict loaded

    """
    log_dist(
        f"Loading model and optimizer checkpoint from {load_checkpoint_dir}",
        ranks=[0],
        level=logging.INFO)
    checkpoint_files = list(
        filter(
            lambda path: re.search(r"iter_(?P<iter_no>\d+)\.pt", path.name) is
            not None,
            load_checkpoint_dir.glob("*.pt"),
        ))
    assert len(checkpoint_files) > 0, "No checkpoints found in directory"
    checkpoint_files = sorted(
        checkpoint_files,
        key=lambda path: int(
            re.search(r"iter_(?P<iter_no>\d+)\.pt", path.name).group("iter_no")
        ),
    )
    latest_checkpoint_path = checkpoint_files[-1]
    checkpoint_step = int(
        re.search(r"iter_(?P<iter_no>\d+)\.pt",
                  latest_checkpoint_path.name).group("iter_no"))

    state_dict = torch.load(latest_checkpoint_path)
    model.load_state_dict(state_dict["model"], strict=True)
    optimizer.load_state_dict(state_dict["optimizer"])
    log_dist(
        f"Loading model and optimizer checkpoints done. Loaded from {latest_checkpoint_path}",
        ranks=[0],
        level=logging.INFO)
    return checkpoint_step, model, optimizer


######################################################################
######################## Driver Functions ############################
######################################################################


def train(
        checkpoint_dir: str = None,
        load_checkpoint_dir: str = None,
        # Dataset Parameters
        mask_prob: float = 0.15,
        random_replace_prob: float = 0.1,
        unmask_replace_prob: float = 0.1,
        max_seq_length: int = 512,
        tokenizer: str = "roberta-base",
        # Model Parameters
        num_layers: int = 6,
        num_heads: int = 8,
        ff_dim: int = 512,
        h_dim: int = 256,
        dropout: float = 0.1,
        # Training Parameters
        batch_size: int = 8,
        num_iterations: int = 10000,
        checkpoint_every: int = 1000,
        log_every: int = 10,
        local_rank: int = -1,
) -> pathlib.Path:
    """Trains a [Bert style](https://arxiv.org/pdf/1810.04805.pdf)
    (transformer encoder only) model for MLM Task

    Args:
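        checkpoint_dir (str):
            The base experiment directory to save experiments to.
            Mutually exclusive with `load_checkpoint_dir`.
        load_checkpoint_dir (str):
            An existing experiment directory to resume training from.
            Mutually exclusive with `checkpoint_dir`.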
        mask_prob (float, optional):
            The fraction of tokens to mask. Defaults to 0.15.
        random_replace_prob (float, optional):
            The fraction of masked tokens to replace with random token.
            Defaults to 0.1.
        unmask_replace_prob (float, optional):
            The fraction of masked tokens to leave unchanged.
            Defaults to 0.1.
        max_seq_length (int, optional):
            The maximum sequence length of the examples. Defaults to 512.
        tokenizer (str, optional):
            The tokenizer to use. Defaults to "roberta-base".
        num_layers (int, optional):
            The number of layers in the Bert model. Defaults to 6.
        num_heads (int, optional):
            Number of attention heads to use. Defaults to 8.
        ff_dim (int, optional):
            Size of the intermediate dimension in the FF layer.
            Defaults to 512.
        h_dim (int, optional):
            Size of intermediate representations.
            Defaults to 256.
        dropout (float, optional):
            Amount of dropout to use. Defaults to 0.1.
        batch_size (int, optional):
            The minibatch size. Defaults to 8.
        num_iterations (int, optional):
            Total number of iterations to run the model for.
            Defaults to 10000.
        checkpoint_every (int, optional):
            Save checkpoint after these many steps.

            .. note::

                You want this to be frequent enough that you can
                resume training in case it crashes, but not so much
                that you fill up your entire storage!

            Defaults to 1000.
        log_every (int, optional):
            Print logs after these many steps. Defaults to 10.
        local_rank (int, optional):
            Which GPU to run on (-1 for CPU). Defaults to -1.

    Returns:
        pathlib.Path: The final experiment directory

    """
    device = (torch.device("cuda", local_rank) if (local_rank > -1)
              and torch.cuda.is_available() else torch.device("cpu"))
    ################################
    ###### Create Exp. Dir #########
    ################################
    if checkpoint_dir is None and load_checkpoint_dir is None:
        log_dist(
            "Need to specify one of checkpoint_dir"
            " or load_checkpoint_dir",
            ranks=[0],
            level=logging.ERROR)
        return
    if checkpoint_dir is not None and load_checkpoint_dir is not None:
        log_dist(
            "Cannot specify both checkpoint_dir"
            " and load_checkpoint_dir",
            ranks=[0],
            level=logging.ERROR)
        return
    if checkpoint_dir:
        log_dist("Creating Experiment Directory",
                 ranks=[0],
                 level=logging.INFO)
        checkpoint_dir = pathlib.Path(checkpoint_dir)
        checkpoint_dir.mkdir(exist_ok=True)
        all_arguments = {
            # Dataset Params
            "mask_prob": mask_prob,
            "random_replace_prob": random_replace_prob,
            "unmask_replace_prob": unmask_replace_prob,
            "max_seq_length": max_seq_length,
            "tokenizer": tokenizer,
            # Model Params
            "num_layers": num_layers,
            "num_heads": num_heads,
            "ff_dim": ff_dim,
            "h_dim": h_dim,
            "dropout": dropout,
            # Training Params
            "batch_size": batch_size,
            "num_iterations": num_iterations,
            "checkpoint_every": checkpoint_every,
        }
        exp_dir = create_experiment_dir(checkpoint_dir, all_arguments)
        log_dist(f"Experiment Directory created at {exp_dir}",
                 ranks=[0],
                 level=logging.INFO)
    else:
        log_dist("Loading from Experiment Directory",
                 ranks=[0],
                 level=logging.INFO)
        load_checkpoint_dir = pathlib.Path(load_checkpoint_dir)
        assert load_checkpoint_dir.exists()
        with (load_checkpoint_dir / "hparams.json").open("r") as handle:
            hparams = json.load(handle)
        # Set the hparams
        # Dataset Params
        mask_prob = hparams.get("mask_prob", mask_prob)
        tokenizer = hparams.get("tokenizer", tokenizer)
        random_replace_prob = hparams.get("random_replace_prob",
                                          random_replace_prob)
        unmask_replace_prob = hparams.get("unmask_replace_prob",
                                          unmask_replace_prob)
        max_seq_length = hparams.get("max_seq_length", max_seq_length)
        # Model Params
        ff_dim = hparams.get("ff_dim", ff_dim)
        h_dim = hparams.get("h_dim", h_dim)
        dropout = hparams.get("dropout", dropout)
        num_layers = hparams.get("num_layers", num_layers)
        num_heads = hparams.get("num_heads", num_heads)
        # Training Params
        batch_size = hparams.get("batch_size", batch_size)
        _num_iterations = hparams.get("num_iterations", num_iterations)
        num_iterations = max(num_iterations, _num_iterations)
        checkpoint_every = hparams.get("checkpoint_every", checkpoint_every)
        exp_dir = load_checkpoint_dir
    # Tensorboard writer
    if is_rank_0():
        tb_dir = exp_dir / "tb_dir"
        assert tb_dir.exists()
        summary_writer = SummaryWriter(log_dir=tb_dir)
    ################################
    ###### Create Datasets #########
    ################################
    log_dist("Creating Datasets", ranks=[0], level=logging.INFO)
    data_iterator = create_data_iterator(
        mask_prob=mask_prob,
        random_replace_prob=random_replace_prob,
        unmask_replace_prob=unmask_replace_prob,
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        batch_size=batch_size,
    )
    log_dist("Dataset Creation Done", ranks=[0], level=logging.INFO)
    ################################
    ###### Create Model ############
    ################################
    log_dist("Creating Model", ranks=[0], level=logging.INFO)
    model = create_model(
        num_layers=num_layers,
        num_heads=num_heads,
        ff_dim=ff_dim,
        h_dim=h_dim,
        dropout=dropout,
    )
    log_dist("Model Creation Done", ranks=[0], level=logging.INFO)
    ################################
    ###### DeepSpeed engine ########
    ################################
    log_dist("Creating DeepSpeed engine", ranks=[0], level=logging.INFO)
    ds_config = {
        "train_micro_batch_size_per_gpu": batch_size,
        "optimizer": {
            "type": "Adam",
            "params": {
                "lr": 1e-4
            }
        },
        "fp16": {
            "enabled": True
        },
        "zero_optimization": {
            "stage": 1,
            "offload_optimizer": {
                "device": "cpu"
            }
        }
    }
    log_dist("-----------------------", ranks=[0], level=logging.INFO)
    log_dist(str(ds_config), ranks=[0], level=logging.INFO)

    model, _, _, _ = deepspeed.initialize(model=model,
                                          model_parameters=model.parameters(),
                                          config=ds_config)
    log_dist("DeepSpeed engine created", ranks=[0], level=logging.INFO)
    ################################
    #### Load Model checkpoint #####
    ################################
    start_step = 1
    if load_checkpoint_dir is not None:
        _, client_state = model.load_checkpoint(load_dir=load_checkpoint_dir)
        checkpoint_step = client_state['checkpoint_step']
        start_step = checkpoint_step + 1

    ################################
    ####### The Training Loop ######
    ################################
    log_dist(
        f"Total number of model parameters: {sum([p.numel() for p in model.parameters()]):,d}",
        ranks=[0],
        level=logging.INFO)
    model.train()
    losses = []
    for step, batch in enumerate(data_iterator, start=start_step):
        if step >= num_iterations:
            break
        # Move the tensors to device
        for key, value in batch.items():
            batch[key] = value.to(device)
        # Forward pass
        loss = model(**batch)
        # Backward pass
        model.backward(loss)
        # Optimizer Step
        model.step()
        losses.append(loss.item())
        if step % log_every == 0:
            log_dist("Loss: {0:.4f}".format(np.mean(losses)),
                     ranks=[0],
                     level=logging.INFO)
            if is_rank_0():
                summary_writer.add_scalar("Train/loss", np.mean(losses), step)
        if step % checkpoint_every == 0:
            model.save_checkpoint(save_dir=exp_dir,
                                  client_state={'checkpoint_step': step})
            log_dist("Saved model to {0}".format(exp_dir),
                     ranks=[0],
                     level=logging.INFO)
    # Save the last checkpoint if not saved yet
    if step % checkpoint_every != 0:
        model.save_checkpoint(save_dir=exp_dir,
                              client_state={'checkpoint_step': step})
        log_dist("Saved model to {0}".format(exp_dir),
                 ranks=[0],
                 level=logging.INFO)

    return exp_dir


if __name__ == "__main__":
    torch.manual_seed(42)
    np.random.seed(0)
    random.seed(0)
    fire.Fire(train)


================================================
FILE: ai-framework/deepspeed/training/pipeline_parallelism/README.md
================================================







```
deepspeed --include localhost:3,4,5,6 train.py --deepspeed_config=ds_config.json -p 2 --steps=200
```
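
The command above launches on four local GPUs (3, 4, 5, 6) with `-p 2`. Assuming `-p` is the script's pipeline-parallel size (the README does not say so, and `parallel_layout` below is a hypothetical helper, not part of DeepSpeed), the GPUs would split into pipeline stages and data-parallel replicas roughly as follows:

```python
def parallel_layout(num_gpus: int, pipeline_stages: int) -> dict:
    """Split GPUs into pipeline stages and data-parallel replicas.

    Assumption: every replica of the model spans `pipeline_stages` GPUs,
    so the GPU count must be divisible by the number of stages.
    """
    if num_gpus % pipeline_stages != 0:
        raise ValueError("num_gpus must be divisible by pipeline_stages")
    return {
        "pipeline_stages": pipeline_stages,
        "data_parallel_replicas": num_gpus // pipeline_stages,
    }


# Four GPUs (localhost:3,4,5,6) with -p 2
layout = parallel_layout(num_gpus=4, pipeline_stages=2)
print(layout)
```

Under that assumption, each of the two replicas holds the model split across two GPUs.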










================================================
FILE: ai-framework/dlrover.md
================================================



https://github.com/intelligent-machine-learning/dlrover

DLRover: An Automatic Distributed Deep Learning System







================================================
FILE: ai-framework/huggingface-accelerate/README.md
================================================


- https://huggingface.co/docs/accelerate/package_reference/cli

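```
# Print the environment information
accelerate env

# Create a default config file
accelerate config default [arguments]

# Update an existing config file with the latest defaults
accelerate config update --config_file
```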



## Loading large models with Hugging Face

- Loading and running very large models with Hugging Face's Accelerate library: https://zhuanlan.zhihu.com/p/605640431


```
import torch
from transformers import AutoModelForCausalLM

checkpoint = "facebook/opt-13b"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    offload_folder="offload",
    offload_state_dict=True,
    torch_dtype=torch.float16,
)

```
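
A rough estimate shows why `torch_dtype=torch.float16` and offloading matter for a checkpoint of this size. This is a minimal sketch: it assumes about 13e9 parameters (inferred from the model name only) and counts weights alone, ignoring activations and any framework overhead. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 13e9  # assumption: "opt-13b" has roughly 13 billion parameters

fp16_gib = weight_memory_gib(NUM_PARAMS, 2)  # 2 bytes per fp16 weight
fp32_gib = weight_memory_gib(NUM_PARAMS, 4)  # 4 bytes per fp32 weight
print(f"fp16: {fp16_gib:.1f} GiB, fp32: {fp32_gib:.1f} GiB")
```

If the fp16 figure exceeds the available GPU memory, `device_map="auto"` can place the remaining layers on CPU or in the `offload_folder`.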



================================================
FILE: ai-framework/huggingface-peft/README.md
================================================





================================================
FILE: ai-framework/huggingface-transformers/API.md
================================================





- https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py
- TrainingArguments




## Saving the model

- model.save_pretrained('./path_to_model/')
- model.config.to_json_file("config.json")




## RoPE



rope_scaling
```
Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
accordingly.
Expected contents:
    `rope_type` (`str`):
        The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
        'llama3'], with 'default' being the original RoPE implementation.
    `factor` (`float`, *optional*):
        Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
        most scaling types, a `factor` of x will enable the model to handle sequences of length x *
        original maximum pre-trained length.
```
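
The `factor` rule quoted above can be sketched in a few lines. This is only an illustration of the "x * original maximum pre-trained length" statement, which the docstring says holds for most scaling types; `scaled_context_length` is a hypothetical helper and not part of transformers.

```python
from typing import Optional


def scaled_context_length(original_max_len: int,
                          rope_scaling: Optional[dict]) -> int:
    """Approximate usable sequence length for a rope_scaling dict."""
    if not rope_scaling or rope_scaling.get("rope_type", "default") == "default":
        # 'default' is the original RoPE implementation: no scaling
        return original_max_len
    # For most scaling types, a factor of x gives x * the original length
    return int(original_max_len * rope_scaling["factor"])


# Hypothetical values: a model pre-trained with 4096 positions
rope_scaling = {"rope_type": "linear", "factor": 4.0}
new_len = scaled_context_length(4096, rope_scaling)
print(new_len)
```

As the docstring recommends, `max_position_embeddings` should be updated to match the new length.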




















================================================
FILE: ai-framework/huggingface-transformers/FSDP.md
================================================




- https://pytorch.org/docs/stable/fsdp.html
- https://huggingface.co/docs/accelerate/usage_guides/fsdp


transformers
- https://zhuanlan.zhihu.com/p/648094197
- https://github.com/ifromeast/LLMTrainer/blob/main/02_fsdp/fsdp.json


accelerate
- Fine-tuning Llama 2 70B with PyTorch FSDP: https://zhuanlan.zhihu.com/p/671742753
- https://huggingface.co/docs/transformers/v4.41.0/en/fsdp#fsdp-configuration
- https://huggingface.co/docs/transformers/v4.41.0/en/main_classes/trainer#transformers.TrainingArguments
- https://github.com/pacman100/LLM-Workshop/blob/main/chat_assistant/sft/training/configs/fsdp_config.yaml




================================================
FILE: ai-framework/huggingface-transformers/README.md
================================================


## Quantization

transformers has integrated, and natively supports, the two quantization libraries bitsandbytes and auto-gptq.


- https://huggingface.co/docs/transformers/v4.35.2/en/main_classes/quantization
- More quantization options: https://github.com/huggingface/optimum



### GPTQ quantization


```
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"

quantization_config = GPTQConfig(
     bits=4,
     group_size=128,
     dataset="c4",
     desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
quant_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, device_map='auto')
```
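
To get a feel for what `bits=4` and `group_size=128` mean for storage, the sketch below estimates the effective bits per weight. It assumes each group of weights shares one fp16 scale and one zero point stored at `bits` bits. That layout is an assumption made for illustration, not something this README states, and `effective_bits_per_weight` is a hypothetical helper.

```python
from typing import Optional


def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16,
                              zero_bits: Optional[int] = None) -> float:
    """Average storage per weight, including per-group metadata.

    Assumption: one scale (scale_bits) and one zero point (zero_bits)
    are stored per group of `group_size` weights.
    """
    if zero_bits is None:
        zero_bits = bits
    return bits + (scale_bits + zero_bits) / group_size


bpw = effective_bits_per_weight(bits=4, group_size=128)
print(f"{bpw:.3f} bits per weight")
```

Under the same assumption, a smaller `group_size` raises the per-weight overhead in exchange for finer-grained scales.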


### LLM.int8()-bitsandbytes

```
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloomz-7b1-mt"


tokenizer = AutoTokenizer.from_pretrained(model_name)
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```

```
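from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# model_id is not defined in the original snippet; the checkpoint from the
# 8-bit example above is assumed here.
model_id = "bigscience/bloomz-7b1-mt"

# 4-bit loading with nested (double) quantization enabled
double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model_double_quant = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=double_quant_config)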
```








================================================
FILE: ai-framework/huggingface-trl/README.md
================================================











================================================
FILE: ai-framework/jax/README.md
================================================



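Of all the projects I have read, JAX is the only one that made me think, "Wow, software can be written like this, and everything about it makes sense." I think Google learned a great deal from its experience with TensorFlow and applied those lessons to JAX.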






================================================
FILE: ai-framework/jax/reference.md
================================================




- https://jax.readthedocs.io/en/latest/notebooks/neural_network_with_tfds_data.html
- https://github.com/google/jax




================================================
FILE: ai-framework/llama-cpp/README.md
================================================



- https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file
- https://github.com/ggerganov/llama.cpp





GGUF quantization format

- ctransformers, llama.cpp



```
CMAKE_ARGS="-DGGML_METAL=on" pip install -U llama-cpp-python --no-cache-dir
pip install 'llama-cpp-python[server]'
```


```
export MODEL=/Users/liguodong/model/qwen2/qwen2-0_5b-instruct-q2_k.gguf
python3 -m llama_cpp.server --model $MODEL  --n_gpu_layers 1
```




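## Quantization

- https://github.com/ggerganov/llama.cpp/pull/1684 (worth reading)
- An overview of model quantization techniques and the GGUF/GGML file formats, including the differences between Q4_0, Q4_1, Q4_K and Q4_K_M: https://blog.csdn.net/weixin_42426841/article/details/142706753
- LLM quantization showdown: which LLM quantization suits you best?: https://zhuanlan.zhihu.com/p/8936080946
- An introduction to the quantization methods in llama.cpp: https://zhuanlan.zhihu.com/p/12729759086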



- https://github.com/ggerganov/llama.cpp/blob/3e693197724c31d53a9b69018c2f1bd0b93ebab2/examples/quantize/quantize.cpp#L18

Q4_K_M, Q5_K_S and Q5_K_M are considered "recommended".



Quantization methods:
```
static const std::vector<struct quant_option> QUANT_OPTIONS = {
    { "Q4_0",     LLAMA_FTYPE_MOSTLY_Q4_0,     " 4.34G, +0.4685 ppl @ Llama-3-8B",  },
    { "Q4_1",     LLAMA_FTYPE_MOSTLY_Q4_1,     " 4.78G, +0.4511 ppl @ Llama-3-8B",  },
    { "Q5_0",     LLAMA_FTYPE_MOSTLY_Q5_0,     " 5.21G, +0.1316 ppl @ Llama-3-8B",  },
    { "Q5_1",     LLAMA_FTYPE_MOSTLY_Q5_1,     " 5.65G, +0.1062 ppl @ Llama-3-8B",  },
    { "IQ2_XXS",  LLAMA_FTYPE_MOSTLY_IQ2_XXS,  " 2.06 bpw quantization",            },
    { "IQ2_XS",   LLAMA_FTYPE_MOSTLY_IQ2_XS,   " 2.31 bpw quantization",            },
    { "IQ2_S",    LLAMA_FTYPE_MOSTLY_IQ2_S,    " 2.5  bpw quantization",            },
    { "IQ2_M",    LLAMA_FTYPE_MOSTLY_IQ2_M,    " 2.7  bpw quantization",            },
    { "IQ1_S",    LLAMA_FTYPE_MOSTLY_IQ1_S,    " 1.56 bpw quantization",            },
    { "IQ1_M",    LLAMA_FTYPE_MOSTLY_IQ1_M,    " 1.75 bpw quantization",            },
    { "TQ1_0",    LLAMA_FTYPE_MOSTLY_TQ1_0,    " 1.69 bpw ternarization",           },
    { "TQ2_0",    LLAMA_FTYPE_MOSTLY_TQ2_0,    " 2.06 bpw ternarization",           },
    { "Q2_K",     LLAMA_FTYPE_MOSTLY_Q2_K,     " 2.96G, +3.5199 ppl @ Llama-3-8B",  },
    { "Q2_K_S",   LLAMA_FTYPE_MOSTLY_Q2_K_S,   " 2.96G, +3.1836 ppl @ Llama-3-8B",  },
    { "IQ3_XXS",  LLAMA_FTYPE_MOSTLY_IQ3_XXS,  " 3.06 bpw quantization",            },
    { "IQ3_S",    LLAMA_FTYPE_MOSTLY_IQ3_S,    " 3.44 bpw quantization",            },
    { "IQ3_M",    LLAMA_FTYPE_MOSTLY_IQ3_M,    " 3.66 bpw quantization mix",        },
    { "Q3_K",     LLAMA_FTYPE_MOSTLY_Q3_K_M,   "alias for Q3_K_M"                   },
    { "IQ3_XS",   LLAMA_FTYPE_MOSTLY_IQ3_XS,   " 3.3 bpw quantization",             },
    { "Q3_K_S",   LLAMA_FTYPE_MOSTLY_Q3_K_S,   " 3.41G, +1.6321 ppl @ Llama-3-8B",  },
    { "Q3_K_M",   LLAMA_FTYPE_MOSTLY_Q3_K_M,   " 3.74G, +0.6569 ppl @ Llama-3-8B",  },
    { "Q3_K_L",   LLAMA_FTYPE_MOSTLY_Q3_K_L,   " 4.03G, +0.5562 ppl @ Llama-3-8B",  },
    { "IQ4_NL",   LLAMA_FTYPE_MOSTLY_IQ4_NL,   " 4.50 bpw non-linear quantization", },
    { "IQ4_XS",   LLAMA_FTYPE_MOSTLY_IQ4_XS,   " 4.25 bpw non-linear quantization", },
    { "Q4_K",     LLAMA_FTYPE_MOSTLY_Q4_K_M,   "alias for Q4_K_M",                  },
    { "Q4_K_S",   LLAMA_FTYPE_MOSTLY_Q4_K_S,   " 4.37G, +0.2689 ppl @ Llama-3-8B",  },
    { "Q4_K_M",   LLAMA_FTYPE_MOSTLY_Q4_K_M,   " 4.58G, +0.1754 ppl @ Llama-3-8B",  },
    { "Q5_K",     LLAMA_FTYPE_MOSTLY_Q5_K_M,   "alias for Q5_K_M",                  },
    { "Q5_K_S",   LLAMA_FTYPE_MOSTLY_Q5_K_S,   " 5.21G, +0.1049 ppl @ Llama-3-8B",  },
    { "Q5_K_M",   LLAMA_FTYPE_MOSTLY_Q5_K_M,   " 5.33G, +0.0569 ppl @ Llama-3-8B",  },
    { "Q6_K",     LLAMA_FTYPE_MOSTLY_Q6_K,     " 6.14G, +0.0217 ppl @ Llama-3-8B",  },
    { "Q8_0",     LLAMA_FTYPE_MOSTLY_Q8_0,     " 7.96G, +0.0026 ppl @ Llama-3-8B",  },
    { "F16",      LLAMA_FTYPE_MOSTLY_F16,      "14.00G, +0.0020 ppl @ Mistral-7B",  },
    { "BF16",     LLAMA_FTYPE_MOSTLY_BF16,     "14.00G, -0.0050 ppl @ Mistral-7B",  },
    { "F32",      LLAMA_FTYPE_ALL_F32,         "26.00G              @ 7B",          },
    // Note: Ensure COPY comes after F32 to avoid ftype 0 from matching.
    { "COPY",     LLAMA_FTYPE_ALL_F32,         "only copy tensors, no quantizing",  },
};
```
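
The size and perplexity figures in the table can be used to pick a quantization type mechanically. The sketch below copies a few of the Llama-3-8B rows from the listing above and returns the smallest option whose perplexity increase stays under a chosen budget. `smallest_within` and the thresholds are illustrative assumptions, not part of llama.cpp.

```python
from typing import Optional

# (name, size in G, perplexity increase) copied from the Llama-3-8B rows above
QUANT_ROWS = [
    ("Q2_K", 2.96, 3.5199),
    ("Q3_K_M", 3.74, 0.6569),
    ("Q4_0", 4.34, 0.4685),
    ("Q4_K_S", 4.37, 0.2689),
    ("Q4_K_M", 4.58, 0.1754),
    ("Q5_K_S", 5.21, 0.1049),
    ("Q5_K_M", 5.33, 0.0569),
    ("Q6_K", 6.14, 0.0217),
    ("Q8_0", 7.96, 0.0026),
]


def smallest_within(max_ppl_increase: float) -> Optional[str]:
    """Return the smallest quant type whose ppl increase fits the budget."""
    candidates = [row for row in QUANT_ROWS if row[2] <= max_ppl_increase]
    if not candidates:
        return None
    # Sort by file size and take the smallest
    return min(candidates, key=lambda row: row[1])[0]


choice_loose = smallest_within(0.2)
choice_tight = smallest_within(0.1)
print(choice_loose, choice_tight)
```

With these rows, both budgets land on one of the types the README lists as "recommended".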











================================================
FILE: ai-framework/megatron-deepspeed/README.md
================================================


================================================
FILE: ai-framework/megatron-lm/README.md
================================================


================================================
FILE: ai-framework/mxnet/README.md
================================================



## Installation

```
pip install --upgrade mxnet gluonnlp

pip install  mxnet==1.9.1 gluonnlp==0.10.0
```

## docker 

```
# GPU Instance
docker pull gluonai/gluon-nlp:gpu-latest
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 --shm-size=2g gluonai/gluon-nlp:gpu-latest

# CPU Instance
docker pull gluonai/gluon-nlp:cpu-latest
docker run --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 --shm-size=2g gluonai/gluon-nlp:cpu-latest
```



```
docker run --gpus all  -itd \
--ipc=host \
--network host \
--shm-size=4g \
-v /home/guodong.li/workspace/:/workspace/ \
--name mxnet_dev \
gluonai/gluon-nlp:gpu-latest \
/bin/bash


docker exec -it mxnet_dev bash

pip uninstall mxnet-cu102
pip install  mxnet==1.9.1 gluonnlp==0.10.0 -i http://nexus3.xxx.com/repository/pypi/simple --trusted-host nexus3.xxx.com
```



================================================
FILE: ai-framework/mxnet/mnist.py
================================================


# pylint: skip-file
from __future__ import print_function

import argparse
import logging
logging.basicConfig(level=logging.DEBUG)

import numpy as np
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon import nn

# Parse CLI arguments

parser = argparse.ArgumentParser(description='MXNet Gluon MNIST Example')
parser.add_argument('--batch-size', type=int, default=100,
                    help='batch size for training and testing (default: 100)')
parser.add_argument('--epochs', type=int, default=10,
                    help='number of epochs to train (default: 10)')
parser.add_argument('--lr', type=float, default=0.1,
                    help='learning rate (default: 0.1)')
parser.add_argument('--momentum', type=float, default=0.9,
                    help='SGD momentum (default: 0.9)')
parser.add_argument('--cuda', action='store_true', default=False,
                    help='Train on GPU with CUDA')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                    help='how many batches to wait before logging training status')
opt = parser.parse_args()


# define network

net = nn.Sequential()
with net.name_scope():
    net.add(nn.Dense(128, activation='relu'))
    net.add(nn.Dense(64, activation='relu'))
    net.add(nn.Dense(10))

# data

def transformer(data, label):
    data = data.reshape((-1,)).astype(np.float32)/255
    return data, label

train_data = gluon.data.DataLoader(
    gluon.data.vision.MNIST('./data', train=True, transform=transformer),
    batch_size=opt.batch_size, shuffle=True, last_batch='discard')

val_data = gluon.data.DataLoader(
    gluon.data.vision.MNIST('./data', train=False, transform=transformer),
    batch_size=opt.batch_size, shuffle=False)

# train

def test(ctx):
    metric = mx.metric.Accuracy()
    for data, label in val_data:
        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx)
        output = net(data)
        metric.update([label], [output])

    return metric.get()


def train(epochs, ctx):
    # Collect all parameters from net and its children, then initialize them.
    net.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
    # Trainer is for updating parameters with gradient.
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': opt.lr, 'momentum': opt.momentum})
    metric = mx.metric.Accuracy()
    loss = gluon.loss.SoftmaxCrossEntropyLoss()

    for epoch in range(epochs):
        # reset the metric at the beginning of each epoch.
        metric.reset()
        for i, (data, label) in enumerate(train_data):
            # Copy data to ctx if necessary
            data = data.as_in_context(ctx)
            label = label.as_in_context(ctx)
            # Start recording computation graph with record() section.
            # Recorded graphs can then be differentiated with backward.
            with autograd.record():
                output = net(data)
                L = loss(output, label)
                L.backward()
            # take a gradient step with batch_size equal to data.shape[0]
            trainer.step(data.shape[0])
            # update metric at last.
            metric.update([label], [output])

            if i % opt.log_interval == 0 and i > 0:
                name, acc = metric.get()
                print('[Epoch %d Batch %d] Training: %s=%f'%(epoch, i, name, acc))

        name, acc = metric.get()
        print('[Epoch %d] Training: %s=%f'%(epoch, name, acc))

        name, val_acc = test(ctx)
        print('[Epoch %d] Validation: %s=%f'%(epoch, name, val_acc))

    net.save_parameters('mnist.params')


if __name__ == '__main__':
    if opt.cuda:
        ctx = mx.gpu(0)
    else:
        ctx = mx.cpu()
    train(opt.epochs, ctx)

================================================
FILE: ai-framework/mxnet/mxnet_cnn_mnist.py
================================================
from __future__ import print_function

import argparse
import logging
logging.basicConfig(level=logging.INFO)

import numpy as np
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon import nn
from mxnet.gluon.data.vision import transforms
from mxnet import gluon, autograd as ag, nd
import os

"""
python /workspace/code/mxnet_cnn_mnist.py \
--train-dataset-path '/workspace/data/mxnet/' \
--test-dataset-path '/workspace/data/mxnet/' \
--output-path "/workspace/output/mxnet_model"

"""

import struct
import gzip
import matplotlib.pyplot as plt
from PIL import Image



def get_mnist(path='data'):
    """Read the gzipped MNIST idx files stored under `path`."""
    def read_data(label_url, image_url):
        if not os.path.isdir(path):
            os.makedirs(path)
        with gzip.open(label_url) as flbl:
            # skip the 8-byte header (magic number, item count)
            struct.unpack(">II", flbl.read(8))
            label = np.frombuffer(flbl.read(), dtype=np.int8)
        with gzip.open(image_url, 'rb') as fimg:
            # 16-byte header: magic number, image count, rows, cols
            _, _, rows, cols = struct.unpack(">IIII", fimg.read(16))
            image = np.frombuffer(fimg.read(), dtype=np.uint8).reshape(len(label), rows, cols)
            # add a channel axis and scale pixel values to [0, 1]
            image = image.reshape(image.shape[0], 1, 28, 28).astype(np.float32)/255
        return (label, image)

    # os.path.join works whether or not `path` ends with a separator
    (train_lbl, train_img) = read_data(
        os.path.join(path, 'train-labels-idx1-ubyte.gz'),
        os.path.join(path, 'train-images-idx3-ubyte.gz'))
    (test_lbl, test_img) = read_data(
        os.path.join(path, 't10k-labels-idx1-ubyte.gz'),
        os.path.join(path, 't10k-images-idx3-ubyte.gz'))
    return {'train_data':train_img, 'train_label':train_lbl,
            'test_data':test_img, 'test_label':test_lbl}



transform = transforms.Compose([
    # Convert a PIL Image or numpy.ndarray to a tensor and scale it to [0, 1] by dividing by 255
    transforms.ToTensor(),
    # Normalize with mean 0.5 and std 0.5 so that the model converges more easily
    transforms.Normalize((0.5,),(0.5,))
    ])


# define network
class NeuralNetwork(gluon.Block):
    def __init__(self):
        super(NeuralNetwork, self).__init__()

        # Convolution layer: 20 output channels, 5x5 kernel, stride 1, padding 2
        self.conv1 = nn.Conv2D(20, kernel_size=(5,5), strides=(1,1), padding=(2,2), activation='relu')
        # Max-pooling layer: 2x2 window, stride 2
        self.max_pool1 = nn.MaxPool2D(pool_size=(2,2), strides=(2,2))

        # Convolution layer: 20 output channels, 5x5 kernel, stride 1, padding 2
        self.conv2 = nn.Conv2D(20, kernel_size=5, strides=(1,1), padding=(2,2), activation='relu')
        # Max-pooling layer: 2x2 window, stride 2
        self.max_pool2 = nn.MaxPool2D(pool_size=(2,2), strides=(2,2))
        # Fully connected layer with an output dimension of 10
        self.fc = nn.Dense(10)


    def forward(self, x):
        x = self.conv1(x)
        x = self.max_pool1(x)
        x = self.conv2(x)
        x = self.max_pool2(x)
        x = x.reshape((x.shape[0], -1))
        x = self.fc(x)
        return x

net = NeuralNetwork()


# train
def test(ctx, val_data):
    metric = mx.metric.Accuracy()
    
    # reset the data iterator before evaluation
    val_data.reset()

    for batch in val_data:
        # data = data.as_in_context(ctx)
        # label = label.as_in_context(ctx)
        data = gluon.utils.split_and_load(batch.data[0], ctx_list=ctx, batch_axis=0)
        # Split the validation labels into slices along batch_axis
        # and copy each slice into a context.
        label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0)
        outputs = []
        for x in data:
            output = net(x)
            outputs.append(output)
        metric.update(label, outputs)

    return metric.get()




train_acc_list = []
val_acc_list = []



def train(args, ctx, train_data, val_data):
    # Collect all parameters from net and its children, then initialize them.
    net.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
    # Trainer is for updating parameters with gradient.
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': args.lr, 'momentum': args.momentum})
    metric = mx.metric.Accuracy()
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    
    for epoch in range(args.epochs):
        # Reset the data iterator and metric at the beginning of each epoch.
        metric.reset()
        train_data.reset()
        # for i, batch in enumerate(train_data):
        #     data = gluon.utils.split_and_load(batch.data[0], ctx_list=ctx, batch_axis=0)
        #         # Splits train labels into multiple slices along batch_axis
        #     # and copy each slice into a context.
        #     label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0)

        for i, batch in enumerate(train_data):
            # Copy data to ctx if necessary
            # data = data.as_in_context(ctx)
            # label = label.as_in_context(ctx)
            data = gluon.utils.split_and_load(batch.data[0], ctx_list=ctx, batch_axis=0)
            # Splits train labels into multiple slices along batch_axis
            # and copy each slice into a context.
            label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0)

            outputs = []
            # Start recording computation graph with record() section.
            # Recorded graphs can then be differentiated with backward.
            with autograd.record():
                for x, y in zip(data, label):
                    z = net(x)
                    L = loss(z, y)
                    L.backward()
                    outputs.append(z)

            # take a gradient step with batch_size equal to data.shape[0]
            trainer.step(batch.data[0].shape[0])
            # update metric at last.
            metric.update(label, outputs)

            if i % args.log_interval == 0 and i > 0:
                name, acc = metric.get()
                print('[Epoch %d Batch %d] Training: %s=%f'%(epoch, i, name, acc))

        name, acc = metric.get()
        print('[Epoch %d] Training: %s=%f'%(epoch, name, acc))


        name, val_acc = test(ctx, val_data)
        print('[Epoch %d] Validation: %s=%f'%(epoch, name, val_acc))
        train_acc_list.append(acc)
        val_acc_list.append(val_acc)



def plot(train_acc_list, val_acc_list, output_path):
    fig, ax = plt.subplots()

    train_freqs = list(range(len(train_acc_list)))
    val_freqs = list(range(len(val_acc_list)))

    # Plot the training and validation accuracy curves
    ax.plot(train_freqs, train_acc_list, color='#e4007f', label="train/accuracy curve")
    ax.plot(val_freqs, val_acc_list, color='#fff000', label="val/accuracy curve")

    # Set the axis labels, title, and legend
    ax.set_ylabel("accuracy", fontsize='large')
    ax.set_xlabel("epoch", fontsize='large')
    ax.set_title("image classification")
    ax.legend(loc='upper right', fontsize='x-large')

    plt.savefig(output_path+'/mxnet_cnn_image_classification_accuracy_curve.png')
    # plt.show()



def main(ctx):
    parser = argparse.ArgumentParser(description='MXNet Gluon MNIST Example')
    parser.add_argument("--pretrain-model-path", dest="pretrain_model_path", required=False, type=str, default=None, help="path to the pretrained model")
    parser.add_argument("--train-dataset-path", type=str, default="/Users/liguodong/data/mnist", help="path to the training dataset")
    parser.add_argument("--test-dataset-path", type=str, default="/Users/liguodong/data/mnist", help="path to the test dataset")
    parser.add_argument("--output-path", type=str, default="/Users/liguodong/output/oneflow_model",help="output path for the model")
    parser.add_argument('--batch-size', type=int, default=100,
                        help='batch size for training and testing (default: 100)')
    parser.add_argument('--epochs', type=int, default=10,
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=0.1,
                        help='learning rate (default: 0.1)')
    parser.add_argument('--momentum', type=float, default=0.9,
                        help='SGD momentum (default: 0.9)')
    parser.add_argument('--cuda', action='store_true', default=False,
                        help='Train on GPU with CUDA')
    parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                        help='how many batches to wait before logging training status')

    args = parser.parse_args()
    train_dataset_path = args.train_dataset_path
    test_dataset_path = args.test_dataset_path
    output_path = args.output_path
    pretrain_model_path = args.pretrain_model_path

    # Create the output directory if it does not already exist
    os.makedirs(output_path, exist_ok=True)

    # train_dataset = gluon.data.vision.MNIST(train_dataset_path, train=True)
    # train_data = gluon.data.DataLoa

gitextract_50r359j2/

├── .gitignore
├── LICENSE
├── README.md
├── ai-compiler/
│   ├── README.md
│   ├── Treebeard/
│   │   └── README.md
│   ├── treelit/
│   │   ├── README.md
│   │   └── xgb.md
│   └── triton-lang/
│       └── README.md
├── ai-framework/
│   ├── README.md
│   ├── TensorRT-Model-Optimizer.md
│   ├── cuda/
│   │   └── README.md
│   ├── deepspeed/
│   │   ├── 1.DeepSpeed入门.md
│   │   ├── 2.安装DeepSpeed.md
│   │   ├── 3.基于CIFAR-10使用DeepSpeed进行分布式训练 .md
│   │   ├── DeepSpeed配置JSON文件.md
│   │   ├── README.md
│   │   ├── config-json/
│   │   │   ├── README.md
│   │   │   └── deepspeed-nvme.md
│   │   ├── deepspeed-slurm.md
│   │   ├── hello_bert/
│   │   │   ├── README.md
│   │   │   ├── train_bert.py
│   │   │   └── train_bert_ds.py
│   │   └── training/
│   │       └── pipeline_parallelism/
│   │           └── README.md
│   ├── dlrover.md
│   ├── huggingface-accelerate/
│   │   └── README.md
│   ├── huggingface-peft/
│   │   └── README.md
│   ├── huggingface-transformers/
│   │   ├── API.md
│   │   ├── FSDP.md
│   │   └── README.md
│   ├── huggingface-trl/
│   │   └── README.md
│   ├── jax/
│   │   ├── README.md
│   │   └── reference.md
│   ├── llama-cpp/
│   │   └── README.md
│   ├── megatron-deepspeed/
│   │   └── README.md
│   ├── megatron-lm/
│   │   └── README.md
│   ├── mxnet/
│   │   ├── README.md
│   │   ├── mnist.py
│   │   ├── mxnet_cnn_mnist.py
│   │   ├── mxnet_mlp_mnist.py
│   │   ├── oneflow_cnn_mnist.py
│   │   ├── oneflow_mlp_mnist.py
│   │   └── reference.md
│   ├── oneflow/
│   │   ├── README.md
│   │   ├── oneflow_mlp_mnist.py
│   │   └── reference.md
│   ├── openai-triton/
│   │   └── README.md
│   ├── paddlepaddle/
│   │   ├── README.md
│   │   └── reference.md
│   ├── pai-megatron-patch/
│   │   └── README.md
│   ├── pai-torchacc.md
│   ├── pytorch/
│   │   ├── README.md
│   │   ├── install.md
│   │   └── reference.md
│   ├── tensorflow/
│   │   ├── README.md
│   │   └── reference.md
│   ├── transformer-engine/
│   │   └── mnist/
│   │       ├── README.md
│   │       ├── main.py
│   │       └── main_stat.py
│   └── unsloth-微调.md
├── ai-infra/
│   ├── ai-cluster/
│   │   └── README.md
│   ├── ai-hardware/
│   │   ├── AI芯片软件生态.md
│   │   ├── CUDA.md
│   │   ├── GPU-network.md
│   │   ├── GPU相关环节变量.md
│   │   ├── NIXL.md
│   │   ├── OEM-DGX.md
│   │   ├── README.md
│   │   ├── TSMC-台积电.md
│   │   ├── cuda镜像.md
│   │   ├── gpudirect.md
│   │   └── 硬件对比.md
│   ├── communication.md
│   ├── 存储/
│   │   ├── README.md
│   │   ├── REF.md
│   │   ├── nvme-ssd.md
│   │   ├── 固态硬盘.md
│   │   └── 存储.md
│   ├── 算力/
│   │   ├── AI芯片.md
│   │   ├── GPU工作原理.md
│   │   ├── NVIDIA-GPU型号.md
│   │   ├── 推理芯片.md
│   │   └── 昇腾NPU.md
│   └── 网络/
│       ├── HPC性能测试.md
│       ├── IB-docker.md
│       ├── IB流量监控.md
│       ├── IB软件.md
│       ├── InfiniBand.md
│       ├── NCCL.md
│       ├── README.md
│       ├── REF.md
│       ├── Spine-Leaf和InfiniBand网络架构区别简述.md
│       ├── nccl-test-集合通讯的性能测试.md
│       ├── nvbandwidth.md
│       ├── roce.md
│       ├── 网络硬件.md
│       ├── 通信软件.md
│       └── 集合通信原语.md
├── blog/
│   ├── TODO.md
│   ├── ai-infra/
│   │   ├── AI 集群基础设施 InfiniBand 详解.md
│   │   └── AI 集群基础设施 NVMe SSD 详解.md
│   ├── distribution-parallelism/
│   │   ├── 大模型分布式训练并行技术（一）-概述.md
│   │   ├── 大模型分布式训练并行技术（九）-总结.md
│   │   └── 大模型分布式训练并行技术（六）-多维混合并行.md
│   ├── llm-algo/
│   │   ├── moe.md
│   │   └── 大白话Transformer架构.md
│   ├── llm-compression/
│   │   ├── 大模型量化技术原理-ZeroQuant系列.md
│   │   └── 大模型量化技术原理：QoQ量化及QServe推理服务系统.md
│   ├── llm-inference/
│   │   └── 大模型推理框架概述.md
│   ├── llm-localization/
│   │   ├── 大模型国产化适配1-华为昇腾AI全栈软硬件平台总结.md
│   │   └── 大模型国产化适配4-基于昇腾910使用LLaMA-13B进行多机多卡训练.md
│   ├── llm-peft/
│   │   ├── 大模型参数高效微调技术原理综述（一）-背景、参数高效微调简介.md
│   │   └── 大模型参数高效微调技术原理综述（五）-LoRA、AdaLoRA、QLoRA.md
│   └── reference/
│       └── 高性能 LLM 推理框架的设计与实现.md
├── docs/
│   ├── README.md
│   ├── conda.md
│   ├── flash-attention/
│   │   └── FlashAttention.md
│   ├── llm-base/
│   │   ├── FLOPS.md
│   │   ├── NVIDIA-Nsight-Systems性能分析.md
│   │   ├── README.md
│   │   ├── a800-env-install.md
│   │   ├── ai-algo.md
│   │   ├── autoregressive-lm-decoding-methods.md
│   │   ├── dcgmi.md
│   │   ├── distribution-parallelism/
│   │   │   ├── README.md
│   │   │   ├── auto-parallel/
│   │   │   │   ├── Alpa.md
│   │   │   │   ├── Flexflow.md
│   │   │   │   ├── Galvatron.md
│   │   │   │   ├── Mesh-Tensorflow.md
│   │   │   │   ├── README.md
│   │   │   │   ├── Unity.md
│   │   │   │   ├── auto-parallel.md
│   │   │   │   ├── gspmd.md
│   │   │   │   ├── 分布式训练自动并行概述.md
│   │   │   │   └── 飞桨面向异构场景下的自动并行设计与实践.md
│   │   │   ├── data-parallelism/
│   │   │   │   └── README.md
│   │   │   ├── moe-parallel/
│   │   │   │   ├── README.md
│   │   │   │   ├── moe-framework.md
│   │   │   │   ├── moe-parallel.md
│   │   │   │   └── paddle_moe.py
│   │   │   ├── multidimensional-hybrid-parallel/
│   │   │   │   └── README.md
│   │   │   ├── pipeline-parallelism/
│   │   │   │   └── README.md
│   │   │   ├── tensor-parallel/
│   │   │   │   ├── README.md
│   │   │   │   └── tensor-parallel.md
│   │   │   └── 并行技术.drawio
│   │   ├── distribution-training/
│   │   │   ├── Bloom-176B训练经验.md
│   │   │   ├── FP16-BF16.md
│   │   │   ├── GLM-130B训练经验.md
│   │   │   ├── OPT-175B训练经验.md
│   │   │   ├── README.md
│   │   │   └── 自动混合精度.md
│   │   ├── gpu-env-var.md
│   │   ├── h800-env-install.md
│   │   ├── monitor.md
│   │   ├── multimodal/
│   │   │   └── sora.md
│   │   ├── nvidia-smi-dmon.md
│   │   ├── nvidia-smi.md
│   │   ├── rlhf/
│   │   │   └── README.md
│   │   ├── scenes/
│   │   │   ├── README.md
│   │   │   ├── cv/
│   │   │   │   ├── README.md
│   │   │   │   ├── paddle/
│   │   │   │   │   └── README.md
│   │   │   │   ├── pytorch/
│   │   │   │   │   └── README.md
│   │   │   │   └── reference.md
│   │   │   └── multi-modal/
│   │   │       ├── README.md
│   │   │       └── reference.md
│   │   ├── singularity命令.md
│   │   ├── slurm.md
│   │   ├── 分布式训练加速技术.md
│   │   ├── 多机RDMA性能测试.txt
│   │   └── 机器学习中常用的数据类型.md
│   ├── llm-experience.md
│   ├── llm-inference/
│   │   ├── DeepSpeed-Inference.md
│   │   ├── KV-Cache.md
│   │   ├── LLM服务框架对比.md
│   │   ├── README.md
│   │   ├── blog.md
│   │   ├── flexflow/
│   │   │   └── 投机采样.md
│   │   ├── llm推理优化技术.md
│   │   ├── llm推理框架.md
│   │   └── vllm.md
│   ├── llm-peft/
│   │   ├── LoRA-FA.md
│   │   ├── MAM_Adapter.md
│   │   ├── README.md
│   │   └── ReLoRA.md
│   ├── llm-summarize/
│   │   ├── README.md
│   │   ├── distribution_dl_roadmap.md
│   │   ├── 大模型实践总结-20230930.md
│   │   ├── 大模型实践总结.md
│   │   ├── 文档大模型.md
│   │   ├── 金融大模型.md
│   │   └── 领域大模型.md
│   └── transformer内存估算.md
├── faq/
│   └── FAQ.md
├── git-pull-push.sh
├── llm-algo/
│   ├── FLOPs.md
│   ├── InternLM-20B.md
│   ├── README.md
│   ├── baichuan2/
│   │   └── baichuan.md
│   ├── bert/
│   │   └── 模型架构.md
│   ├── bert.md
│   ├── bloom/
│   │   └── README.md
│   ├── bloom.md
│   ├── chatglm/
│   │   ├── README.md
│   │   └── 模型架构.md
│   ├── chatglm2/
│   │   ├── README.md
│   │   └── 模型架构.md
│   ├── chatglm3/
│   │   ├── README.md
│   │   └── reference.md
│   ├── chatgpt/
│   │   └── README.md
│   ├── deepseek/
│   │   ├── DeepSeek-R1.md
│   │   ├── DeepSeek-V2.md
│   │   ├── DeepSeek-V3.md
│   │   └── README.md
│   ├── glm-130b/
│   │   └── README.md
│   ├── glm4.md
│   ├── gpt/
│   │   └── README.md
│   ├── gpt2/
│   │   ├── README.md
│   │   ├── hf_modeling_gpt2.py
│   │   └── 模型架构.md
│   ├── gpt3/
│   │   └── README.md
│   ├── llama/
│   │   ├── README.md
│   │   └── 模型架构.md
│   ├── llama.md
│   ├── mixtral/
│   │   └── README.md
│   ├── mlp.md
│   ├── moe/
│   │   └── README.md
│   ├── qwen/
│   │   ├── README.md
│   │   └── 参数说明及函数说明.md
│   ├── qwen2.md
│   ├── t5/
│   │   └── README.md
│   ├── transformer/
│   │   ├── README.md 
│   │   ├── Transformer中FFN的记忆功能.md
│   │   └── 模型架构.md
│   ├── transformer.md
│   ├── 基本概念.md
│   ├── 旋转编码RoPE.md
│   ├── 模型架构类图.drawio
│   └── 训练范式.md
├── llm-alignment/
│   ├── DPO.md
│   ├── README.md
│   ├── RLHF.md
│   └── 基本概念.md
├── llm-application/
│   ├── Higress.md
│   ├── README.md
│   ├── agent/
│   │   ├── OpenClaw.md
│   │   └── OpenCode/
│   │       └── README.md
│   ├── embbedding-model.md
│   ├── gradio/
│   │   └── README.md
│   ├── langchain/
│   │   ├── README.md
│   │   ├── serve.py
│   │   └── tutorials/
│   │       ├── client.py
│   │       └── serve.py
│   ├── one-api.md
│   ├── pre-post-handle/
│   │   └── README.md
│   ├── rag/
│   │   ├── README.md
│   │   ├── embedding.md
│   │   ├── 存在的一些问题.md
│   │   └── 方案.md
│   ├── vector-db/
│   │   ├── README.md
│   │   └── reference.md
│   └── 应用场景.md
├── llm-compression/
│   ├── PaddleSlim/
│   │   ├──  quantization.md
│   │   └── README.md
│   ├── README.md
│   ├── distillation/
│   │   ├── GKD.md
│   │   ├── MINILLM.md
│   │   ├── README.md
│   │   ├── SCOTT.md
│   │   └── 大模型蒸馏概述.md
│   ├── gptqmodel/
│   │   └── README.md
│   ├── llm-compressor/
│   │   ├── README.md
│   │   ├── source-code.md
│   │   ├── 剪枝.md
│   │   └── 量化方案.md
│   ├── quantization/
│   │   ├── FP6-LLM.md
│   │   ├── GPTQ.md
│   │   ├── LLM-int8.md
│   │   ├── PEQA.md
│   │   ├── QQQ-W4A8.md
│   │   ├── README.md
│   │   ├── SmoothQuant.md
│   │   ├── SpinQuant.md
│   │   ├── ZeroQuant(4+2).md
│   │   ├── ZeroQuant.md
│   │   ├── fp4.md
│   │   ├── fp6.md
│   │   ├── fp8.md
│   │   ├── kv-cache-quant.md
│   │   ├── llm-qat/
│   │   │   ├── LLM-QAT.md
│   │   │   ├── README.md
│   │   │   ├── cfd70ff/
│   │   │   │   ├── README.md
│   │   │   │   ├── generate_data.py
│   │   │   │   ├── inference.py
│   │   │   │   ├── merge_gen_data.py
│   │   │   │   ├── pip.conf
│   │   │   │   ├── run_train.sh
│   │   │   │   ├── train.py
│   │   │   │   └── utils.py
│   │   │   ├── f4d873a/
│   │   │   │   ├── datautils.py
│   │   │   │   ├── run_train.sh
│   │   │   │   └── train.py
│   │   │   └── log.md
│   │   ├── moe模型量化.md
│   │   ├── tools.md
│   │   ├── 可视化/
│   │   │   ├── README.md
│   │   │   ├── qwen_activate_visual.ipynb
│   │   │   └── qwen_visual.ipynb
│   │   ├── 大模型量化概述.md
│   │   └── 量化基础.md
│   ├── sparsity/
│   │   └── README.md
│   ├── tools.md
│   ├── 大模型压缩综述.md
│   └── 经验.md
├── llm-data-engineering/
│   ├── README.md
│   ├── dataset/
│   │   ├── README.md
│   │   ├── baichuan2.md
│   │   ├── chinese-corpus-all.md
│   │   └── english-corpus-all.md
│   ├── reference.md
│   └── sft-dataset/
│       ├── baichuan2_test.py
│       ├── evol-instruct.md
│       ├── firefly-template.py
│       ├── jinja-demo.py
│       ├── jinja-llm-baichuan.py
│       ├── jinja-llm-baichuan2.py
│       ├── jinja-llm-bloom.py
│       ├── jinja-llm-chatglm3.py
│       ├── jinja-llm.py
│       ├── jinja.md
│       ├── 数据格式设计.md
│       └── 数据集格式.md
├── llm-eval/
│   ├── EvalScope.md
│   ├── README.md
│   ├── eval-data/
│   │   ├── longtext_L115433-question.txt
│   │   ├── longtext_L115433.txt
│   │   ├── longtext_L32503_answer.txt
│   │   ├── longtext_L32503_question.txt
│   │   ├── longtext_L64031.txt
│   │   └── longtext_L64031_question.txt
│   ├── llm-performance/
│   │   ├── AI芯片性能.md
│   │   ├── README.md
│   │   ├── hardware-performance/
│   │   │   ├── gpu-monitor-ui.py
│   │   │   └── pynvml-stat-memory.py
│   │   ├── llmperf.md
│   │   ├── mindie/
│   │   │   ├── lantency/
│   │   │   │   ├── README.md
│   │   │   │   ├── perfermance-stat.py
│   │   │   │   ├── performance-stream-baichuan2.py
│   │   │   │   ├── performance-stream-chatglm3.py
│   │   │   │   ├── performance-stream-qwen1.5.py
│   │   │   │   ├── performance-stream-qwen1.py
│   │   │   │   ├── performance-stream.py
│   │   │   │   └── stat_input_token.py
│   │   │   └── locust-lantency-throughput/
│   │   │       ├── README.md
│   │   │       ├── hello.py
│   │   │       ├── llm-910b4-baichuan2-7b-2tp.py
│   │   │       ├── llm-910b4-chatglm3-6b-2tp.py
│   │   │       ├── llm-910b4-qwen-72b-8tp.py
│   │   │       ├── llm-910b4-qwen1.5-4tp.py
│   │   │       ├── qwen1.5-72b-8tp.html
│   │   │       └── 示例.py
│   │   ├── perfetto.md
│   │   ├── stat_gpu_memory.py
│   │   ├── tgi-benchmark.md
│   │   ├── vllm/
│   │   │   ├── README.md
│   │   │   ├── vllm-locust-qwen1.5-7b-long.py
│   │   │   └── vllm-performance-stream-qwen1.5-long.py
│   │   ├── vllm-benchmark.md
│   │   ├── wrk-性能测试工具.md
│   │   ├── 大模型场景下训练和推理性能指标名词解释.md
│   │   ├── 推理性能测试.md
│   │   └── 训练性能测试.md
│   ├── llm-precision/
│   │   ├── C-Eval.md
│   │   ├── README.md
│   │   └── 模型质量评估.md
│   ├── opencompass.md
│   └── 大模型测评集.md
├── llm-inference/
│   ├── DeepSpeed-Inference.md
│   ├── Flash-Decoding.md
│   ├── FlashInfer.md
│   ├── FlexFlow-Serve.md
│   ├── GuidedGeneration.md
│   ├── KV-Cache优化.md
│   ├── Mooncake.md
│   ├── NanoFlow.md
│   ├── PD分离.md
│   ├── README.md
│   ├── RTP-LLM.md
│   ├── ascend/
│   │   └── mindformers/
│   │       ├── README.md
│   │       ├── baichuan2/
│   │       │   ├── README.md
│   │       │   ├── baichuan-inference.py
│   │       │   └── baichuan-stat.py
│   │       ├── chatglm3/
│   │       │   ├── README.md
│   │       │   ├── chatglm-gen.py
│   │       │   ├── chatglm-inference.py
│   │       │   └── chatglm-stat.py
│   │       ├── mindsporelite-inference.py
│   │       ├── mindsporelite-stat.py
│   │       └── text_generator_infer.py
│   ├── chatgpt.md
│   ├── deepspeed-mii/
│   │   └── README.md
│   ├── faster-transformer/
│   │   ├── README.md
│   │   ├── bloom/
│   │   │   ├── README.md
│   │   │   └── firefly_lambada_1w_stat_token.py
│   │   ├── gpt/
│   │   │   └── README.md
│   │   ├── llama/
│   │   │   └── README.md
│   │   └── megatron-gpt2/
│   │       ├── gpt_summarization.py
│   │       ├── gpt_summarization_stat.py
│   │       └── megatron-gpt2-fp8.md
│   ├── flexflow-serve/
│   │   └── benchmark-batch1.py
│   ├── huggingface-tgi/
│   │   └── README.md
│   ├── huggingface-transformer/
│   │   └── README.md
│   ├── lightllm/
│   │   └── README.md
│   ├── lmdeploy/
│   │   ├── README.md
│   │   ├── 功能.md
│   │   └── 服务启动参数.md
│   ├── native-model/
│   │   └── chatglm3-6b/
│   │       └── cli_demo.py
│   ├── offload.md
│   ├── openai.md
│   ├── sglang/
│   │   ├── README.md
│   │   ├── source-code.md
│   │   ├── 服务器启动参数.md
│   │   └── 项目代码结构.md
│   ├── tensorrt/
│   │   ├── README.md
│   │   └── install.md
│   ├── tensorrt-llm/
│   │   ├── FP8.md
│   │   ├── Memory Usage of TensorRT-LLM.md
│   │   ├── README.md
│   │   ├── TRT-LLM引擎构建参数.md
│   │   ├── Triton服务启动参数.md
│   │   └── 安装.md
│   ├── triton/
│   │   ├── REAEME.md
│   │   ├── onnx/
│   │   │   └── README.md
│   │   └── resnet50/
│   │       ├── client.py
│   │       ├── config.pbtxt
│   │       ├── labels.txt
│   │       └── resnet50_convert_torchscript.py
│   ├── vllm/
│   │   ├── FAQ.md
│   │   ├── FP8.md
│   │   ├── README.md
│   │   ├── REF.md
│   │   ├── api_client.py
│   │   ├── cmd.md
│   │   ├── vllm.md
│   │   ├── 服务启动参数.md
│   │   ├── 源码.md
│   │   ├── 请求处理流程.md
│   │   └── 长文本推理.md
│   ├── web/
│   │   ├── fastapi/
│   │   │   ├── README.md
│   │   │   └── llm-qwen-mindspore-lite.py
│   │   ├── flask/
│   │   │   ├── README.md
│   │   │   └── llm-qwen-mindspore-lite.py
│   │   └── sanic/
│   │       └── README.md
│   ├── xinference/
│   │   └── README.md
│   ├── 分离式推理架构.md
│   ├── 大模型推理张量并行.md
│   └── 解码策略.md
├── llm-interview/
│   ├── README.md
│   ├── base.md
│   ├── comprehensive.md
│   ├── llm-algo.md
│   ├── llm-app.md
│   ├── llm-compress.md
│   ├── llm-eval.md
│   ├── llm-ft.md
│   ├── llm-inference.md
│   ├── llm-rlhf.md
│   └── llm-train.md
├── llm-localization/
│   ├── README.md
│   ├── ascend/
│   │   ├── FAQ.md
│   │   ├── README.md
│   │   ├── ascend-c/
│   │   │   └── README.md
│   │   ├── ascend-infra/
│   │   │   ├── HCCL.md
│   │   │   ├── MacOS环境.md
│   │   │   ├── ascend-dmi.md
│   │   │   ├── ascend-docker-runtime.md
│   │   │   ├── ascend-docker.md
│   │   │   ├── ascend-llm下载.md
│   │   │   ├── ascend-npu-smi.md
│   │   │   ├── docker环境升级cann.md
│   │   │   ├── network.md
│   │   │   ├── npu监控.md
│   │   │   ├── 操作系统.md
│   │   │   ├── 昇腾卡-soc版本.md
│   │   │   ├── 昇腾卡注意事项.md
│   │   │   ├── 昇腾镜像.md
│   │   │   ├── 服务器配置.md
│   │   │   ├── 环境安装.md
│   │   │   └── 达芬奇架构.md
│   │   ├── ascend910-env-install.md
│   │   ├── fabric-insight/
│   │   │   └── README.md
│   │   ├── firefly-ascend.md
│   │   ├── mindformers/
│   │   │   ├── README.md
│   │   │   ├── baichuan2/
│   │   │   │   ├── baichuan2训练.md
│   │   │   │   ├── run_baichuan2_7b.yaml
│   │   │   │   ├── run_baichuan2_7b_910b.yaml
│   │   │   │   └── run_baichuan2_7b_lora_910b.yaml
│   │   │   ├── chatglm/
│   │   │   │   ├── README.md
│   │   │   │   ├── chat_glm.py
│   │   │   │   ├── glm_6b.yaml
│   │   │   │   ├── glm_6b_chat.yaml
│   │   │   │   ├── merge_ckpt.py
│   │   │   │   ├── merge_ckpt_lora.py
│   │   │   │   ├── pt2ms.py
│   │   │   │   ├── run_glm_6b_finetune.yaml
│   │   │   │   ├── run_glm_6b_infer.yaml
│   │   │   │   ├── run_glm_6b_lora.yaml
│   │   │   │   └── run_glm_6b_lora_infer.yaml
│   │   │   ├── env.md
│   │   │   ├── llama/
│   │   │   │   └── README.md
│   │   │   ├── qwen/
│   │   │   │   ├── qwen1训练.md
│   │   │   │   ├── run_qwen_7b.yaml
│   │   │   │   └── run_qwen_7b_910b.yaml
│   │   │   ├── qwen1.5/
│   │   │   │   ├── qwen1.5训练.md
│   │   │   │   ├── run_qwen1_5_7b_finetune.yaml
│   │   │   │   └── run_qwen1_5_7b_infer.yaml
│   │   │   ├── trick.md
│   │   │   └── 权重格式转换.md
│   │   ├── mindie/
│   │   │   ├── 2.0.RC2/
│   │   │   │   └── qwen.md
│   │   │   ├── README.md
│   │   │   ├── config/
│   │   │   │   ├── chatglm3-6b.json
│   │   │   │   ├── qwen-72b.json
│   │   │   │   └── run.sh
│   │   │   ├── config-1.0.RC1.json
│   │   │   ├── docker/
│   │   │   │   ├── README.md
│   │   │   │   ├── TEST.md
│   │   │   │   ├── baichuan2-13b.json
│   │   │   │   ├── baichuan2-7b.json
│   │   │   │   ├── deploy.sh
│   │   │   │   ├── install_and_enable_cann.sh
│   │   │   │   ├── llm-server.sh
│   │   │   │   ├── mindie-1.0.Dockerfile
│   │   │   │   ├── mindie-all-1.0.Dockerfile
│   │   │   │   ├── mindie-env-1.0.Dockerfile
│   │   │   │   ├── qwen-72b.json
│   │   │   │   ├── qwen1.5-14b.json
│   │   │   │   ├── qwen1.5-72b.json
│   │   │   │   └── qwen1.5-7b.json
│   │   │   ├── llm-server.sh
│   │   │   ├── mindid-1.0-offical.md
│   │   │   ├── mindid-performance.md
│   │   │   ├── mindie-1.0.Dockerfile
│   │   │   ├── mindie-1.0.RC2.md
│   │   │   ├── mindie-1.0.md
│   │   │   ├── mindie-1.0.rc2-config.json
│   │   │   ├── mindie-1.0.rc2-llm-server.sh
│   │   │   ├── mindie-2.0.rc2.md
│   │   │   ├── mindie-20240411.md
│   │   │   ├── mindie-api.md
│   │   │   ├── model-test.md
│   │   │   ├── script/
│   │   │   │   ├── model-test.py
│   │   │   │   └── run.sh
│   │   │   ├── 性能调优.md
│   │   │   └── 日志分析.txt
│   │   ├── mindspore/
│   │   │   ├── MindSpore-note.md
│   │   │   ├── README.md
│   │   │   ├── bert.md
│   │   │   ├── reference.md
│   │   │   └── 镜像.md
│   │   ├── modellink/
│   │   │   ├── README.md
│   │   │   ├── dataset.md
│   │   │   ├── llm.md
│   │   │   ├── qwen.md
│   │   │   ├── 环境-20240521.md
│   │   │   └── 环境安装.md
│   │   ├── msmodelslim/
│   │   │   ├── README.md
│   │   │   └── llm_quant/
│   │   │       ├── baichuan2-w8a8.py
│   │   │       ├── calib_set.json
│   │   │       └── qwen1.5-72b-w8a16.py
│   │   ├── openmind/
│   │   │   └── README.md
│   │   ├── peft/
│   │   │   ├── README.md
│   │   │   └── finetune-lora.py
│   │   ├── pytorch/
│   │   │   ├── README.md
│   │   │   └── llm-lora.py
│   │   ├── standford-alpaca/
│   │   │   ├── README.md
│   │   │   ├── ds_config_zero2.json
│   │   │   ├── ds_config_zero3.json
│   │   │   ├── requirements.txt
│   │   │   ├── train.py
│   │   │   └── utils.py
│   │   ├── transformers/
│   │   │   └── README.md
│   │   ├── vllm-ascend/
│   │   │   └── README.md
│   │   ├── 优质学习资料.md
│   │   ├── 昇腾LLM支持概览.md
│   │   └── 昇腾卡注意事项.md
│   ├── modelscope/
│   │   └── README.md
│   ├── paddle/
│   │   └── PaddleNLP.md
│   └── tianshuzhixin/
│       ├── README.md
│       └── ixsmi.md
├── llm-maas/
│   ├── OpenAI-ChatGPT.md
│   └── README.md
├── llm-optimizer/
│   ├── FlashAttention.md
│   ├── README.md
│   ├── SplitFuse.md
│   ├── kv-cache.md
│   ├── xformers.md
│   └── 计算通信重叠.md
├── llm-pipeline/
│   └── REAEMD.md
├── llm-tools/
│   ├── Pytorch-Profiler.md
│   ├── README.md
│   ├── base-profiler.py
│   ├── nsight/
│   │   └── README.md
│   ├── nsight.md
│   ├── nvtx.md
│   ├── profiler-recipe.py
│   ├── tensorboard-profiler.py
│   └── 可视化.md
├── llm-train/
│   ├── README.md
│   ├── alpa/
│   │   └── train/
│   │       ├── pipeshard_parallelism.ipynb
│   │       └── pipeshard_parallelism.py
│   ├── alpaca/
│   │   ├── README.md
│   │   ├── ds_config.json
│   │   ├── ds_config_zero2.json
│   │   ├── ds_config_zero2_ddp.json
│   │   ├── inference.py
│   │   ├── train.py
│   │   └── train_ddp.py
│   ├── alpaca-lora/
│   │   ├── README.md
│   │   ├── export_hf_checkpoint.py
│   │   ├── export_state_dict_checkpoint.py
│   │   ├── finetune.py
│   │   ├── finetune_metrics_epoch.py
│   │   ├── generate.py
│   │   └── inference.py
│   ├── chatglm/
│   │   ├── README.md
│   │   ├── deepspeed.json
│   │   ├── ds_train_finetune.sh
│   │   ├── evaluate.sh
│   │   ├── evaluate_finetune.sh
│   │   ├── inference.py
│   │   ├── main.py
│   │   ├── train.sh
│   │   └── train_ptuningv2_dp.sh
│   ├── chatglm-lora/
│   │   ├── README.md
│   │   ├── finetune.py
│   │   ├── finetune_ddp.py
│   │   └── inference.py
│   ├── chinese-llama-alpaca/
│   │   ├── README.md
│   │   ├── inference_hf.py
│   │   ├── merge_llama_with_chinese_lora.py
│   │   ├── merge_tokenizers.py
│   │   ├── run_clm_pt_with_peft.py
│   │   ├── run_clm_sft_with_peft.py
│   │   ├── run_pt.sh
│   │   └── run_sft.sh
│   ├── deepspeedchat/
│   │   ├── README.md
│   │   ├── llama/
│   │   │   └── README.md
│   │   └── training/
│   │       ├── step1_supervised_finetuning/
│   │       │   └── training_scripts/
│   │       │       └── single_node/
│   │       │           └── run_13b.sh
│   │       ├── step2_reward_model_finetuning/
│   │       │   └── training_scripts/
│   │       │       └── single_node/
│   │       │           └── run_350m.sh
│   │       ├── step3_rlhf_finetuning/
│   │       │   └── training_scripts/
│   │       │       └── single_node/
│   │       │           └── run_13b.sh
│   │       └── utils/
│   │           └── data/
│   │               └── raw_datasets.py
│   ├── firefly/
│   │   ├── README.md
│   │   ├── bootstrap-s3.sh
│   │   ├── bootstrap.sh
│   │   ├── dockerfile.md
│   │   └── test_bash_getopts.sh
│   ├── fp8.md
│   ├── galore/
│   │   └── torchrun_main.py
│   ├── megatron/
│   │   ├── README.md
│   │   ├── codegeex/
│   │   │   └── README.md
│   │   ├── gpt2/
│   │   │   ├── README.md
│   │   │   ├── data/
│   │   │   │   ├── cMinhash.cpp
│   │   │   │   ├── download.py
│   │   │   │   ├── file_utils.py
│   │   │   │   └── merge_data.py
│   │   │   ├── gpt-data-preprocess.md
│   │   │   ├── merge_ck_and_inference/
│   │   │   │   ├── README.md
│   │   │   │   ├── checkpoint_loader_megatron.py
│   │   │   │   ├── checkpoint_saver_megatron.py
│   │   │   │   ├── checkpoint_util.py
│   │   │   │   ├── eval_gpt2_lambada.sh
│   │   │   │   ├── run_text_generation_server.py
│   │   │   │   ├── run_text_generation_server_345M.sh
│   │   │   │   ├── run_text_generation_server_345M_2tp_2dp.sh
│   │   │   │   ├── run_text_generation_server_345M_4_tensor_parallel.sh
│   │   │   │   └── text_generation_cli.py
│   │   │   ├── model_merge_eval_inference.md
│   │   │   ├── model_train.md
│   │   │   ├── requirements.txt
│   │   │   └── train/
│   │   │       ├── pretrain_gpt.sh
│   │   │       ├── pretrain_gpt_distributed.sh
│   │   │       ├── pretrain_gpt_distributed_with_4pp.sh
│   │   │       ├── pretrain_gpt_distributed_with_4tp.sh
│   │   │       └── pretrain_gpt_distributed_with_mp.sh
│   │   ├── megatron.drawio
│   │   ├── pretrain.xmind
│   │   ├── project.md
│   │   └── source-code.md
│   ├── megatron-deepspeed/
│   │   ├── README.md
│   │   ├── bigscience/
│   │   │   └── bloom-note.md
│   │   ├── bloom-megatron-deepspeed.md
│   │   ├── microsoft/
│   │   │   ├── H800多机多卡训练坑点.md
│   │   │   ├── README.md
│   │   │   ├── llama-note.md
│   │   │   ├── pip.conf
│   │   │   ├── pretrain_llama2_13b_distributed_fp16.sh
│   │   │   ├── pretrain_llama2_distributed.sh
│   │   │   ├── pretrain_llama_13b_distributed_fp16.sh
│   │   │   ├── pretrain_llama_7b_distributed_fp16.sh
│   │   │   ├── pretrain_llama_distributed_fp16.sh
│   │   │   ├── slurm/
│   │   │   │   ├── README.md
│   │   │   │   ├── llama-multinode-ib.sh
│   │   │   │   ├── megatron-deepspeed-multinode-ib-part2-30b-fp16.slurm
│   │   │   │   └── megatron-deepspeed-multinode-ib-part2-65b-fp16.slurm
│   │   │   ├── 代码.md
│   │   │   ├── 环境准备.md
│   │   │   ├── 训练日志分析.md
│   │   │   └── 项目结构-202312228.md
│   │   └── source-code.md
│   ├── paddle/
│   │   ├── README.md
│   │   └── paddlenlp/
│   │       ├── README.md
│   │       ├── baichuan2/
│   │       │   └── README.md
│   │       └── bloom/
│   │           ├── README.md
│   │           └── sft_argument.json
│   ├── peft/
│   │   ├── LoRA-QLoRA.md
│   │   ├── PEFT-API.md
│   │   ├── Prefix-Tuning.md
│   │   ├── Prompt-Tuning.md
│   │   ├── README.md
│   │   ├── clm/
│   │   │   ├── accelerate_ds_zero3_cpu_offload_config.yaml
│   │   │   ├── peft_ia3_clm.ipynb
│   │   │   ├── peft_lora_clm.ipynb
│   │   │   ├── peft_lora_clm_accelerate_ds_zero3_offload.py
│   │   │   ├── peft_p_tuning_clm.ipynb
│   │   │   ├── peft_p_tuning_lstm_clm.ipynb
│   │   │   ├── peft_p_tuning_v2_clm.ipynb
│   │   │   ├── peft_prefix_tuning_clm.ipynb
│   │   │   └── peft_prompt_tuning_clm.ipynb
│   │   ├── conditional_generation/
│   │   │   └── README.md
│   │   └── multimodal/
│   │       ├── blip2_lora_inference.py
│   │       ├── blip2_lora_int8_fine_tune.py
│   │       └── finetune_bloom_bnb_peft.ipynb
│   ├── pytorch/
│   │   ├── Pytorch源码解读.md
│   │   ├── README.md
│   │   ├── api.md
│   │   ├── distribution/
│   │   │   ├── README.md
│   │   │   ├── api.md
│   │   │   ├── data-parallel/
│   │   │   │   ├── README.md
│   │   │   │   ├── ddp_launch.py
│   │   │   │   ├── ddp_main.py
│   │   │   │   ├── elastic_ddp.py
│   │   │   │   ├── minGPT-ddp/
│   │   │   │   │   ├── README.md
│   │   │   │   │   ├── multinode.sh
│   │   │   │   │   ├── sbatch_run.sh
│   │   │   │   │   ├── sbatch_run_sig.sh
│   │   │   │   │   └── sbatch_run_sig_opt.sh
│   │   │   │   ├── sbatch_run.sh
│   │   │   │   └── 使用DDP训练真实世界的模型.md
│   │   │   ├── pipeline-parallel/
│   │   │   │   ├── 1-流水线.md
│   │   │   │   ├── 2-使用torchtext训练transformer模型.md
│   │   │   │   ├── 3-使用流水线并行训练Transformer模型.md
│   │   │   │   ├── 4-使用DDP与流水线并行训练Transformer模型.md
│   │   │   │   ├── README.md
│   │   │   │   ├── ddp_pipeline.py
│   │   │   │   ├── pipeline_tutorial.ipynb
│   │   │   │   └── transformer_tutorial.ipynb
│   │   │   ├── rpc/
│   │   │   │   └── README.md
│   │   │   ├── sequence-parallelism/
│   │   │   │   └── README.md
│   │   │   ├── tensor-parallel/
│   │   │   │   ├── 2d_parallel_example.py
│   │   │   │   ├── README.md
│   │   │   │   ├── sequence_parallel_example.py
│   │   │   │   ├── tensor_parallel_example.py
│   │   │   │   └── utils.py
│   │   │   ├── torchrun.md
│   │   │   ├── 分布式通信包.md
│   │   │   ├── 多机多卡.md
│   │   │   └── 多机训练.md
│   │   ├── resource.md
│   │   └── torchrun.md
│   ├── qlora/
│   │   ├── README.md
│   │   ├── accuracy.py
│   │   ├── export_hf_checkpoint.py
│   │   ├── inference.py
│   │   ├── inference_merge.py
│   │   ├── inference_qlora.py
│   │   └── qlora.py
│   ├── slurm/
│   │   ├── README.md
│   │   ├── deepspeed/
│   │   │   ├── pp-multinode-machine.slurm
│   │   │   ├── pp-multinode-singularity.slurm
│   │   │   ├── pp-mutinode-singularity-pmix.slurm
│   │   │   ├── pp-standalone-singularity-v2.slurm
│   │   │   └── pp-standalone-singularity.slurm
│   │   ├── megatron-deepspeed/
│   │   │   └── megatron-deepspeed-multinode-ib-part2-65b-fp16.slurm
│   │   └── pytorch/
│   │       ├── alpaca-docker.slurm
│   │       ├── alpaca-machine.slurm
│   │       ├── alpaca-singularity.slurm
│   │       ├── mingpt-singularity-multinode-2.slurm
│   │       └── mingpt-singularity-multinode.slurm
│   └── vicuna/
│       └── README.md
├── llmops/
│   ├── FAQ.md
│   ├── README.md
│   ├── kubernetes.md
│   ├── tq-llm/
│   │   └── train/
│   │       ├── FAQ.md
│   │       ├── README.md
│   │       ├── bootstrap-llm-zero3-offload.sh
│   │       ├── bootstrap-llm.sh
│   │       ├── bootstrap-llm2.sh
│   │       ├── zero2-offload.json
│   │       └── zero3-offload.json
│   ├── 使用docker进行多机多卡训练.md
│   ├── 千帆大模型平台.md
│   └── 模型推理平台方案.md
├── mkdir-dir-file.sh
├── paper/
│   ├── A Survey on Efficient Training of Transformers.md
│   ├── LESS-选择有影响力的数据进行目标指令精调.md
│   ├── LLM增强LLMS.md
│   ├── PagedAttention.md
│   ├── README.md
│   ├── data/
│   │   ├── LESS 实践：仅用少量的数据完成目标指令微调.md
│   │   ├── LESS-选择有影响力的数据进行目标指令精调.md
│   │   └── LESS.md
│   ├── inference/
│   │   ├── llm-in-a-flash.md
│   │   ├── orca.md
│   │   └── 迈向高效的生成式大语言模型服务综述-从算法到系统.md
│   ├── llm对齐综述.md
│   ├── moe/
│   │   └── README.md
│   ├── parameter-pruning/
│   │   ├── LLM-Pruner.md
│   │   ├── SparseGPT.md
│   │   ├── Wanda.md
│   │   └── 公式.md
│   └── training/
│       ├── A Survey on Efficient Training of Transformers.md
│       ├── GaLore.md
│       └── Reducing Activation Recomputation in Large Transformer Models.md
└── template/
    └── server.md

SYMBOL INDEX (1065 symbols across 89 files)

FILE: ai-framework/deepspeed/hello_bert/train_bert.py
  function collate_function (line 36) | def collate_function(batch: List[Tuple[List[int], List[int]]],
  function masking_function (line 61) | def masking_function(
  class WikiTextMLMDataset (line 152) | class WikiTextMLMDataset(Dataset):
    method __init__ (line 168) | def __init__(
    method __len__ (line 176) | def __len__(self) -> int:
    method __getitem__ (line 179) | def __getitem__(self, idx: int) -> Tuple[List[int], List[int]]:
  class InfiniteIterator (line 188) | class InfiniteIterator(object):
    method __init__ (line 189) | def __init__(self, iterable: Iterable[T]) -> None:
    method __iter__ (line 194) | def __iter__(self):
    method __next__ (line 199) | def __next__(self) -> T:
  function create_data_iterator (line 209) | def create_data_iterator(
  class RobertaLMHeadWithMaskedPredict (line 278) | class RobertaLMHeadWithMaskedPredict(RobertaLMHead):
    method __init__ (line 279) | def __init__(self,
    method forward (line 286) | def forward(  # pylint: disable=arguments-differ
  class RobertaMLMModel (line 315) | class RobertaMLMModel(RobertaPreTrainedModel):
    method __init__ (line 316) | def __init__(self, config: RobertaConfig, encoder: RobertaModel) -> None:
    method forward (line 323) | def forward(
  function create_model (line 368) | def create_model(num_layers: int, num_heads: int, ff_dim: int, h_dim: int,
  function get_unique_identifier (line 427) | def get_unique_identifier(length: int = 8) -> str:
  function create_experiment_dir (line 437) | def create_experiment_dir(checkpoint_dir: pathlib.Path,
  function load_model_checkpoint (line 509) | def load_model_checkpoint(
  function train (line 566) | def train(

FILE: ai-framework/deepspeed/hello_bert/train_bert_ds.py
  function collate_function (line 36) | def collate_function(batch: List[Tuple[List[int], List[int]]],
  function masking_function (line 61) | def masking_function(
  class WikiTextMLMDataset (line 152) | class WikiTextMLMDataset(Dataset):
    method __init__ (line 168) | def __init__(
    method __len__ (line 176) | def __len__(self) -> int:
    method __getitem__ (line 179) | def __getitem__(self, idx: int) -> Tuple[List[int], List[int]]:
  class InfiniteIterator (line 187) | class InfiniteIterator(object):
    method __init__ (line 188) | def __init__(self, iterable: Iterable[T]) -> None:
    method __iter__ (line 192) | def __iter__(self):
    method __next__ (line 195) | def __next__(self) -> T:
  function create_data_iterator (line 205) | def create_data_iterator(
  class RobertaLMHeadWithMaskedPredict (line 269) | class RobertaLMHeadWithMaskedPredict(RobertaLMHead):
    method __init__ (line 270) | def __init__(self,
    method forward (line 277) | def forward(  # pylint: disable=arguments-differ
  class RobertaMLMModel (line 306) | class RobertaMLMModel(RobertaPreTrainedModel):
    method __init__ (line 307) | def __init__(self, config: RobertaConfig, encoder: RobertaModel) -> None:
    method forward (line 314) | def forward(
  function create_model (line 359) | def create_model(num_layers: int, num_heads: int, ff_dim: int, h_dim: int,
  function get_unique_identifier (line 416) | def get_unique_identifier(length: int = 8) -> str:
  function create_experiment_dir (line 426) | def create_experiment_dir(checkpoint_dir: pathlib.Path,
  function load_model_checkpoint (line 498) | def load_model_checkpoint(
  function train (line 555) | def train(
  function is_rank_0 (line 828) | def is_rank_0() -> bool:
  function log_dist (line 839) | def log_dist(message: str,
  function collate_function (line 860) | def collate_function(batch: List[Tuple[List[int], List[int]]],
  function masking_function (line 885) | def masking_function(
  class WikiTextMLMDataset (line 976) | class WikiTextMLMDataset(Dataset):
    method __init__ (line 992) | def __init__(
    method __len__ (line 1000) | def __len__(self) -> int:
    method __getitem__ (line 1003) | def __getitem__(self, idx: int) -> Tuple[List[int], List[int]]:
  class InfiniteIterator (line 1011) | class InfiniteIterator(object):
    method __init__ (line 1012) | def __init__(self, iterable: Iterable[T]) -> None:
    method __iter__ (line 1016) | def __iter__(self):
    method __next__ (line 1019) | def __next__(self) -> T:
  function create_data_iterator (line 1029) | def create_data_iterator(
  class RobertaLMHeadWithMaskedPredict (line 1093) | class RobertaLMHeadWithMaskedPredict(RobertaLMHead):
    method __init__ (line 1094) | def __init__(self,
    method forward (line 1101) | def forward(  # pylint: disable=arguments-differ
  class RobertaMLMModel (line 1130) | class RobertaMLMModel(RobertaPreTrainedModel):
    method __init__ (line 1131) | def __init__(self, config: RobertaConfig, encoder: RobertaModel) -> None:
    method forward (line 1138) | def forward(
  function create_model (line 1183) | def create_model(num_layers: int, num_heads: int, ff_dim: int, h_dim: int,
  function get_unique_identifier (line 1240) | def get_unique_identifier(length: int = 8) -> str:
  function create_experiment_dir (line 1250) | def create_experiment_dir(checkpoint_dir: pathlib.Path,
  function load_model_checkpoint (line 1330) | def load_model_checkpoint(
  function train (line 1390) | def train(

FILE: ai-framework/mxnet/mnist.py
  function transformer (line 43) | def transformer(data, label):
  function test (line 57) | def test(ctx):
  function train (line 68) | def train(epochs, ctx):

FILE: ai-framework/mxnet/mxnet_cnn_mnist.py
  function get_mnist (line 30) | def get_mnist(path='data'):
  class NeuralNetwork (line 63) | class NeuralNetwork(gluon.Block):
    method __init__ (line 64) | def __init__(self):
    method forward (line 80) | def forward(self, x):
  function test (line 93) | def test(ctx, val_data):
  function train (line 122) | def train(args, ctx, train_data, val_data):
  function plot (line 181) | def plot(train_acc_list, val_acc_list, output_path):
  function main (line 202) | def main(ctx):

FILE: ai-framework/mxnet/mxnet_mlp_mnist.py
  function transformer (line 41) | def transformer(data, label):
  function test (line 55) | def test(ctx):
  function train (line 66) | def train(epochs, ctx):

FILE: ai-framework/mxnet/oneflow_cnn_mnist.py
  class NeuralNetwork (line 40) | class NeuralNetwork(nn.Module):
    method __init__ (line 41) | def __init__(self):
    method forward (line 56) | def forward(self, x):
  function train (line 70) | def train(epoch, iter, model, loss_fn, optimizer):
  function test (line 101) | def test(iter, model, loss_fn):
  function plot (line 122) | def plot(loss_list, output_path):
  function main (line 138) | def main():

FILE: ai-framework/mxnet/oneflow_mlp_mnist.py
  class NeuralNetwork (line 51) | class NeuralNetwork(nn.Module):
    method __init__ (line 52) | def __init__(self):
    method forward (line 63) | def forward(self, x):
  function train (line 76) | def train(iter, model, loss_fn, optimizer):
  function test (line 97) | def test(iter, model, loss_fn):

FILE: ai-framework/oneflow/oneflow_mlp_mnist.py
  class NeuralNetwork (line 50) | class NeuralNetwork(nn.Module):
    method __init__ (line 51) | def __init__(self):
    method forward (line 62) | def forward(self, x):
  function train (line 75) | def train(iter, model, loss_fn, optimizer):
  function test (line 96) | def test(iter, model, loss_fn):

FILE: ai-framework/transformer-engine/mnist/main.py
  class Net (line 13) | class Net(nn.Module):
    method __init__ (line 14) | def __init__(self, use_te=False):
    method forward (line 28) | def forward(self, x):
  function train (line 46) | def train(args, model, device, train_loader, optimizer, epoch, use_fp8):
  function calibrate (line 68) | def calibrate(model, device, test_loader):
  function test (line 79) | def test(model, device, test_loader, use_fp8):
  function main (line 106) | def main():

FILE: ai-framework/transformer-engine/mnist/main_stat.py
  class Net (line 13) | class Net(nn.Module):
    method __init__ (line 14) | def __init__(self, use_te=False):
    method forward (line 28) | def forward(self, x):
  function train (line 47) | def train(args, model, device, train_loader, optimizer, epoch, use_fp8):
  function calibrate (line 77) | def calibrate(model, device, test_loader):
  function test (line 89) | def test(model, device, test_loader, use_fp8):
  function main (line 123) | def main():

FILE: docs/llm-base/distribution-parallelism/moe-parallel/paddle_moe.py
  class ExpertLayer (line 17) | class ExpertLayer(Layer):
    method __init__ (line 18) | def __init__(self, d_model, d_hidden, name=None):
    method forward (line 23) | def forward(self, x):
  class Model (line 47) | class Model(Layer):
    method __init__ (line 48) | def __init__(self, d_model, d_hidden, name=None):
    method forward (line 60) | def forward(self, x):

FILE: llm-algo/gpt2/hf_modeling_gpt2.py
  class GPT2Attention (line 5) | class GPT2Attention(nn.Module):
    method __init__ (line 7) | def __init__(self, config, is_cross_attention=False, layer_idx=None):
    method _attn (line 55) | def _attn(self, query, key, value, attention_mask=None, head_mask=None):
    method forward (line 104) | def forward(
  class GPT2MLP (line 163) | class GPT2MLP(nn.Module):
    method __init__ (line 164) | def __init__(self, intermediate_size, config):
    method forward (line 172) | def forward(self, hidden_states: Optional[Tuple[torch.FloatTensor]]) -...

FILE: llm-compression/quantization/llm-qat/cfd70ff/train.py
  function train (line 41) | def train():

FILE: llm-compression/quantization/llm-qat/cfd70ff/utils.py
  function get_logger (line 17) | def get_logger(logger_name):
  function safe_save_model_for_hf_trainer (line 39) | def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output...
  function get_local_rank (line 69) | def get_local_rank():

FILE: llm-compression/quantization/llm-qat/f4d873a/datautils.py
  function set_seed (line 26) | def set_seed(seed):
  function get_train_val_dataset (line 31) | def get_train_val_dataset(train_path, valid_path=None):
  class CustomJsonDataset (line 57) | class CustomJsonDataset(torch.utils.data.IterableDataset):
    method __init__ (line 58) | def __init__(self, dataset, tokenizer, block_size=1024):
    method __len__ (line 74) | def __len__(self):
    method __getitem__ (line 77) | def __getitem__(self, i):
    method __iter__ (line 80) | def __iter__(self):
    method tokenize_function (line 83) | def tokenize_function(self, examples):
    method group_texts (line 86) | def group_texts(self, examples):
  function jload (line 117) | def jload(filename, mode="r"):

FILE: llm-compression/quantization/llm-qat/f4d873a/train.py
  function train (line 42) | def train():

FILE: llm-data-engineering/sft-dataset/baichuan2_test.py
  function build_chat_input (line 6) | def build_chat_input(messages: List[dict]):

FILE: llm-data-engineering/sft-dataset/firefly-template.py
  class Template (line 6) | class Template:
  function register_template (line 19) | def register_template(template_name, system_format, user_format, assista...

FILE: llm-data-engineering/sft-dataset/jinja-demo.py
  function raise_exception (line 23) | def raise_exception(message):

FILE: llm-data-engineering/sft-dataset/jinja-llm-baichuan.py
  function raise_exception (line 23) | def raise_exception(message):

FILE: llm-data-engineering/sft-dataset/jinja-llm-baichuan2.py
  function raise_exception (line 48) | def raise_exception(message):

FILE: llm-data-engineering/sft-dataset/jinja-llm-bloom.py
  function raise_exception (line 26) | def raise_exception(message):

FILE: llm-data-engineering/sft-dataset/jinja-llm-chatglm3.py
  function raise_exception (line 26) | def raise_exception(message):

FILE: llm-data-engineering/sft-dataset/jinja-llm.py
  function raise_exception (line 29) | def raise_exception(message):

FILE: llm-eval/llm-performance/mindie/locust-lantency-throughput/hello.py
  class QuickstartUser (line 8) | class QuickstartUser(HttpUser):
    method hello_world (line 12) | def hello_world(self):

FILE: llm-eval/llm-performance/mindie/locust-lantency-throughput/llm-910b4-baichuan2-7b-2tp.py
  class QuickstartUser (line 16) | class QuickstartUser(HttpUser):
    method hello_world (line 20) | def hello_world(self):
    method on_stop (line 48) | def on_stop(self):

FILE: llm-eval/llm-performance/mindie/locust-lantency-throughput/llm-910b4-chatglm3-6b-2tp.py
  class QuickstartUser (line 16) | class QuickstartUser(HttpUser):
    method hello_world (line 20) | def hello_world(self):
    method on_stop (line 48) | def on_stop(self):

FILE: llm-eval/llm-performance/mindie/locust-lantency-throughput/llm-910b4-qwen-72b-8tp.py
  class QuickstartUser (line 15) | class QuickstartUser(HttpUser):
    method hello_world (line 19) | def hello_world(self):
    method on_stop (line 47) | def on_stop(self):

FILE: llm-eval/llm-performance/mindie/locust-lantency-throughput/llm-910b4-qwen1.5-4tp.py
  class QuickstartUser (line 14) | class QuickstartUser(HttpUser):
    method hello_world (line 18) | def hello_world(self):
    method on_stop (line 45) | def on_stop(self):

FILE: llm-eval/llm-performance/mindie/locust-lantency-throughput/示例.py
  class QuickstartUser (line 8) | class QuickstartUser(HttpUser):
    method hello_world (line 15) | def hello_world(self):
    method view_items (line 25) | def view_items(self):
    method on_start (line 34) | def on_start(self):
  class WebsiteUser (line 41) | class WebsiteUser(HttpLocust):

FILE: llm-eval/llm-performance/vllm/vllm-locust-qwen1.5-7b-long.py
  class QuickstartUser (line 17) | class QuickstartUser(HttpUser):
    method hello_world (line 21) | def hello_world(self):
    method on_stop (line 61) | def on_stop(self):

FILE: llm-inference/ascend/mindformers/mindsporelite-inference.py
  function pipeline_from_model_paths (line 26) | def pipeline_from_model_paths(args_, tokenizer):
  function pipeline_from_model_name (line 43) | def pipeline_from_model_name(args_, tokenizer):
  function pipeline_from_model_dir (line 59) | def pipeline_from_model_dir(args_, tokenizer):
  function pipeline_from_infer_config (line 76) | def pipeline_from_infer_config(args_, tokenizer):
  function get_tokenizer (line 110) | def get_tokenizer(model_name: str, tokenizer_path: str) -> Tokenizer:
  function build_prompt (line 126) | def build_prompt(inputs, model_name, prompt):
  function infer_main (line 146) | def infer_main(args_):
  function infer_stream_main (line 177) | def infer_stream_main(args_):

FILE: llm-inference/ascend/mindformers/mindsporelite-stat.py
  function pipeline_from_model_paths (line 33) | def pipeline_from_model_paths(args_, tokenizer):
  function pipeline_from_model_name (line 50) | def pipeline_from_model_name(args_, tokenizer):
  function pipeline_from_model_dir (line 66) | def pipeline_from_model_dir(args_, tokenizer):
  function pipeline_from_infer_config (line 83) | def pipeline_from_infer_config(args_, tokenizer):
  function get_tokenizer (line 117) | def get_tokenizer(model_name: str, tokenizer_path: str) -> Tokenizer:
  function build_prompt (line 133) | def build_prompt(inputs, model_name, prompt):
  function inference_stat (line 154) | def inference_stat(first_token_time_list, total_token_time_list, new_tok...
  function infer_main (line 205) | def infer_main(args_):
  function infer_stream_main (line 280) | def infer_stream_main(args_):

FILE: llm-inference/ascend/mindformers/text_generator_infer.py
  class BaseInputsOfInfer (line 23) | class BaseInputsOfInfer:
    method get_inputs (line 28) | def get_inputs(self, model: Model, **kwargs):
    method get_lite_tensor_list (line 31) | def get_lite_tensor_list(self, inputs, model):
  class CommonInputsOfInfer (line 43) | class CommonInputsOfInfer(BaseInputsOfInfer):
    method get_inputs (line 48) | def get_inputs(self, model: Model, input_ids=None, current_index=None,...
  class LlamaInputsOfInfer (line 62) | class LlamaInputsOfInfer(BaseInputsOfInfer):
    method get_inputs (line 67) | def get_inputs(self, model: Model, input_ids=None, current_index=None,...
    method get_lite_tensor_list (line 81) | def get_lite_tensor_list(self, inputs):
  class GLMInputsOfInfer (line 88) | class GLMInputsOfInfer(BaseInputsOfInfer):
    method get_masks_np (line 92) | def get_masks_np(self, input_ids, tokenizer: BaseTokenizer):
    method get_position_ids_np (line 102) | def get_position_ids_np(self, input_ids, mask_positions, tokenizer: Ba...
    method create_position_ids_np (line 126) | def create_position_ids_np(self, input_ids, tokenizer, position_encodi...
    method get_inputs (line 142) | def get_inputs(self, model: Model, input_ids=None, current_index=None,...
  class InputOfInfer (line 169) | class InputOfInfer:
    method get_inputs (line 187) | def get_inputs(cls, model_name: str, model, **kwargs):
  class TextGeneratorInfer (line 212) | class TextGeneratorInfer(BaseInfer):
    method infer (line 217) | def infer(self,
    method preprocess (line 281) | def preprocess(self, input_data, add_special_tokens=False, **kwargs):
    method postprocess (line 296) | def postprocess(self, predict_data, **kwargs):
    method _get_logits_processor (line 301) | def _get_logits_processor(self,
    method _merge_processor_list (line 319) | def _merge_processor_list(self,
    method _get_logits_warper (line 340) | def _get_logits_warper(self, generation_config: GenerationConfig):
    method generate (line 362) | def generate(self, input_ids, do_sample, top_k, top_p, temperature, re...
    method _inc_infer (line 514) | def _inc_infer(self, input_ids, current_index, valid_length, is_first_...
    method _full_infer (line 528) | def _full_infer(self, input_ids, current_index, is_npu_acceleration, *...
    method get_predict_inputs (line 539) | def get_predict_inputs(self, mode: Model, input_ids, current_index=None,

FILE: llm-inference/faster-transformer/bloom/firefly_lambada_1w_stat_token.py
  class TensorEncoder (line 23) | class TensorEncoder(json.JSONEncoder):
    method default (line 24) | def default(self, obj):
  class LambadaDataset (line 30) | class LambadaDataset(torch.utils.data.Dataset):
    method __init__ (line 33) | def __init__(self,
    method __len__ (line 50) | def __len__(self):
    method __getitem__ (line 53) | def __getitem__(self, idx):
  class Metric (line 62) | class Metric:
  class RequestAndResult (line 67) | class RequestAndResult:
    method asdict (line 79) | def asdict(self):
  class Timer (line 83) | class Timer:
    method __init__ (line 85) | def __init__(self):
    method start (line 89) | def start(self, tag='__default'):
    method stop (line 92) | def stop(self, tag='__default'):
    method elapsed_time_in_sec (line 99) | def elapsed_time_in_sec(self, tag='__default'):
    method reset (line 104) | def reset(self):
  function get_args (line 109) | def get_args():
  function get_model_and_tokenizer (line 175) | def get_model_and_tokenizer(args: argparse.Namespace):
  function split_inputs_and_targets (line 264) | def split_inputs_and_targets(entries: Dict[str, torch.LongTensor],
  function main (line 307) | def main():

FILE: llm-inference/faster-transformer/megatron-gpt2/gpt_summarization.py
  function main (line 21) | def main():

FILE: llm-inference/faster-transformer/megatron-gpt2/gpt_summarization_stat.py
  function main (line 24) | def main():

FILE: llm-inference/native-model/chatglm3-6b/cli_demo.py
  function build_prompt (line 18) | def build_prompt(history):
  function main (line 26) | def main():

FILE: llm-inference/triton/resnet50/client.py
  function rn50_preprocess (line 8) | def rn50_preprocess(img_path="img1.jpg"):

FILE: llm-inference/vllm/api_client.py
  function clear_line (line 11) | def clear_line(n: int = 1) -> None:
  function post_http_request (line 18) | def post_http_request(prompt: str,
  function get_streaming_response (line 35) | def get_streaming_response(response: requests.Response) -> Iterable[List...
  function get_response (line 45) | def get_response(response: requests.Response) -> List[str]:

FILE: llm-inference/web/fastapi/llm-qwen-mindspore-lite.py
  class InferParam (line 22) | class InferParam:
  function get_mindir_path (line 44) | def get_mindir_path(export_path='output', full=True):
  function create_mslite_pipeline (line 64) | def create_mslite_pipeline(args):
  function expand_input_list (line 109) | def expand_input_list(input_list, batch_size):
  function run_mslite_infer (line 118) | def run_mslite_infer(pipeline_task, prompt, args):
  class Item (line 162) | class Item(BaseModel):
  function generate_stream (line 173) | async def generate_stream(data:Item):
  function predict (line 192) | async def predict(request: Request):

FILE: llm-inference/web/flask/llm-qwen-mindspore-lite.py
  class InferParam (line 17) | class InferParam:
  function get_mindir_path (line 39) | def get_mindir_path(export_path='output', full=True):
  function create_mslite_pipeline (line 59) | def create_mslite_pipeline(args):
  function expand_input_list (line 104) | def expand_input_list(input_list, batch_size):
  function run_mslite_infer (line 113) | def run_mslite_infer(pipeline_task, prompt, args):
  function generate_stream (line 158) | def generate_stream():
  function predict (line 177) | def predict():

FILE: llm-localization/ascend/mindformers/chatglm/chat_glm.py
  function chat_glm (line 14) | def chat_glm():

FILE: llm-localization/ascend/mindformers/chatglm/merge_ckpt.py
  function merge_ckpt (line 6) | def merge_ckpt(args_opt):
  function clean_ckpt (line 27) | def clean_ckpt(ms_ckpt_path):

FILE: llm-localization/ascend/mindformers/chatglm/merge_ckpt_lora.py
  function merge_ckpt (line 6) | def merge_ckpt(args_opt):
  function clean_ckpt (line 27) | def clean_ckpt(ms_ckpt_path):

FILE: llm-localization/ascend/mindie/script/model-test.py
  class ModelTest (line 130) | class ModelTest:
    method __init__ (line 131) | def __init__(self, model_type, data_type, test_mode, model_name, data_...
    method create_instance (line 163) | def create_instance(cls):
    method run (line 168) | def run(self):
    method get_chip_num (line 175) | def get_chip_num(self):
    method set_fa_tokenizer_params (line 178) | def set_fa_tokenizer_params(self):
    method get_model (line 187) | def get_model(self, hardware_type, model_type, data_type):
    method prepare_environ (line 190) | def prepare_environ(self):
    method get_dataset_list (line 193) | def get_dataset_list(self):
    method clear (line 196) | def clear(self):
    method __prepare_and_check (line 201) | def __prepare_and_check(self):
    method __run (line 287) | def __run(self):
    method __run_performance (line 297) | def __run_performance(self):
    method __run_precision (line 471) | def __run_precision(self):
    method __run_simplified_dataset (line 535) | def __run_simplified_dataset(self):
    method __run_full_dataset_ceval_0_shot (line 587) | def __run_full_dataset_ceval_0_shot(self):
    method __run_full_dataset_ceval_5_shot (line 669) | def __run_full_dataset_ceval_5_shot(self):
    method __run_full_dataset_mmlu (line 789) | def __run_full_dataset_mmlu(self):
    method __run_full_dataset_gsm8k (line 889) | def __run_full_dataset_gsm8k(self):
    method __run_full_dataset_truthfulqa (line 977) | def __run_full_dataset_truthfulqa(self):
    method __run_full_dataset_boolq (line 1075) | def __run_full_dataset_boolq(self):
    method __run_full_dataset_humaneval (line 1153) | def __run_full_dataset_humaneval(self):
    method __compare_results (line 1224) | def __compare_results(self):
    method __compare_simplified_dataset_results (line 1237) | def __compare_simplified_dataset_results(self):
    method __compare_results_helper (line 1264) | def __compare_results_helper(self, type):
    method __compare_full_dataset_results (line 1370) | def __compare_full_dataset_results(self):
    method __get_rank (line 1441) | def __get_rank(self):
    method __get_device_type (line 1447) | def __get_device_type(self):
    method __patch_hf_transformers_utils (line 1456) | def __patch_hf_transformers_utils(self):
    method __setup_model_parallel (line 1481) | def __setup_model_parallel(self):
    method get_fa_tokenizer (line 1496) | def get_fa_tokenizer(self, **kwargs):
    method __npu_adapt (line 1499) | def __npu_adapt(self):
    method __save_result (line 1510) | def __save_result(self, result):
    method __get_log (line 1553) | def __get_log(self, type):
  function parse_args (line 1586) | def parse_args():
  function get_args (line 1626) | def get_args():

FILE: llm-localization/ascend/msmodelslim/llm_quant/baichuan2-w8a8.py
  function init_tokenizer (line 8) | def init_tokenizer(input_model_path:str):
  function init_model (line 16) | def init_model(input_model_path:str):
  function get_calib_dataset (line 24) | def get_calib_dataset(
  function load_dataset (line 37) | def load_dataset(calib_set_path = "./calib_set.json"):
  function parse_arguments (line 41) | def parse_arguments():
  function disable_quant_module (line 49) | def disable_quant_module(input_model_path:str):

FILE: llm-localization/ascend/msmodelslim/llm_quant/qwen1.5-72b-w8a16.py
  function load_tokenizer_and_model (line 9) | def load_tokenizer_and_model(fp16_path):
  function main (line 24) | def main(fp16_path, quant_save_path, calib_set_path):
  function parse_arguments (line 50) | def parse_arguments():

FILE: llm-localization/ascend/peft/finetune-lora.py
  function preprocess_function (line 68) | def preprocess_function(examples):
  function test_preprocess_function (line 117) | def test_preprocess_function(examples):

FILE: llm-localization/ascend/pytorch/llm-lora.py
  function preprocess_function (line 68) | def preprocess_function(examples):
  function test_preprocess_function (line 117) | def test_preprocess_function(examples):

FILE: llm-localization/ascend/standford-alpaca/train.py
  class ModelArguments (line 32) | class ModelArguments:
  class DataArguments (line 37) | class DataArguments:
  class TrainingArguments (line 42) | class TrainingArguments(transformers.TrainingArguments):
  function smart_tokenizer_and_embedding_resize (line 51) | def smart_tokenizer_and_embedding_resize(
  function _tokenize_fn (line 74) | def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrai...
  function preprocess (line 98) | def preprocess(
  class SupervisedDataset (line 113) | class SupervisedDataset(Dataset):
    method __init__ (line 116) | def __init__(self, data_path: str, tokenizer: transformers.PreTrainedT...
    method __len__ (line 135) | def __len__(self):
    method __getitem__ (line 138) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  class DataCollatorForSupervisedDataset (line 143) | class DataCollatorForSupervisedDataset(object):
    method __call__ (line 148) | def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
  function make_supervised_data_module (line 161) | def make_supervised_data_module(tokenizer: transformers.PreTrainedTokeni...
  function train (line 168) | def train():

FILE: llm-localization/ascend/standford-alpaca/utils.py
  function _make_w_io_base (line 15) | def _make_w_io_base(f, mode: str):
  function _make_r_io_base (line 24) | def _make_r_io_base(f, mode: str):
  function jdump (line 30) | def jdump(obj, f, mode="w", indent=4, default=str):
  function jload (line 50) | def jload(f, mode="r"):

FILE: llm-tools/base-profiler.py
  class MyModule (line 10) | class MyModule(nn.Module):
    method __init__ (line 11) | def __init__(self, in_features: int, out_features: int, bias: bool = T...
    method forward (line 15) | def forward(self, input, mask):
  class MyModule (line 151) | class MyModule(nn.Module):
    method __init__ (line 152) | def __init__(self, in_features: int, out_features: int, bias: bool = T...
    method forward (line 156) | def forward(self, input, mask):

FILE: llm-tools/tensorboard-profiler.py
  function train (line 37) | def train(data):

FILE: llm-train/alpa/train/pipeshard_parallelism.py
  class MLPModel (line 63) | class MLPModel(nn.Module):
    method __call__ (line 67) | def __call__(self, x):
  function train_step (line 102) | def train_step(state, batch):
  class ManualPipelineMLPModel (line 128) | class ManualPipelineMLPModel(nn.Module):
    method __call__ (line 132) | def __call__(self, x):
  function manual_pipeline_train_step (line 161) | def manual_pipeline_train_step(state, batch):
  function auto_pipeline_train_step (line 224) | def auto_pipeline_train_step(state, batch):

FILE: llm-train/alpaca-lora/export_state_dict_checkpoint.py
  function permute (line 64) | def permute(w):
  function unpermute (line 72) | def unpermute(w):
  function translate_state_dict_key (line 80) | def translate_state_dict_key(k):  # noqa: C901

FILE: llm-train/alpaca-lora/finetune.py
  function train (line 28) | def train(

FILE: llm-train/alpaca-lora/finetune_metrics_epoch.py
  function train (line 28) | def train(

FILE: llm-train/alpaca-lora/generate.py
  function main (line 25) | def main(

FILE: llm-train/alpaca/train.py
  class ModelArguments (line 33) | class ModelArguments:
  class DataArguments (line 39) | class DataArguments:
  class TrainingArguments (line 44) | class TrainingArguments(transformers.TrainingArguments):
  function safe_save_model_for_hf_trainer (line 53) | def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output...
  function smart_tokenizer_and_embedding_resize (line 62) | def smart_tokenizer_and_embedding_resize(
  function _tokenize_fn (line 85) | def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrai...
  function preprocess (line 109) | def preprocess(
  class SupervisedDataset (line 124) | class SupervisedDataset(Dataset):
    method __init__ (line 127) | def __init__(self, data_path: str, tokenizer: transformers.PreTrainedT...
    method __len__ (line 146) | def __len__(self):
    method __getitem__ (line 149) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  class DataCollatorForSupervisedDataset (line 154) | class DataCollatorForSupervisedDataset(object):
    method __call__ (line 159) | def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
  function make_supervised_data_module (line 172) | def make_supervised_data_module(tokenizer: transformers.PreTrainedTokeni...
  function train (line 179) | def train():

FILE: llm-train/alpaca/train_ddp.py
  class ModelArguments (line 34) | class ModelArguments:
  class DataArguments (line 40) | class DataArguments:
  class TrainingArguments (line 45) | class TrainingArguments(transformers.TrainingArguments):
  function safe_save_model_for_hf_trainer (line 54) | def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output...
  function smart_tokenizer_and_embedding_resize (line 63) | def smart_tokenizer_and_embedding_resize(
  function _tokenize_fn (line 86) | def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrai...
  function preprocess (line 110) | def preprocess(
  class SupervisedDataset (line 125) | class SupervisedDataset(Dataset):
    method __init__ (line 128) | def __init__(self, data_path: str, tokenizer: transformers.PreTrainedT...
    method __len__ (line 147) | def __len__(self):
    method __getitem__ (line 150) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  class DataCollatorForSupervisedDataset (line 155) | class DataCollatorForSupervisedDataset(object):
    method __call__ (line 160) | def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
  function make_supervised_data_module (line 173) | def make_supervised_data_module(tokenizer: transformers.PreTrainedTokeni...
  function train (line 180) | def train():

FILE: llm-train/chatglm-lora/finetune.py
  class FinetuneArguments (line 19) | class FinetuneArguments:
  class CastOutputToFloat (line 25) | class CastOutputToFloat(nn.Sequential):
    method forward (line 26) | def forward(self, x):
  function data_collator (line 30) | def data_collator(features: list) -> dict:
  class ModifiedTrainer (line 53) | class ModifiedTrainer(Trainer):
    method compute_loss (line 54) | def compute_loss(self, model, inputs, return_outputs=False):
    method save_model (line 60) | def save_model(self, output_dir=None, _internal_call=False):
  function main (line 71) | def main():

FILE: llm-train/chatglm-lora/finetune_ddp.py
  class FinetuneArguments (line 30) | class FinetuneArguments:
  class CastOutputToFloat (line 36) | class CastOutputToFloat(nn.Sequential):
    method forward (line 37) | def forward(self, x):
  function data_collator (line 41) | def data_collator(features: list) -> dict:
  class ModifiedTrainer (line 67) | class ModifiedTrainer(Trainer):
    method compute_loss (line 68) | def compute_loss(self, model, inputs, return_outputs=False):
    method save_model (line 74) | def save_model(self, output_dir=None, _internal_call=False):
  function main (line 85) | def main():

FILE: llm-train/chatglm/main.py
  function main (line 38) | def main():
  function _mp_fn (line 413) | def _mp_fn(index):

FILE: llm-train/chinese-llama-alpaca/inference_hf.py
  function generate_prompt (line 44) | def generate_prompt(instruction, input=None):

FILE: llm-train/chinese-llama-alpaca/merge_llama_with_chinese_lora.py
  function transpose (line 62) | def transpose(weight, fan_in_fan_out):
  function translate_state_dict_key (line 67) | def translate_state_dict_key(k):
  function unpermute (line 105) | def unpermute(w):
  function save_shards (line 111) | def save_shards(model_sd, num_shards: int):

FILE: llm-train/chinese-llama-alpaca/run_clm_pt_with_peft.py
  function accuracy (line 61) | def accuracy(predictions, references, normalize=True, sample_weight=None):
  function compute_metrics (line 67) | def compute_metrics(eval_preds):
  function preprocess_logits_for_metrics (line 75) | def preprocess_logits_for_metrics(logits, labels):
  function fault_tolerance_data_collator (line 83) | def fault_tolerance_data_collator(features: List) -> Dict[str, Any]:
  class GroupTextsBuilder (line 129) | class GroupTextsBuilder:
    method __init__ (line 130) | def __init__(self,max_seq_length):
    method __call__ (line 132) | def __call__(self, examples):
  class ModelArguments (line 153) | class ModelArguments:
    method __post_init__ (line 225) | def __post_init__(self):
  class DataTrainingArguments (line 233) | class DataTrainingArguments:
    method __post_init__ (line 296) | def __post_init__(self):
  class MyTrainingArguments (line 301) | class MyTrainingArguments(TrainingArguments):
  function main (line 312) | def main():

FILE: llm-train/chinese-llama-alpaca/run_clm_sft_with_peft.py
  class ModelArguments (line 73) | class ModelArguments:
    method __post_init__ (line 142) | def __post_init__(self):
  class DataTrainingArguments (line 150) | class DataTrainingArguments:
  class MyTrainingArguments (line 186) | class MyTrainingArguments(TrainingArguments):
  function main (line 197) | def main():
  function smart_tokenizer_and_embedding_resize (line 433) | def smart_tokenizer_and_embedding_resize(

FILE: llm-train/deepspeedchat/training/utils/data/raw_datasets.py
  class PromptRawDataset (line 12) | class PromptRawDataset(object):
    method __init__ (line 14) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 22) | def get_train_data(self):
    method get_eval_data (line 25) | def get_eval_data(self):
    method get_prompt (line 29) | def get_prompt(self, sample):
    method get_chosen (line 33) | def get_chosen(self, sample):
    method get_rejected (line 38) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 41) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 44) | def get_prompt_and_rejected(self, sample):
  class DahoasRmstaticDataset (line 49) | class DahoasRmstaticDataset(PromptRawDataset):
    method __init__ (line 51) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 58) | def get_train_data(self):
    method get_eval_data (line 61) | def get_eval_data(self):
    method get_prompt (line 64) | def get_prompt(self, sample):
    method get_chosen (line 67) | def get_chosen(self, sample):
    method get_rejected (line 70) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 73) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 76) | def get_prompt_and_rejected(self, sample):
  class DahoasFullhhrlhfDataset (line 81) | class DahoasFullhhrlhfDataset(PromptRawDataset):
    method __init__ (line 83) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 91) | def get_train_data(self):
    method get_eval_data (line 94) | def get_eval_data(self):
    method get_prompt (line 97) | def get_prompt(self, sample):
    method get_chosen (line 100) | def get_chosen(self, sample):
    method get_rejected (line 103) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 106) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 109) | def get_prompt_and_rejected(self, sample):
  class DahoasSyntheticinstructgptjpairwiseDataset (line 114) | class DahoasSyntheticinstructgptjpairwiseDataset(PromptRawDataset):
    method __init__ (line 116) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_eval_data (line 133) | def get_eval_data(self):
    method get_prompt (line 143) | def get_prompt(self, sample):
    method get_chosen (line 146) | def get_chosen(self, sample):
    method get_rejected (line 149) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 152) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 155) | def get_prompt_and_rejected(self, sample):
  class YitingxieRlhfrewarddatasetsDataset (line 161) | class YitingxieRlhfrewarddatasetsDataset(PromptRawDataset):
    method __init__ (line 163) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 170) | def get_train_data(self):
    method get_eval_data (line 173) | def get_eval_data(self):
    method get_prompt (line 176) | def get_prompt(self, sample):
    method get_chosen (line 179) | def get_chosen(self, sample):
    method get_rejected (line 182) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 185) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 188) | def get_prompt_and_rejected(self, sample):
  class OpenaiWebgptcomparisonsDataset (line 193) | class OpenaiWebgptcomparisonsDataset(PromptRawDataset):
    method __init__ (line 195) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 200) | def get_train_data(self):
    method get_eval_data (line 210) | def get_eval_data(self):
    method get_prompt (line 220) | def get_prompt(self, sample):
    method get_chosen (line 223) | def get_chosen(self, sample):
    method get_rejected (line 235) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 244) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 254) | def get_prompt_and_rejected(self, sample):
  class StanfordnlpSHPDataset (line 266) | class StanfordnlpSHPDataset(PromptRawDataset):
    method __init__ (line 268) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 273) | def get_train_data(self):
    method get_eval_data (line 276) | def get_eval_data(self):
    method get_prompt (line 279) | def get_prompt(self, sample):
    method get_chosen (line 282) | def get_chosen(self, sample):
    method get_rejected (line 289) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 296) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 303) | def get_prompt_and_rejected(self, sample):
  class Wangrui6ZhihuKOLDataset (line 312) | class Wangrui6ZhihuKOLDataset(PromptRawDataset):
    method __init__ (line 314) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 319) | def get_train_data(self):
    method get_eval_data (line 329) | def get_eval_data(self):
    method get_prompt (line 339) | def get_prompt(self, sample):
    method get_chosen (line 344) | def get_chosen(self, sample):
    method get_rejected (line 349) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 355) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 361) | def get_prompt_and_rejected(self, sample):
  class CohereMiraclzhqueries2212Dataset (line 369) | class CohereMiraclzhqueries2212Dataset(PromptRawDataset):
    method __init__ (line 371) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 376) | def get_train_data(self):
    method get_eval_data (line 379) | def get_eval_data(self):
    method get_prompt (line 382) | def get_prompt(self, sample):
    method get_chosen (line 385) | def get_chosen(self, sample):
    method get_rejected (line 388) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 391) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 395) | def get_prompt_and_rejected(self, sample):
  class HelloSimpleAIHC3ChineseDataset (line 401) | class HelloSimpleAIHC3ChineseDataset(PromptRawDataset):
    method __init__ (line 403) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 408) | def get_train_data(self):
    method get_eval_data (line 418) | def get_eval_data(self):
    method get_prompt (line 428) | def get_prompt(self, sample):
    method get_chosen (line 433) | def get_chosen(self, sample):
    method get_rejected (line 438) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 444) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 451) | def get_prompt_and_rejected(self, sample):
  class MkqaChineseDataset (line 459) | class MkqaChineseDataset(PromptRawDataset):
    method __init__ (line 461) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 466) | def get_train_data(self):
    method get_eval_data (line 476) | def get_eval_data(self):
    method get_prompt (line 486) | def get_prompt(self, sample):
    method get_chosen (line 491) | def get_chosen(self, sample):
    method get_rejected (line 496) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 502) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 510) | def get_prompt_and_rejected(self, sample):
  class MkqaJapaneseDataset (line 518) | class MkqaJapaneseDataset(PromptRawDataset):
    method __init__ (line 520) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 525) | def get_train_data(self):
    method get_eval_data (line 535) | def get_eval_data(self):
    method get_prompt (line 545) | def get_prompt(self, sample):
    method get_chosen (line 550) | def get_chosen(self, sample):
    method get_rejected (line 555) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 561) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 568) | def get_prompt_and_rejected(self, sample):
  class CohereMiracljaqueries2212Dataset (line 576) | class CohereMiracljaqueries2212Dataset(PromptRawDataset):
    method __init__ (line 578) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 583) | def get_train_data(self):
    method get_eval_data (line 586) | def get_eval_data(self):
    method get_prompt (line 589) | def get_prompt(self, sample):
    method get_chosen (line 592) | def get_chosen(self, sample):
    method get_rejected (line 595) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 598) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 602) | def get_prompt_and_rejected(self, sample):
  class LmqgQgjaquadDataset (line 608) | class LmqgQgjaquadDataset(PromptRawDataset):
    method __init__ (line 610) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 615) | def get_train_data(self):
    method get_eval_data (line 618) | def get_eval_data(self):
    method get_prompt (line 621) | def get_prompt(self, sample):
    method get_chosen (line 624) | def get_chosen(self, sample):
    method get_rejected (line 627) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 633) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 637) | def get_prompt_and_rejected(self, sample):
  class LmqgQagjaquadDataset (line 645) | class LmqgQagjaquadDataset(PromptRawDataset):
    method __init__ (line 647) | def __init__(self, output_path, seed, local_rank, dataset_name):
    method get_train_data (line 652) | def get_train_data(self):
    method get_eval_data (line 655) | def get_eval_data(self):
    method get_prompt (line 658) | def get_prompt(self, sample):
    method get_chosen (line 661) | def get_chosen(self, sample):
    method get_rejected (line 664) | def get_rejected(self, sample):
    method get_prompt_and_chosen (line 670) | def get_prompt_and_chosen(self, sample):
    method get_prompt_and_rejected (line 674) | def get_prompt_and_rejected(self, sample):

FILE: llm-train/galore/torchrun_main.py
  function parse_args (line 34) | def parse_args(args):
  function evaluate_model (line 85) | def evaluate_model(model, preprocess_batched, pad_idx, global_rank, worl...
  function main (line 134) | def main(args):

FILE: llm-train/megatron/gpt2/data/cMinhash.cpp
  function __Pyx_call_destructor (line 243) | void __Pyx_call_destructor(T& x) {
  class __Pyx_FakeReference (line 247) | class __Pyx_FakeReference {
    method __Pyx_FakeReference (line 249) | __Pyx_FakeReference() : ptr(NULL) { }
    method __Pyx_FakeReference (line 250) | __Pyx_FakeReference(const T& ref) : ptr(const_cast<T*>(&ref)) { }
    method T (line 251) | T *operator->() { return ptr; }
  function CYTHON_INLINE (line 264) | static CYTHON_INLINE float __PYX_NAN() {
  function CYTHON_INLINE (line 395) | static CYTHON_INLINE size_t __Pyx_Py_UNICODE_strlen(const Py_UNICODE *u)
  function __Pyx_init_sys_getdefaultencoding_params (line 428) | static int __Pyx_init_sys_getdefaultencoding_params(void) {
  function __Pyx_init_sys_getdefaultencoding_params (line 478) | static int __Pyx_init_sys_getdefaultencoding_params(void) {
  type __Pyx_StructField_ (line 553) | struct __Pyx_StructField_
  type __Pyx_StructField_ (line 557) | struct __Pyx_StructField_
  type __Pyx_StructField_ (line 565) | struct __Pyx_StructField_ {
  type __pyx_memoryview_obj (line 588) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 590) | struct __pyx_memoryview_obj
  type __pyx_array_obj (line 859) | struct __pyx_array_obj
  type __pyx_MemviewEnum_obj (line 860) | struct __pyx_MemviewEnum_obj
  type __pyx_memoryview_obj (line 861) | struct __pyx_memoryview_obj
  type __pyx_memoryviewslice_obj (line 862) | struct __pyx_memoryviewslice_obj
    type __pyx_memoryview_obj (line 969) | struct __pyx_memoryview_obj
  type __pyx_array_obj (line 907) | struct __pyx_array_obj {
  type __pyx_MemviewEnum_obj (line 932) | struct __pyx_MemviewEnum_obj {
  type __pyx_memoryview_obj (line 945) | struct __pyx_memoryview_obj {
  type __pyx_memoryviewslice_obj (line 968) | struct __pyx_memoryviewslice_obj {
    type __pyx_memoryview_obj (line 969) | struct __pyx_memoryview_obj
  type __pyx_vtabstruct_array (line 986) | struct __pyx_vtabstruct_array {
    type __pyx_array_obj (line 987) | struct __pyx_array_obj
  type __pyx_vtabstruct_array (line 989) | struct __pyx_vtabstruct_array
    type __pyx_array_obj (line 987) | struct __pyx_array_obj
  type __pyx_vtabstruct_memoryview (line 1000) | struct __pyx_vtabstruct_memoryview {
    type __pyx_memoryview_obj (line 1001) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1002) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1003) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1004) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1005) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1006) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1007) | struct __pyx_memoryview_obj
  type __pyx_vtabstruct_memoryview (line 1009) | struct __pyx_vtabstruct_memoryview
    type __pyx_memoryview_obj (line 1001) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1002) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1003) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1004) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1005) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1006) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1007) | struct __pyx_memoryview_obj
  type __pyx_vtabstruct__memoryviewslice (line 1020) | struct __pyx_vtabstruct__memoryviewslice {
    type __pyx_vtabstruct_memoryview (line 1021) | struct __pyx_vtabstruct_memoryview
  type __pyx_vtabstruct__memoryviewslice (line 1023) | struct __pyx_vtabstruct__memoryviewslice
    type __pyx_vtabstruct_memoryview (line 1021) | struct __pyx_vtabstruct_memoryview
  function CYTHON_INLINE (line 1091) | static CYTHON_INLINE PyObject* __Pyx_PyObject_GetAttrStr(PyObject* obj, ...
  type __pyx_memoryview_obj (line 1158) | struct __pyx_memoryview_obj
  function PyObject (line 1202) | static PyObject *__Pyx_PyDict_GetItem(PyObject *d, PyObject* key) {
  type __pyx_array_obj (line 1254) | struct __pyx_array_obj
  function CYTHON_INLINE (line 1327) | static CYTHON_INLINE int __Pyx_ListComp_Append(PyObject* list, PyObject*...
  function CYTHON_INLINE (line 1351) | static CYTHON_INLINE int __Pyx_PyList_Extend(PyObject* L, PyObject* v) {
  function CYTHON_INLINE (line 1365) | static CYTHON_INLINE int __Pyx_PyList_Append(PyObject* list, PyObject* x) {
  type __Pyx_CodeObjectCache (line 1412) | struct __Pyx_CodeObjectCache {
  type __Pyx_CodeObjectCache (line 1417) | struct __Pyx_CodeObjectCache
  type NPY_TYPES (line 1578) | enum NPY_TYPES
  type __pyx_array_obj (line 1640) | struct __pyx_array_obj
  type __pyx_memoryview_obj (line 1641) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1642) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1643) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1644) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1645) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1646) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1647) | struct __pyx_memoryview_obj
  type __pyx_memoryviewslice_obj (line 1648) | struct __pyx_memoryviewslice_obj
    type __pyx_memoryview_obj (line 969) | struct __pyx_memoryview_obj
  type __pyx_memoryviewslice_obj (line 1649) | struct __pyx_memoryviewslice_obj
    type __pyx_memoryview_obj (line 969) | struct __pyx_memoryview_obj
  type __pyx_array_obj (line 1698) | struct __pyx_array_obj
  type __pyx_memoryview_obj (line 1704) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1709) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1710) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1711) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1712) | struct __pyx_memoryview_obj
  type __pyx_array_obj (line 1939) | struct __pyx_array_obj
  type __pyx_array_obj (line 1940) | struct __pyx_array_obj
  type __pyx_array_obj (line 1941) | struct __pyx_array_obj
  type __pyx_array_obj (line 1942) | struct __pyx_array_obj
  type __pyx_array_obj (line 1943) | struct __pyx_array_obj
  type __pyx_array_obj (line 1944) | struct __pyx_array_obj
  type __pyx_array_obj (line 1945) | struct __pyx_array_obj
  type __pyx_MemviewEnum_obj (line 1946) | struct __pyx_MemviewEnum_obj
  type __pyx_MemviewEnum_obj (line 1947) | struct __pyx_MemviewEnum_obj
  type __pyx_memoryview_obj (line 1948) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1949) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1950) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1951) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1952) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1953) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1954) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1955) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1956) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1957) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1958) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1959) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1960) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1961) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1962) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1963) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1964) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1965) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1966) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1967) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 1968) | struct __pyx_memoryview_obj
  type __pyx_memoryviewslice_obj (line 1969) | struct __pyx_memoryviewslice_obj
    type __pyx_memoryview_obj (line 969) | struct __pyx_memoryview_obj
  type __pyx_memoryviewslice_obj (line 1970) | struct __pyx_memoryviewslice_obj
    type __pyx_memoryview_obj (line 969) | struct __pyx_memoryview_obj
  function PyObject (line 2019) | static PyObject *__pyx_pw_3lsh_8cMinhash_1minhash_64(PyObject *__pyx_sel...
  function PyObject (line 2098) | static PyObject *__pyx_pf_3lsh_8cMinhash_minhash_64(CYTHON_UNUSED PyObje...
  function PyObject (line 2420) | static PyObject *__pyx_pw_3lsh_8cMinhash_3minhash_32(PyObject *__pyx_sel...
  function PyObject (line 2499) | static PyObject *__pyx_pf_3lsh_8cMinhash_2minhash_32(CYTHON_UNUSED PyObj...
  function CYTHON_UNUSED (line 2816) | static CYTHON_UNUSED int __pyx_pw_5numpy_7ndarray_1__getbuffer__(PyObjec...
  function __pyx_pf_5numpy_7ndarray___getbuffer__ (line 2827) | static int __pyx_pf_5numpy_7ndarray___getbuffer__(PyArrayObject *__pyx_v...
  function CYTHON_UNUSED (line 3688) | static CYTHON_UNUSED void __pyx_pw_5numpy_7ndarray_3__releasebuffer__(Py...
  function __pyx_pf_5numpy_7ndarray_2__releasebuffer__ (line 3697) | static void __pyx_pf_5numpy_7ndarray_2__releasebuffer__(PyArrayObject *_...
  function CYTHON_INLINE (line 3778) | static CYTHON_INLINE PyObject *__pyx_f_5numpy_PyArray_MultiIterNew1(PyOb...
  function CYTHON_INLINE (line 3825) | static CYTHON_INLINE PyObject *__pyx_f_5numpy_PyArray_MultiIterNew2(PyOb...
  function CYTHON_INLINE (line 3872) | static CYTHON_INLINE PyObject *__pyx_f_5numpy_PyArray_MultiIterNew3(PyOb...
  function CYTHON_INLINE (line 3919) | static CYTHON_INLINE PyObject *__pyx_f_5numpy_PyArray_MultiIterNew4(PyOb...
  function CYTHON_INLINE (line 3966) | static CYTHON_INLINE PyObject *__pyx_f_5numpy_PyArray_MultiIterNew5(PyOb...
  function CYTHON_INLINE (line 4013) | static CYTHON_INLINE char *__pyx_f_5numpy__util_dtypestring(PyArray_Desc...
  function CYTHON_INLINE (line 4768) | static CYTHON_INLINE void __pyx_f_5numpy_set_array_base(PyArrayObject *_...
  function CYTHON_INLINE (line 4864) | static CYTHON_INLINE PyObject *__pyx_f_5numpy_get_array_base(PyArrayObje...
  function __pyx_array___cinit__ (line 4938) | static int __pyx_array___cinit__(PyObject *__pyx_v_self, PyObject *__pyx...
  function __pyx_array___pyx_pf_15View_dot_MemoryView_5array___cinit__ (line 5052) | static int __pyx_array___pyx_pf_15View_dot_MemoryView_5array___cinit__(s...
  function CYTHON_UNUSED (line 5664) | static CYTHON_UNUSED int __pyx_array_getbuffer(PyObject *__pyx_v_self, P...
  function __pyx_array___pyx_pf_15View_dot_MemoryView_5array_2__getbuffer__ (line 5675) | static int __pyx_array___pyx_pf_15View_dot_MemoryView_5array_2__getbuffe...
  function __pyx_array___dealloc__ (line 5966) | static void __pyx_array___dealloc__(PyObject *__pyx_v_self) {
  function __pyx_array___pyx_pf_15View_dot_MemoryView_5array_4__dealloc__ (line 5975) | static void __pyx_array___pyx_pf_15View_dot_MemoryView_5array_4__dealloc...
  function PyObject (line 6097) | static PyObject *__pyx_pw_15View_dot_MemoryView_5array_7memview_1__get__...
  function PyObject (line 6108) | static PyObject *__pyx_pf_15View_dot_MemoryView_5array_7memview___get__(...
  function PyObject (line 6155) | static PyObject *__pyx_array_get_memview(struct __pyx_array_obj *__pyx_v...
  function PyObject (line 6234) | static PyObject *__pyx_array___getattr__(PyObject *__pyx_v_self, PyObjec...
  function PyObject (line 6245) | static PyObject *__pyx_array___pyx_pf_15View_dot_MemoryView_5array_6__ge...
  function PyObject (line 6299) | static PyObject *__pyx_array___getitem__(PyObject *__pyx_v_self, PyObjec...
  function PyObject (line 6310) | static PyObject *__pyx_array___pyx_pf_15View_dot_MemoryView_5array_8__ge...
  function __pyx_array___setitem__ (line 6364) | static int __pyx_array___setitem__(PyObject *__pyx_v_self, PyObject *__p...
  function __pyx_array___pyx_pf_15View_dot_MemoryView_5array_10__setitem__ (line 6375) | static int __pyx_array___pyx_pf_15View_dot_MemoryView_5array_10__setitem...
  type __pyx_array_obj (line 6421) | struct __pyx_array_obj
  type __pyx_array_obj (line 6422) | struct __pyx_array_obj
  type __pyx_array_obj (line 6423) | struct __pyx_array_obj
  type __pyx_array_obj (line 6472) | struct __pyx_array_obj
  type __pyx_array_obj (line 6536) | struct __pyx_array_obj
  function __pyx_MemviewEnum___init__ (line 6595) | static int __pyx_MemviewEnum___init__(PyObject *__pyx_v_self, PyObject *...
  function __pyx_MemviewEnum___pyx_pf_15View_dot_MemoryView_4Enum___init__ (line 6642) | static int __pyx_MemviewEnum___pyx_pf_15View_dot_MemoryView_4Enum___init...
  function PyObject (line 6684) | static PyObject *__pyx_MemviewEnum___repr__(PyObject *__pyx_v_self) {
  function PyObject (line 6695) | static PyObject *__pyx_MemviewEnum___pyx_pf_15View_dot_MemoryView_4Enum_...
  function __pyx_memoryview___cinit__ (line 6820) | static int __pyx_memoryview___cinit__(PyObject *__pyx_v_self, PyObject *...
  function __pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memoryview___cinit__ (line 6891) | static int __pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memoryview_...
  function __pyx_memoryview___dealloc__ (line 7187) | static void __pyx_memoryview___dealloc__(PyObject *__pyx_v_self) {
  function __pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memoryview_2__dealloc__ (line 7196) | static void __pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memoryview...
  type __pyx_memoryview_obj (line 7375) | struct __pyx_memoryview_obj
  function PyObject (line 7512) | static PyObject *__pyx_memoryview___getitem__(PyObject *__pyx_v_self, Py...
  function PyObject (line 7523) | static PyObject *__pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memor...
  function __pyx_memoryview___setitem__ (line 7702) | static int __pyx_memoryview___setitem__(PyObject *__pyx_v_self, PyObject...
  function __pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memoryview_6__setitem__ (line 7713) | static int __pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memoryview_...
  function PyObject (line 7897) | static PyObject *__pyx_memoryview_is_slice(struct __pyx_memoryview_obj *...
  function PyObject (line 8108) | static PyObject *__pyx_memoryview_setitem_slice_assignment(struct __pyx_...
  function PyObject (line 8191) | static PyObject *__pyx_memoryview_setitem_slice_assign_scalar(struct __p...
  function PyObject (line 8477) | static PyObject *__pyx_memoryview_setitem_indexed(struct __pyx_memoryvie...
  function PyObject (line 8535) | static PyObject *__pyx_memoryview_convert_item_to_object(struct __pyx_me...
  function PyObject (line 8790) | static PyObject *__pyx_memoryview_assign_item_from_object(struct __pyx_m...
  function CYTHON_UNUSED (line 9007) | static CYTHON_UNUSED int __pyx_memoryview_getbuffer(PyObject *__pyx_v_se...
  function __pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memoryview_8__getbuffer__ (line 9018) | static int __pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memoryview_...
  function PyObject (line 9291) | static PyObject *__pyx_pw_15View_dot_MemoryView_10memoryview_1T_1__get__...
  function PyObject (line 9302) | static PyObject *__pyx_pf_15View_dot_MemoryView_10memoryview_1T___get__(...
  function PyObject (line 9374) | static PyObject *__pyx_pw_15View_dot_MemoryView_10memoryview_4base_1__ge...
  function PyObject (line 9385) | static PyObject *__pyx_pf_15View_dot_MemoryView_10memoryview_4base___get...
  function PyObject (line 9427) | static PyObject *__pyx_pw_15View_dot_MemoryView_10memoryview_5shape_1__g...
  function PyObject (line 9438) | static PyObject *__pyx_pf_15View_dot_MemoryView_10memoryview_5shape___ge...
  function PyObject (line 9505) | static PyObject *__pyx_pw_15View_dot_MemoryView_10memoryview_7strides_1_...
  function PyObject (line 9516) | static PyObject *__pyx_pf_15View_dot_MemoryView_10memoryview_7strides___...
  function PyObject (line 9616) | static PyObject *__pyx_pw_15View_dot_MemoryView_10memoryview_10suboffset...
  function PyObject (line 9627) | static PyObject *__pyx_pf_15View_dot_MemoryView_10memoryview_10suboffset...
  function PyObject (line 9731) | static PyObject *__pyx_pw_15View_dot_MemoryView_10memoryview_4ndim_1__ge...
  function PyObject (line 9742) | static PyObject *__pyx_pf_15View_dot_MemoryView_10memoryview_4ndim___get...
  function PyObject (line 9791) | static PyObject *__pyx_pw_15View_dot_MemoryView_10memoryview_8itemsize_1...
  function PyObject (line 9802) | static PyObject *__pyx_pf_15View_dot_MemoryView_10memoryview_8itemsize__...
  function PyObject (line 9851) | static PyObject *__pyx_pw_15View_dot_MemoryView_10memoryview_6nbytes_1__...
  function PyObject (line 9862) | static PyObject *__pyx_pf_15View_dot_MemoryView_10memoryview_6nbytes___g...
  function PyObject (line 9921) | static PyObject *__pyx_pw_15View_dot_MemoryView_10memoryview_4size_1__ge...
  function PyObject (line 9932) | static PyObject *__pyx_pf_15View_dot_MemoryView_10memoryview_4size___get...
  function Py_ssize_t (line 10059) | static Py_ssize_t __pyx_memoryview___len__(PyObject *__pyx_v_self) {
  function Py_ssize_t (line 10070) | static Py_ssize_t __pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memo...
  function PyObject (line 10139) | static PyObject *__pyx_memoryview___repr__(PyObject *__pyx_v_self) {
  function PyObject (line 10150) | static PyObject *__pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memor...
  function PyObject (line 10244) | static PyObject *__pyx_memoryview___str__(PyObject *__pyx_v_self) {
  function PyObject (line 10255) | static PyObject *__pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memor...
  function PyObject (line 10320) | static PyObject *__pyx_memoryview_is_c_contig(PyObject *__pyx_v_self, CY...
  function PyObject (line 10331) | static PyObject *__pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memor...
  function PyObject (line 10391) | static PyObject *__pyx_memoryview_is_f_contig(PyObject *__pyx_v_self, CY...
  function PyObject (line 10402) | static PyObject *__pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memor...
  function PyObject (line 10462) | static PyObject *__pyx_memoryview_copy(PyObject *__pyx_v_self, CYTHON_UN...
  function PyObject (line 10473) | static PyObject *__pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memor...
  function PyObject (line 10553) | static PyObject *__pyx_memoryview_copy_fortran(PyObject *__pyx_v_self, C...
  function PyObject (line 10564) | static PyObject *__pyx_memoryview___pyx_pf_15View_dot_MemoryView_10memor...
  function PyObject (line 10643) | static PyObject *__pyx_memoryview_new(PyObject *__pyx_v_o, int __pyx_v_f...
  function CYTHON_INLINE (line 10731) | static CYTHON_INLINE int __pyx_memoryview_check(PyObject *__pyx_v_o) {
  function PyObject (line 10770) | static PyObject *_unellipsify(PyObject *__pyx_v_index, int __pyx_v_ndim) {
  function PyObject (line 11229) | static PyObject *assert_direct_dimensions(Py_ssize_t *__pyx_v_suboffsets...
  type __pyx_memoryview_obj (line 11314) | struct __pyx_memoryview_obj
  type __pyx_memoryviewslice_obj (line 11321) | struct __pyx_memoryviewslice_obj
    type __pyx_memoryview_obj (line 969) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 11331) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 11336) | struct __pyx_memoryview_obj
  type __pyx_memoryviewslice_obj (line 11403) | struct __pyx_memoryviewslice_obj
    type __pyx_memoryview_obj (line 969) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 11815) | struct __pyx_memoryview_obj
  type __pyx_memoryview_obj (line 11856) | struct __pyx_memoryview_obj
  function __pyx_memoryview_slice_memviewslice (line 11891) | static int __pyx_memoryview_slice_memviewslice(__Pyx_memviewslice *__pyx...
  function __pyx_memslice_transpose (line 12988) | static int __pyx_memslice_transpose(__Pyx_memviewslice *__pyx_v_memslice) {
  function __pyx_memoryviewslice___dealloc__ (line 13159) | static void __pyx_memoryviewslice___dealloc__(PyObject *__pyx_v_self) {
  function __pyx_memoryviewslice___pyx_pf_15View_dot_MemoryView_16_memoryviewslice___dealloc__ (line 13168) | static void __pyx_memoryviewslice___pyx_pf_15View_dot_MemoryView_16_memo...
  function PyObject (line 13201) | static PyObject *__pyx_memoryviewslice_convert_item_to_object(struct __p...
  function PyObject (line 13284) | static PyObject *__pyx_memoryviewslice_assign_item_from_object(struct __...
  function PyObject (line 13366) | static PyObject *__pyx_pw_15View_dot_MemoryView_16_memoryviewslice_4base...
  function PyObject (line 13377) | static PyObject *__pyx_pf_15View_dot_MemoryView_16_memoryviewslice_4base...
  function PyObject (line 13417) | static PyObject *__pyx_memoryview_fromslice(__Pyx_memviewslice __pyx_v_m...
  function __Pyx_memviewslice (line 13769) | static __Pyx_memviewslice *__pyx_memoryview_get_slice_from_memoryview(st...
  function __pyx_memoryview_slice_copy (line 13869) | static void __pyx_memoryview_slice_copy(struct __pyx_memoryview_obj *__p...
  function PyObject (line 13993) | static PyObject *__pyx_memoryview_copy_object(struct __pyx_memoryview_ob...
  function PyObject (line 14050) | static PyObject *__pyx_memoryview_copy_object_from_slice(struct __pyx_me...
  function Py_ssize_t (line 14173) | static Py_ssize_t abs_py_ssize_t(Py_ssize_t __pyx_v_arg) {
  function __pyx_get_best_slice_order (line 14239) | static char __pyx_get_best_slice_order(__Pyx_memviewslice *__pyx_v_mslic...
  function _copy_strided_to_strided (line 14427) | static void _copy_strided_to_strided(char *__pyx_v_src_data, Py_ssize_t ...
  function copy_strided_to_strided (line 14661) | static void copy_strided_to_strided(__Pyx_memviewslice *__pyx_v_src, __P...
  function Py_ssize_t (line 14691) | static Py_ssize_t __pyx_memoryview_slice_get_size(__Pyx_memviewslice *__...
  function Py_ssize_t (line 14761) | static Py_ssize_t __pyx_fill_contig_strides_array(Py_ssize_t *__pyx_v_sh...
  type __pyx_memoryview_obj (line 14890) | struct __pyx_memoryview_obj
  function __pyx_memoryview_err_extents (line 15130) | static int __pyx_memoryview_err_extents(int __pyx_v_i, Py_ssize_t __pyx_...
  function __pyx_memoryview_err_dim (line 15220) | static int __pyx_memoryview_err_dim(PyObject *__pyx_v_error, char *__pyx...
  function __pyx_memoryview_err (line 15313) | static int __pyx_memoryview_err(PyObject *__pyx_v_error, char *__pyx_v_m...
  function __pyx_memoryview_copy_contents (line 15432) | static int __pyx_memoryview_copy_contents(__Pyx_memviewslice __pyx_v_src...
  function __pyx_memoryview_broadcast_leading (line 16006) | static void __pyx_memoryview_broadcast_leading(__Pyx_memviewslice *__pyx...
  function __pyx_memoryview_refcount_copying (line 16117) | static void __pyx_memoryview_refcount_copying(__Pyx_memviewslice *__pyx_...
  function __pyx_memoryview_refcount_objects_in_slice_with_gil (line 16167) | static void __pyx_memoryview_refcount_objects_in_slice_with_gil(char *__...
  function __pyx_memoryview_refcount_objects_in_slice (line 16206) | static void __pyx_memoryview_refcount_objects_in_slice(char *__pyx_v_dat...
  function __pyx_memoryview_slice_assign_scalar (line 16336) | static void __pyx_memoryview_slice_assign_scalar(__Pyx_memviewslice *__p...
  function __pyx_memoryview__slice_assign_scalar (line 16384) | static void __pyx_memoryview__slice_assign_scalar(char *__pyx_v_data, Py...
  type __pyx_vtabstruct_array (line 16503) | struct __pyx_vtabstruct_array
    type __pyx_array_obj (line 987) | struct __pyx_array_obj
  function PyObject (line 16505) | static PyObject *__pyx_tp_new_array(PyTypeObject *t, PyObject *a, PyObje...
  function __pyx_tp_dealloc_array (line 16524) | static void __pyx_tp_dealloc_array(PyObject *o) {
  function PyObject (line 16543) | static PyObject *__pyx_sq_item_array(PyObject *o, Py_ssize_t i) {
  function __pyx_mp_ass_subscript_array (line 16551) | static int __pyx_mp_ass_subscript_array(PyObject *o, PyObject *i, PyObje...
  function PyObject (line 16562) | static PyObject *__pyx_tp_getattro_array(PyObject *o, PyObject *n) {
  function PyObject (line 16571) | static PyObject *__pyx_getprop___pyx_array_memview(PyObject *o, CYTHON_U...
  type PyGetSetDef (line 16580) | struct PyGetSetDef
  type __pyx_array_obj (line 16624) | struct __pyx_array_obj
  type __pyx_MemviewEnum_obj (line 16688) | struct __pyx_MemviewEnum_obj
  type __pyx_MemviewEnum_obj (line 16694) | struct __pyx_MemviewEnum_obj
  function __pyx_tp_traverse_Enum (line 16705) | static int __pyx_tp_traverse_Enum(PyObject *o, visitproc v, void *a) {
  function __pyx_tp_clear_Enum (line 16714) | static int __pyx_tp_clear_Enum(PyObject *o) {
  type __pyx_MemviewEnum_obj (line 16730) | struct __pyx_MemviewEnum_obj
  type __pyx_vtabstruct_memoryview (line 16784) | struct __pyx_vtabstruct_memoryview
    type __pyx_memoryview_obj (line 1001) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1002) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1003) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1004) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1005) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1006) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1007) | struct __pyx_memoryview_obj
  function PyObject (line 16786) | static PyObject *__pyx_tp_new_memoryview(PyTypeObject *t, PyObject *a, P...
  function __pyx_tp_dealloc_memoryview (line 16807) | static void __pyx_tp_dealloc_memoryview(PyObject *o) {
  function __pyx_tp_traverse_memoryview (line 16829) | static int __pyx_tp_traverse_memoryview(PyObject *o, visitproc v, void *...
  function __pyx_tp_clear_memoryview (line 16847) | static int __pyx_tp_clear_memoryview(PyObject *o) {
  function PyObject (line 16862) | static PyObject *__pyx_sq_item_memoryview(PyObject *o, Py_ssize_t i) {
  function __pyx_mp_ass_subscript_memoryview (line 16870) | static int __pyx_mp_ass_subscript_memoryview(PyObject *o, PyObject *i, P...
  function PyObject (line 16881) | static PyObject *__pyx_getprop___pyx_memoryview_T(PyObject *o, CYTHON_UN...
  function PyObject (line 16885) | static PyObject *__pyx_getprop___pyx_memoryview_base(PyObject *o, CYTHON...
  function PyObject (line 16889) | static PyObject *__pyx_getprop___pyx_memoryview_shape(PyObject *o, CYTHO...
  function PyObject (line 16893) | static PyObject *__pyx_getprop___pyx_memoryview_strides(PyObject *o, CYT...
  function PyObject (line 16897) | static PyObject *__pyx_getprop___pyx_memoryview_suboffsets(PyObject *o, ...
  function PyObject (line 16901) | static PyObject *__pyx_getprop___pyx_memoryview_ndim(PyObject *o, CYTHON...
  function PyObject (line 16905) | static PyObject *__pyx_getprop___pyx_memoryview_itemsize(PyObject *o, CY...
  function PyObject (line 16909) | static PyObject *__pyx_getprop___pyx_memoryview_nbytes(PyObject *o, CYTH...
  function PyObject (line 16913) | static PyObject *__pyx_getprop___pyx_memoryview_size(PyObject *o, CYTHON...
  type PyGetSetDef (line 16925) | struct PyGetSetDef
  type __pyx_memoryview_obj (line 16977) | struct __pyx_memoryview_obj
  type __pyx_memoryviewslice_obj (line 17037) | struct __pyx_memoryviewslice_obj
    type __pyx_memoryview_obj (line 969) | struct __pyx_memoryview_obj
  type __pyx_vtabstruct_memoryview (line 17038) | struct __pyx_vtabstruct_memoryview
    type __pyx_memoryview_obj (line 1001) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1002) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1003) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1004) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1005) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1006) | struct __pyx_memoryview_obj
    type __pyx_memoryview_obj (line 1007) | struct __pyx_memoryview_obj
  type __pyx_memoryviewslice_obj (line 17045) | struct __pyx_memoryviewslice_obj
    type __pyx_memoryview_obj (line 969) | struct __pyx_memoryview_obj
  function __pyx_tp_traverse__memoryviewslice (line 17065) | static int __pyx_tp_traverse__memoryviewslice(PyObject *o, visitproc v, ...
  function __pyx_tp_clear__memoryviewslice (line 17075) | static int __pyx_tp_clear__memoryviewslice(PyObject *o) {
  function PyObject (line 17086) | static PyObject *__pyx_getprop___pyx_memoryviewslice_base(PyObject *o, C...
  type PyGetSetDef (line 17094) | struct PyGetSetDef
  type __pyx_memoryviewslice_obj (line 17102) | struct __pyx_memoryviewslice_obj
    type __pyx_memoryview_obj (line 969) | struct __pyx_memoryview_obj
  type PyModuleDef (line 17170) | struct PyModuleDef
  function __Pyx_InitCachedBuiltins (line 17285) | static int __Pyx_InitCachedBuiltins(void) {
  function __Pyx_InitCachedConstants (line 17300) | static int __Pyx_InitCachedConstants(void) {
  function __Pyx_InitGlobals (line 17601) | static int __Pyx_InitGlobals(void) {
  function PyMODINIT_FUNC (line 17616) | PyMODINIT_FUNC PyInit_cMinhash(void)
  function __Pyx_RefNannyAPIStruct (line 17974) | static __Pyx_RefNannyAPIStruct *__Pyx_RefNannyImportAPI(const char *modn...
  function PyObject (line 17990) | static PyObject *__Pyx_GetBuiltinName(PyObject *name) {
  function __Pyx_RaiseArgtupleInvalid (line 18004) | static void __Pyx_RaiseArgtupleInvalid(
  function __Pyx_RaiseDoubleKeywordsError (line 18030) | static void __Pyx_RaiseDoubleKeywordsError(
  function __Pyx_ParseOptionalKeywords (line 18044) | static int __Pyx_ParseOptionalKeywords(
  function __Pyx_RaiseArgumentTypeInvalid (line 18146) | static void __Pyx_RaiseArgumentTypeInvalid(const char* name, PyObject *o...
  function CYTHON_INLINE (line 18151) | static CYTHON_INLINE int __Pyx_ArgTypeTest(PyObject *obj, PyTypeObject *...
  function CYTHON_INLINE (line 18173) | static CYTHON_INLINE int __Pyx_IsLittleEndian(void) {
  function __Pyx_BufFmt_Init (line 18177) | static void __Pyx_BufFmt_Init(__Pyx_BufFmt_Context* ctx,
  function __Pyx_BufFmt_ParseNumber (line 18204) | static int __Pyx_BufFmt_ParseNumber(const char** ts) {
  function __Pyx_BufFmt_ExpectNumber (line 18219) | static int __Pyx_BufFmt_ExpectNumber(const char **ts) {
  function __Pyx_BufFmt_RaiseUnexpectedChar (line 18226) | static void __Pyx_BufFmt_RaiseUnexpectedChar(char ch) {
  function __Pyx_BufFmt_TypeCharToStandardSize (line 18254) | static size_t __Pyx_BufFmt_TypeCharToStandardSize(char ch, int is_comple...
  function __Pyx_BufFmt_TypeCharToNativeSize (line 18272) | static size_t __Pyx_BufFmt_TypeCharToNativeSize(char ch, int is_complex) {
  function __Pyx_BufFmt_TypeCharToAlignment (line 18301) | static size_t __Pyx_BufFmt_TypeCharToAlignment(char ch, CYTHON_UNUSED in...
  function __Pyx_BufFmt_TypeCharToPadding (line 18333) | static size_t __Pyx_BufFmt_TypeCharToPadding(char ch, CYTHON_UNUSED int ...
  function __Pyx_BufFmt_TypeCharToGroup (line 18351) | static char __Pyx_BufFmt_TypeCharToGroup(char ch, int is_complex) {
  function __Pyx_BufFmt_RaiseExpected (line 18372) | static void __Pyx_BufFmt_RaiseExpected(__Pyx_BufFmt_Context* ctx) {
  function __Pyx_BufFmt_ProcessTypeChunk (line 18396) | static int __Pyx_BufFmt_ProcessTypeChunk(__Pyx_BufFmt_Context* ctx) {
  function CYTHON_INLINE (line 18498) | static CYTHON_INLINE PyObject *
  function CYTHON_INLINE (line 18675) | static CYTHON_INLINE void __Pyx_ZeroBuffer(Py_buffer* buf) {
  function CYTHON_INLINE (line 18682) | static CYTHON_INLINE int __Pyx_GetBufferAndValidate(
  function CYTHON_INLINE (line 18716) | static CYTHON_INLINE void __Pyx_SafeReleaseBuffer(Py_buffer* info) {
  function CYTHON_INLINE (line 18742) | static CYTHON_INLINE PyObject* __Pyx_PyObject_Call(PyObject *func, PyObj...
  function CYTHON_INLINE (line 18761) | static CYTHON_INLINE int __Pyx_TypeTest(PyObject *obj, PyTypeObject *typ...
  function __Pyx_init_memviewslice (line 18774) | static int
  function CYTHON_INLINE (line 18827) | static CYTHON_INLINE void __pyx_fatalerror(const char *fmt, ...) {
  function CYTHON_INLINE (line 18839) | static CYTHON_INLINE int
  function CYTHON_INLINE (line 18849) | static CYTHON_INLINE int
  function CYTHON_INLINE (line 18859) | static CYTHON_INLINE void
  function CYTHON_INLINE (line 18880) | static CYTHON_INLINE void __Pyx_XDEC_MEMVIEW(__Pyx_memviewslice *memslice,
  function CYTHON_INLINE (line 18910) | static CYTHON_INLINE void __Pyx_ErrRestoreInState(PyThreadState *tstate,...
  function CYTHON_INLINE (line 18922) | static CYTHON_INLINE void __Pyx_ErrFetchInState(PyThreadState *tstate, P...
  function __Pyx_Raise (line 18934) | static void __Pyx_Raise(PyObject *type, PyObject *value, PyObject *tb,
  function CYTHON_INLINE (line 19096) | static CYTHON_INLINE void __Pyx_RaiseTooManyValuesError(Py_ssize_t expec...
  function CYTHON_INLINE (line 19102) | static CYTHON_INLINE void __Pyx_RaiseNeedMoreValuesError(Py_ssize_t inde...
  function CYTHON_INLINE (line 19109) | static CYTHON_INLINE void __Pyx_RaiseNoneNotIterableError(void) {
  function CYTHON_INLINE (line 19114) | static CYTHON_INLINE int __Pyx_PyBytes_Equals(PyObject* s1, PyObject* s2...
  function CYTHON_INLINE (line 19152) | static CYTHON_INLINE int __Pyx_PyUnicode_Equals(PyObject* s1, PyObject* ...
  function CYTHON_INLINE (line 19236) | static CYTHON_INLINE Py_ssize_t __Pyx_div_Py_ssize_t(Py_ssize_t a, Py_ss...
  function CYTHON_INLINE (line 19244) | static CYTHON_INLINE PyObject *__Pyx_GetAttr(PyObject *o, PyObject *n) {
  function CYTHON_INLINE (line 19257) | static CYTHON_INLINE PyObject* __Pyx_decode_c_string(
  function CYTHON_INLINE (line 19291) | static CYTHON_INLINE void __Pyx__ExceptionSave(PyThreadState *tstate, Py...
  function CYTHON_INLINE (line 19299) | static CYTHON_INLINE void __Pyx__ExceptionReset(PyThreadState *tstate, P...
  function CYTHON_INLINE (line 19315) | static CYTHON_INLINE int __Pyx_PyErr_ExceptionMatchesInState(PyThreadSta...
  function __Pyx_GetException (line 19327) | static int __Pyx_GetException(PyObject **type, PyObject **value, PyObjec...
  function CYTHON_INLINE (line 19386) | static CYTHON_INLINE void __Pyx__ExceptionSwap(PyThreadState *tstate, Py...
  function CYTHON_INLINE (line 19399) | static CYTHON_INLINE void __Pyx_ExceptionSwap(PyObject **type, PyObject ...
  function PyObject (line 19410) | static PyObject *__Pyx_Import(PyObject *name, PyObject *from_list, int l...
  function CYTHON_INLINE (line 19484) | static CYTHON_INLINE PyObject *__Pyx_GetItemInt_Generic(PyObject *o, PyO...
  function CYTHON_INLINE (line 19491) | static CYTHON_INLINE PyObject *__Pyx_GetItemInt_List_Fast(PyObject *o, P...
  function CYTHON_INLINE (line 19506) | static CYTHON_INLINE PyObject *__Pyx_GetItemInt_Tuple_Fast(PyObject *o, ...
  function CYTHON_INLINE (line 19521) | static CYTHON_INLINE PyObject *__Pyx_GetItemInt_Fast(PyObject *o, Py_ssi...
  function PyObject (line 19566) | static PyObject* __Pyx_PyInt_AddObjC(PyObject *op1, PyObject *op2, CYTHO...
  function CYTHON_INLINE (line 19663) | static CYTHON_INLINE void __Pyx_RaiseUnboundLocalError(const char *varna...
  function __Pyx_div_long (line 19668) | static CYTHON_INLINE long __Pyx_div_long(long a, long b) {
  function __Pyx_WriteUnraisable (line 19676) | static void __Pyx_WriteUnraisable(const char *name, CYTHON_UNUSED int cl...
  function CYTHON_INLINE (line 19719) | static CYTHON_INLINE PyObject* __Pyx_PyObject_CallMethO(PyObject *func, ...
  function PyObject (line 19739) | static PyObject* __Pyx__PyObject_CallOneArg(PyObject *func, PyObject *ar...
  function CYTHON_INLINE (line 19762) | static CYTHON_INLINE PyObject* __Pyx_PyObject_CallOneArg(PyObject *func,...
  function __Pyx_SetVtable (line 19773) | static int __Pyx_SetVtable(PyObject *dict, void *vtable) {
  function __pyx_bisect_code_objects (line 19791) | static int __pyx_bisect_code_objects(__Pyx_CodeObjectCacheEntry* entries...
  function PyCodeObject (line 19812) | static PyCodeObject *__pyx_find_code_object(int code_line) {
  function __pyx_insert_code_object (line 19826) | static void __pyx_insert_code_object(int code_line, PyCodeObject* code_o...
  function PyCodeObject (line 19874) | static PyCodeObject* __Pyx_CreateCodeObjectForTraceback(
  function __Pyx_AddTraceback (line 19926) | static void __Pyx_AddTraceback(const char *funcname, int c_line,
  function __Pyx_GetBuffer (line 19952) | static int __Pyx_GetBuffer(PyObject *obj, Py_buffer *view, int flags) {
  function __Pyx_ReleaseBuffer (line 19960) | static void __Pyx_ReleaseBuffer(Py_buffer *view) {
  function __pyx_memviewslice_is_contig (line 19975) | static int
  function __pyx_get_array_memory_extents (line 19998) | static void
  function __pyx_slices_overlap (line 20022) | static int
  function CYTHON_INLINE (line 20034) | static CYTHON_INLINE PyObject *
  function CYTHON_INLINE (line 20069) | static CYTHON_INLINE PyObject* __Pyx_PyInt_From_uint32_t(uint32_t value) {
  function CYTHON_INLINE (line 20096) | static CYTHON_INLINE PyObject* __Pyx_PyInt_From_long(long value) {
  function CYTHON_INLINE (line 20125) | static CYTHON_INLINE __pyx_t_float_complex __pyx_t_float_complex_from_pa...
  function CYTHON_INLINE (line 20129) | static CYTHON_INLINE __pyx_t_float_complex __pyx_t_float_complex_from_pa...
  function CYTHON_INLINE (line 20134) | static CYTHON_INLINE __pyx_t_float_complex __pyx_t_float_complex_from_pa...
  function CYTHON_INLINE (line 20145) | static CYTHON_INLINE int __Pyx_c_eqf(__pyx_t_float_complex a, __pyx_t_fl...
  function CYTHON_INLINE (line 20148) | static CYTHON_INLINE __pyx_t_float_complex __Pyx_c_sumf(__pyx_t_float_co...
  function CYTHON_INLINE (line 20154) | static CYTHON_INLINE __pyx_t_float_complex __Pyx_c_difff(__pyx_t_float_c...
  function CYTHON_INLINE (line 20160) | static CYTHON_INLINE __pyx_t_float_complex __Pyx_c_prodf(__pyx_t_float_c...
  function CYTHON_INLINE (line 20166) | static CYTHON_INLINE __pyx_t_float_complex __Pyx_c_quotf(__pyx_t_float_c...
  function CYTHON_INLINE (line 20173) | static CYTHON_INLINE __pyx_t_float_complex __Pyx_c_negf(__pyx_t_float_co...
  function CYTHON_INLINE (line 20179) | static CYTHON_INLINE int __Pyx_c_is_zerof(__pyx_t_float_complex a) {
  function CYTHON_INLINE (line 20182) | static CYTHON_INLINE __pyx_t_float_complex __Pyx_c_conjf(__pyx_t_float_c...
  function CYTHON_INLINE (line 20189) | static CYTHON_INLINE float __Pyx_c_absf(__pyx_t_float_complex z) {
  function CYTHON_INLINE (line 20196) | static CYTHON_INLINE __pyx_t_float_complex __Pyx_c_powf(__pyx_t_float_co...
  function CYTHON_INLINE (line 20247) | static CYTHON_INLINE __pyx_t_double_complex __pyx_t_double_complex_from_...
  function CYTHON_INLINE (line 20251) | static CYTHON_INLINE __pyx_t_double_complex __pyx_t_double_complex_from_...
  function CYTHON_INLINE (line 20256) | static CYTHON_INLINE __pyx_t_double_complex __pyx_t_double_complex_from_...
  function CYTHON_INLINE (line 20267) | static CYTHON_INLINE int __Pyx_c_eq(__pyx_t_double_complex a, __pyx_t_do...
  function CYTHON_INLINE (line 20270) | static CYTHON_INLINE __pyx_t_double_complex __Pyx_c_sum(__pyx_t_double_c...
  function CYTHON_INLINE (line 20276) | static CYTHON_INLINE __pyx_t_double_complex __Pyx_c_diff(__pyx_t_double_...
  function CYTHON_INLINE (line 20282) | static CYTHON_INLINE __pyx_t_double_complex __Pyx_c_prod(__pyx_t_double_...
  function CYTHON_INLINE (line 20288) | static CYTHON_INLINE __pyx_t_double_complex __Pyx_c_quot(__pyx_t_double_...
  function CYTHON_INLINE (line 20295) | static CYTHON_INLINE __pyx_t_double_complex __Pyx_c_neg(__pyx_t_double_c...
  function CYTHON_INLINE (line 20301) | static CYTHON_INLINE int __Pyx_c_is_zero(__pyx_t_double_complex a) {
  function CYTHON_INLINE (line 20304) | static CYTHON_INLINE __pyx_t_double_complex __Pyx_c_conj(__pyx_t_double_...
  function CYTHON_INLINE (line 20311) | static CYTHON_INLINE double __Pyx_c_abs(__pyx_t_double_complex z) {
  function CYTHON_INLINE (line 20318) | static CYTHON_INLINE __pyx_t_double_complex __Pyx_c_pow(__pyx_t_double_c...
  function CYTHON_INLINE (line 20367) | static CYTHON_INLINE PyObject* __Pyx_PyInt_From_int(int value) {
  function CYTHON_INLINE (line 20394) | static CYTHON_INLINE PyObject* __Pyx_PyInt_From_enum__NPY_TYPES(enum NPY...
  function __Pyx_memviewslice (line 20421) | static __Pyx_memviewslice
  function CYTHON_INLINE (line 20488) | static CYTHON_INLINE int __Pyx_PyInt_As_int(PyObject *x) {
  function CYTHON_INLINE (line 20673) | static CYTHON_INLINE uint32_t __Pyx_PyInt_As_uint32_t(PyObject *x) {
  function CYTHON_INLINE (line 20858) | static CYTHON_INLINE char __Pyx_PyInt_As_char(PyObject *x) {
  function __Pyx_PyInt_As_long (line 21043) | static CYTHON_INLINE long __Pyx_PyInt_As_long(PyObject *x) {
  function __pyx_typeinfo_cmp (line 21228) | static int
  function __pyx_check_strides (line 21269) | static int
  function __pyx_check_suboffsets (line 21322) | static int
  function __pyx_verify_contig (line 21345) | static int
  function __Pyx_ValidateAndInit_memviewslice (line 21377) | static int __Pyx_ValidateAndInit_memviewslice(
  function CYTHON_INLINE (line 21451) | static CYTHON_INLINE __Pyx_memviewslice __Pyx_PyObject_to_MemoryviewSlic...
  function CYTHON_INLINE (line 21474) | static CYTHON_INLINE __Pyx_memviewslice __Pyx_PyObject_to_MemoryviewSlic...
  function __Pyx_check_binary_version (line 21497) | static int __Pyx_check_binary_version(void) {
  function PyObject (line 21515) | static PyObject *__Pyx_ImportModule(const char *name) {
  function PyTypeObject (line 21533) | static PyTypeObject *__Pyx_ImportType(const char *module_name, const cha...
  function __Pyx_InitStrings (line 21596) | static int __Pyx_InitStrings(__Pyx_StringTabEntry *t) {
  function CYTHON_INLINE (line 21626) | static CYTHON_INLINE PyObject* __Pyx_PyUnicode_FromString(const char* c_...
  function CYTHON_INLINE (line 21629) | static CYTHON_INLINE char* __Pyx_PyObject_AsString(PyObject* o) {
  function CYTHON_INLINE (line 21691) | static CYTHON_INLINE int __Pyx_PyObject_IsTrue(PyObject* x) {
  function CYTHON_INLINE (line 21741) | static CYTHON_INLINE Py_ssize_t __Pyx_PyIndex_AsSsize_t(PyObject* b) {
  function CYTHON_INLINE (line 21803) | static CYTHON_INLINE PyObject * __Pyx_PyInt_FromSize_t(size_t ival) {

FILE: llm-train/megatron/gpt2/data/download.py
  function load_urls (line 114) | def load_urls(fh, max_urls=-1):
  function vet_link (line 121) | def vet_link(link):
  function download (line 140) | def download(
  function archive_chunk (line 193) | def archive_chunk(cid, cdata, out_dir, fmt, arch_meta):
  function load_state (line 227) | def load_state(url_file):
  function save_state (line 240) | def save_state(url_file, cid):
  function sqlite_conn (line 246) | def sqlite_conn():

FILE: llm-train/megatron/gpt2/data/file_utils.py
  function url_to_filename (line 39) | def url_to_filename(url, etag=None):
  function filename_to_url (line 57) | def filename_to_url(filename, cache_dir=None):
  function cached_path (line 83) | def cached_path(url_or_filename, cache_dir=None):
  function split_s3_path (line 113) | def split_s3_path(url):
  function s3_request (line 126) | def s3_request(func):
  function s3_etag (line 146) | def s3_etag(url):
  function s3_get (line 155) | def s3_get(url, temp_file):
  function http_get (line 162) | def http_get(url, temp_file):
  function get_from_cache (line 174) | def get_from_cache(url, cache_dir=None):
  function read_set_from_file (line 234) | def read_set_from_file(filename):
  function get_file_extension (line 246) | def get_file_extension(path, dot=True, lower=True):

FILE: llm-train/megatron/gpt2/merge_ck_and_inference/checkpoint_loader_megatron.py
  function add_arguments (line 8) | def add_arguments(parser):
  function _load_checkpoint (line 19) | def _load_checkpoint(queue, args):
  function load_checkpoint (line 345) | def load_checkpoint(queue, args):

FILE: llm-train/megatron/gpt2/merge_ck_and_inference/checkpoint_saver_megatron.py
  function add_arguments (line 9) | def add_arguments(parser):
  function save_checkpoint (line 22) | def save_checkpoint(queue, args):

FILE: llm-train/megatron/gpt2/merge_ck_and_inference/checkpoint_util.py
  function load_plugin (line 89) | def load_plugin(plugin_type, name):
  function main (line 106) | def main():

FILE: llm-train/megatron/gpt2/merge_ck_and_inference/run_text_generation_server.py
  function model_provider (line 21) | def model_provider(pre_process=True, post_process=True):
  function add_text_generate_args (line 29) | def add_text_generate_args(parser):

FILE: llm-train/peft/clm/peft_lora_clm_accelerate_ds_zero3_offload.py
  function levenshtein_distance (line 24) | def levenshtein_distance(str1, str2):
  function get_closest_label (line 45) | def get_closest_label(eval_pred, classes):
  function b2mb (line 57) | def b2mb(x):
  class TorchTracemalloc (line 62) | class TorchTracemalloc:
    method __enter__ (line 63) | def __enter__(self):
    method cpu_mem_used (line 77) | def cpu_mem_used(self):
    method peak_monitor_func (line 81) | def peak_monitor_func(self):
    method __exit__ (line 93) | def __exit__(self, *exc):
  function main (line 109) | def main():

FILE: llm-train/peft/multimodal/blip2_lora_int8_fine_tune.py
  class ImageCaptioningDataset (line 17) | class ImageCaptioningDataset(Dataset):
    method __init__ (line 18) | def __init__(self, dataset, processor):
    method __len__ (line 22) | def __len__(self):
    method __getitem__ (line 25) | def __getitem__(self, idx):
  function plot (line 34) | def plot(loss_list, output_path):
  function main (line 50) | def main():

FILE: llm-train/pytorch/distribution/data-parallel/ddp_launch.py
  class ToyModel (line 14) | class ToyModel(nn.Module):
    method __init__ (line 15) | def __init__(self):
    method forward (line 21) | def forward(self, x):
  function demo_basic (line 25) | def demo_basic(local_world_size, local_rank):
  function spmd_main (line 53) | def spmd_main(local_world_size, local_rank):

FILE: llm-train/pytorch/distribution/data-parallel/ddp_main.py
  function setup (line 13) | def setup(rank, world_size):
  function cleanup (line 21) | def cleanup():
  class ToyModel (line 25) | class ToyModel(nn.Module):
    method __init__ (line 26) | def __init__(self):
    method forward (line 32) | def forward(self, x):
  function demo_basic (line 36) | def demo_basic(rank, world_size):
  function run_demo (line 56) | def run_demo(demo_fn, world_size):
  function demo_checkpoint (line 63) | def demo_checkpoint(rank, world_size):
  class ToyMpModel (line 107) | class ToyMpModel(nn.Module):
    method __init__ (line 108) | def __init__(self, dev0, dev1):
    method forward (line 116) | def forward(self, x):
  function demo_model_parallel (line 125) | def demo_model_parallel(rank, world_size):

FILE: llm-train/pytorch/distribution/data-parallel/elastic_ddp.py
  class ToyModel (line 8) | class ToyModel(nn.Module):
    method __init__ (line 9) | def __init__(self):
    method forward (line 15) | def forward(self, x):
  function demo_basic (line 19) | def demo_basic():

FILE: llm-train/pytorch/distribution/pipeline-parallel/ddp_pipeline.py
  class PositionalEncoding (line 43) | class PositionalEncoding(nn.Module):
    method __init__ (line 45) | def __init__(self, d_model, dropout=0.1, max_len=5000):
    method forward (line 57) | def forward(self, x):
  class Encoder (line 89) | class Encoder(nn.Module):
    method __init__ (line 90) | def __init__(self, ntoken, ninp, dropout=0.5):
    method init_weights (line 97) | def init_weights(self):
    method forward (line 101) | def forward(self, src):
  class Decoder (line 107) | class Decoder(nn.Module):
    method __init__ (line 108) | def __init__(self, ntoken, ninp):
    method init_weights (line 113) | def init_weights(self):
    method forward (line 118) | def forward(self, inp):
  function run_worker (line 132) | def run_worker(rank, world_size):

FILE: llm-train/pytorch/distribution/tensor-parallel/2d_parallel_example.py
  function demo_2d (line 57) | def demo_2d(rank, args):

FILE: llm-train/pytorch/distribution/tensor-parallel/sequence_parallel_example.py
  function demo_sp (line 36) | def demo_sp(rank, args):

FILE: llm-train/pytorch/distribution/tensor-parallel/tensor_parallel_example.py
  function demo_tp (line 44) | def demo_tp(rank, args):

FILE: llm-train/pytorch/distribution/tensor-parallel/utils.py
  function setup (line 10) | def setup(rank, world_size):
  function cleanup (line 19) | def cleanup():
  class ToyModel (line 23) | class ToyModel(nn.Module):
    method __init__ (line 24) | def __init__(self):
    method forward (line 30) | def forward(self, x):

FILE: llm-train/qlora/accuracy.py
  class Accuracy (line 81) | class Accuracy(evaluate.Metric):
    method _info (line 82) | def _info(self):
    method _compute (line 101) | def _compute(self, predictions, references, normalize=True, sample_wei...

FILE: llm-train/qlora/qlora.py
  class ModelArguments (line 52) | class ModelArguments:
  class DataArguments (line 66) | class DataArguments:
  class TrainingArguments (line 102) | class TrainingArguments(transformers.Seq2SeqTrainingArguments):
  class GenerationArguments (line 190) | class GenerationArguments:
  function find_all_linear_names (line 221) | def find_all_linear_names(args, model):
  class SavePeftModelCallback (line 235) | class SavePeftModelCallback(transformers.TrainerCallback):
    method save_model (line 236) | def save_model(self, args, state, kwargs):
    method on_save (line 250) | def on_save(self, args, state, control, **kwargs):
    method on_train_end (line 254) | def on_train_end(self, args, state, control, **kwargs):
  function get_accelerate_model (line 262) | def get_accelerate_model(args, checkpoint_dir):
  function print_trainable_parameters (line 346) | def print_trainable_parameters(args, model):
  function smart_tokenizer_and_embedding_resize (line 363) | def smart_tokenizer_and_embedding_resize(
  class DataCollatorForCausalLM (line 386) | class DataCollatorForCausalLM(object):
    method __call__ (line 393) | def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
  function extract_unnatural_instructions_data (line 438) | def extract_unnatural_instructions_data(examples, extract_reformulations...
  function extract_alpaca_dataset (line 468) | def extract_alpaca_dataset(example):
  function local_dataset (line 475) | def local_dataset(dataset_name):
  function make_data_module (line 490) | def make_data_module(tokenizer: transformers.PreTrainedTokenizer, args) ...
  function get_last_checkpoint (line 617) | def get_last_checkpoint(checkpoint_dir):
  function train (line 631) | def train():

Condensed preview: 822 files, each showing path, character count, and a content snippet (the full structured content is 6,757K chars).

[
  {
    "path": ".gitignore",
    "chars": 3078,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": "LICENSE",
    "chars": 11357,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "README.md",
    "chars": 32558,
    "preview": "<p align=\"center\">\n  <img src=\"https://github.com/liguodongiot/llm-action/blob/main/pic/llm-action-v4.jpg\" >\n</p>\n\n\n<p> "
  },
  {
    "path": "ai-compiler/README.md",
    "chars": 4271,
    "preview": "\n## 树模型编译器\n\n- https://mlsys.org/Conferences/doc/2018/196.pdf\n- https://github.com/dmlc/treelite\n- https://treelite.readt"
  },
  {
    "path": "ai-compiler/Treebeard/README.md",
    "chars": 1043,
    "preview": "\nTreebeard: An Optimizing Compiler for Decision Tree Based ML Inference\n\n\n## 流程\n\n输入一个决策树数据结构，compiler通过一系列 IR 转化，将决策树数据结"
  },
  {
    "path": "ai-compiler/treelit/README.md",
    "chars": 170,
    "preview": "\n\n\n```\nconda create -n model-inference-venv python=3.9 -y\n\n\nconda activate model-inference-venv\n```\n\n\n\n\n\n\n\n- 机器学习：软件工程方法"
  },
  {
    "path": "ai-compiler/treelit/xgb.md",
    "chars": 59,
    "preview": "\n\n\n\n```\nconda create -n model-server-venv python=3.9 -y\n```"
  },
  {
    "path": "ai-compiler/triton-lang/README.md",
    "chars": 3,
    "preview": "\n\n\n"
  },
  {
    "path": "ai-framework/README.md",
    "chars": 149,
    "preview": "\n\n\n\n\n## 国外\n\n\n### PyTorch\n\n\n\n\n\n## 国内\n\n\n### Oneflow\n\n\n\n\n### PaddlePaddle\n\n\n\n\n### MindSpore\n\n\n\n\n\n\n自动混合精度\n\n- https://github."
  },
  {
    "path": "ai-framework/TensorRT-Model-Optimizer.md",
    "chars": 213,
    "preview": "\n\n\n\n- 代码：https://github.com/NVIDIA/TensorRT-Model-Optimizer\n- 文档：https://nvidia.github.io/TensorRT-Model-Optimizer/\n\n- 量"
  },
  {
    "path": "ai-framework/cuda/README.md",
    "chars": 3,
    "preview": "\n\n\n"
  },
  {
    "path": "ai-framework/deepspeed/1.DeepSpeed入门.md",
    "chars": 2450,
    "preview": "\n\n\n## DeepSpeed \n\n通过简单三步将Pytorch DDP模型训练改造 DeepSpeed DP 模型训练。\n\n第一步：**初始化DeepSpeed引擎**:\n```\nmodel_engine, optimizer, _, _"
  },
  {
    "path": "ai-framework/deepspeed/2.安装DeepSpeed.md",
    "chars": 5988,
    "preview": "\n## 安装DeepSpeed\n通过 pip 是最快捷的开始使用 DeepSpeed 的方式，这将安装最新版本的 DeepSpeed，不会与特定的 PyTorch 或 CUDA 版本绑定。DeepSpeed 包含若干个 C++/CUDA 扩"
  },
  {
    "path": "ai-framework/deepspeed/3.基于CIFAR-10使用DeepSpeed进行分布式训练 .md",
    "chars": 10889,
    "preview": "在本教程中，我们将向 CIFAR-10 模型中添加 DeepSpeed，这是一个小型图像分类模型。\n\n首先，我们将介绍如何运行原始的 CIFAR-10 模型。然后，我们将逐步启用此模型以在 DeepSpeed 中运行。\n\n## 运行原始的 "
  },
  {
    "path": "ai-framework/deepspeed/DeepSpeed配置JSON文件.md",
    "chars": 294,
    "preview": "## DeepSpeed Configuration JSON\n\n地址：https://www.deepspeed.ai/docs/config-json/\n\n\n\n### FP16 训练的 ZeRO 优化\n\n启用和配置 ZeRO 内存优化\n"
  },
  {
    "path": "ai-framework/deepspeed/README.md",
    "chars": 111,
    "preview": "\n\n\n- https://github.com/microsoft/DeepSpeedExamples\n- https://github.com/microsoft/DeepSpeedExamples.git\n\n\n\n\n\n\n"
  },
  {
    "path": "ai-framework/deepspeed/config-json/README.md",
    "chars": 4810,
    "preview": "- https://www.deepspeed.ai/docs/config-json/\n\n\n## Batch Size 相关的参数\n\n\ntrain_batch_size 必须等于 train_micro_batch_size_per_gp"
  },
  {
    "path": "ai-framework/deepspeed/config-json/deepspeed-nvme.md",
    "chars": 96,
    "preview": "\n\n\n\n\n\n- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning\n\n```\n\n```\n\n\n"
  },
  {
    "path": "ai-framework/deepspeed/deepspeed-slurm.md",
    "chars": 311,
    "preview": "\n\n\n\n\n\n## 支持的发布\n\nPDSH_LAUNCHER = 'pdsh'\nPDSH_MAX_FAN_OUT = 1024\n\nOPENMPI_LAUNCHER = 'openmpi'\nMPICH_LAUNCHER = 'mpich'\nIM"
  },
  {
    "path": "ai-framework/deepspeed/hello_bert/README.md",
    "chars": 3329,
    "preview": "\n# HelloDeepSpeed\n\n\n- 源码：https://github.com/microsoft/DeepSpeedExamples/tree/master/training/HelloDeepSpeed\n\n\n## HF\n\n```"
  },
  {
    "path": "ai-framework/deepspeed/hello_bert/train_bert.py",
    "chars": 30057,
    "preview": "import datetime\nimport json\nimport pathlib\nimport re\nimport string\nfrom functools import partial\nfrom typing import Any,"
  },
  {
    "path": "ai-framework/deepspeed/hello_bert/train_bert_ds.py",
    "chars": 61923,
    "preview": "import datetime\nimport json\nimport pathlib\nimport re\nimport string\nfrom functools import partial\nfrom typing import Any,"
  },
  {
    "path": "ai-framework/deepspeed/training/pipeline_parallelism/README.md",
    "chars": 121,
    "preview": "\n\n\n\n\n\n\n```\ndeepspeed --include localhost:3,4,5,6 train.py --deepspeed_config=ds_config.json -p 2 --steps=200\n```\n\n\n\n\n\n\n\n"
  },
  {
    "path": "ai-framework/dlrover.md",
    "chars": 120,
    "preview": "\n\n\nhttps://github.com/intelligent-machine-learning/dlrover\n\nDLRover: An Automatic Distributed Deep Learning System\n\n\n\n\n\n"
  },
  {
    "path": "ai-framework/huggingface-accelerate/README.md",
    "chars": 547,
    "preview": "\n\n- https://huggingface.co/docs/accelerate/package_reference/cli\n\n```\naccelerate env \n\n# \naccelerate config default [arg"
  },
  {
    "path": "ai-framework/huggingface-peft/README.md",
    "chars": 3,
    "preview": "\n\n\n"
  },
  {
    "path": "ai-framework/huggingface-transformers/API.md",
    "chars": 1008,
    "preview": "\n\n\n\n\n- https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py\n- TrainingArguments\n\n\n\n\n#"
  },
  {
    "path": "ai-framework/huggingface-transformers/FSDP.md",
    "chars": 611,
    "preview": "\n\n\n\n- https://pytorch.org/docs/stable/fsdp.html\n- https://huggingface.co/docs/accelerate/usage_guides/fsdp\n\n\ntransformer"
  },
  {
    "path": "ai-framework/huggingface-transformers/README.md",
    "chars": 1182,
    "preview": "\n\n## 量化\n\ntransformers 已经集成并 原生 支持了 bitsandbytes 和 auto-gptq 这两个量化库。\n\n\n- https://huggingface.co/docs/transformers/v4.35.2"
  },
  {
    "path": "ai-framework/huggingface-trl/README.md",
    "chars": 9,
    "preview": "\n\n\n\n\n\n\n\n\n"
  },
  {
    "path": "ai-framework/jax/README.md",
    "chars": 108,
    "preview": "\n\n\nJax 是我看过那么多项目中，唯一一个让我看了之后觉得「哇，软件还可以这么写，一切都很有道理」的项目。我觉得 Google 还是吸取了很多 Tensorflow 的经验，把它们都用到了 Jax 里面。\n\n\n\n\n"
  },
  {
    "path": "ai-framework/jax/reference.md",
    "chars": 122,
    "preview": "\n\n\n\n- https://jax.readthedocs.io/en/latest/notebooks/neural_network_with_tfds_data.html\n- https://github.com/google/jax\n"
  },
  {
    "path": "ai-framework/llama-cpp/README.md",
    "chars": 4218,
    "preview": "\n\n\n- https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file\n- https://github.com/ggerganov/llama.cpp\n\n\n\n\n\nGGUF量化"
  },
  {
    "path": "ai-framework/megatron-deepspeed/README.md",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "ai-framework/megatron-lm/README.md",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "ai-framework/mxnet/README.md",
    "chars": 822,
    "preview": "\n\n\n## 安装\n\n```\npip install --upgrade mxnet gluonnlp\n\npip install  mxnet==1.9.1 gluonnlp==0.10.0\n```\n\n## docker \n\n```\n# GP"
  },
  {
    "path": "ai-framework/mxnet/mnist.py",
    "chars": 3787,
    "preview": "\n\n# pylint: skip-file\nfrom __future__ import print_function\n\nimport argparse\nimport logging\nlogging.basicConfig(level=lo"
  },
  {
    "path": "ai-framework/mxnet/mxnet_cnn_mnist.py",
    "chars": 9779,
    "preview": "from __future__ import print_function\r\n\r\nimport argparse\r\nimport logging\r\nlogging.basicConfig(level=logging.INFO)\r\n\r\nimp"
  },
  {
    "path": "ai-framework/mxnet/mxnet_mlp_mnist.py",
    "chars": 3927,
    "preview": "# pylint: skip-file\r\nfrom __future__ import print_function\r\n\r\nimport argparse\r\nimport logging\r\nlogging.basicConfig(level"
  },
  {
    "path": "ai-framework/mxnet/oneflow_cnn_mnist.py",
    "chars": 6161,
    "preview": "import oneflow as flow\r\nimport oneflow.nn as nn\r\nfrom flowvision import transforms\r\nfrom flowvision import datasets\r\nimp"
  },
  {
    "path": "ai-framework/mxnet/oneflow_mlp_mnist.py",
    "chars": 3186,
    "preview": "import oneflow as flow\r\nimport oneflow.nn as nn\r\nfrom flowvision import transforms\r\nfrom flowvision import datasets\r\n\r\n\r"
  },
  {
    "path": "ai-framework/mxnet/reference.md",
    "chars": 122,
    "preview": "\n\n\n\n- https://github.com/apache/mxnet\n- https://github.com/dmlc/gluon-nlp/\n- https://nlp.gluon.ai/model_zoo/index.html\n\n"
  },
  {
    "path": "ai-framework/oneflow/README.md",
    "chars": 795,
    "preview": "\n\n## oneflow\n```\n\npython3 -m pip install oneflow==0.9.0\n\n\npython3 -m pip install -f https://release.oneflow.info oneflow"
  },
  {
    "path": "ai-framework/oneflow/oneflow_mlp_mnist.py",
    "chars": 3062,
    "preview": "\nimport oneflow as flow\nimport oneflow.nn as nn\nfrom flowvision import transforms\nfrom flowvision import datasets\n\n\nBATC"
  },
  {
    "path": "ai-framework/oneflow/reference.md",
    "chars": 171,
    "preview": "\n\n\n- https://github.com/Oneflow-Inc/oneflow\n- https://docs.oneflow.org/master/basics/04_build_network.html\n- https://doc"
  },
  {
    "path": "ai-framework/openai-triton/README.md",
    "chars": 100,
    "preview": "\n\n- https://github.com/openai/triton\n\n\n\n- OpenAI Triton 入门教程: https://zhuanlan.zhihu.com/p/684473453"
  },
  {
    "path": "ai-framework/paddlepaddle/README.md",
    "chars": 388,
    "preview": "\n\n- https://www.paddlepaddle.org.cn/install/quick\n- https://github.com/PaddlePaddle/PaddleNLP\n\n\n\n```\npython -m pip insta"
  },
  {
    "path": "ai-framework/paddlepaddle/reference.md",
    "chars": 553,
    "preview": "\n- paddle支持的硬件：https://www.paddlepaddle.org.cn/install/other\n\n- 【推荐】手写数字识别模型：https://www.paddlepaddle.org.cn/tutorials/p"
  },
  {
    "path": "ai-framework/pai-megatron-patch/README.md",
    "chars": 10397,
    "preview": "- https://github.com/alibaba/Pai-Megatron-Patch/\n\n```bash\ngit clone --recurse-submodules https://github.com/alibaba/Pai-"
  },
  {
    "path": "ai-framework/pai-torchacc.md",
    "chars": 205,
    "preview": "\n\n\n- https://help.aliyun.com/zh/pai/user-guide/torchacc-overview\n\n\nPAI-TorchAcc（Torch Accelerator）是基于PyTorch的训练加速框架，通过Gr"
  },
  {
    "path": "ai-framework/pytorch/README.md",
    "chars": 237,
    "preview": "\n\n\n## eager 模式\n\n\n\n\n## CUDA Graphs\n\n\n- Accelerating PyTorch with CUDA Graphs：https://pytorch.org/blog/accelerating-pytorc"
  },
  {
    "path": "ai-framework/pytorch/install.md",
    "chars": 879,
    "preview": "\n\n\n\n\n- 版本：https://pytorch.org/get-started/previous-versions/\n- https://download.pytorch.org/whl/torch/\n\n\n\n```\nconda inst"
  },
  {
    "path": "ai-framework/pytorch/reference.md",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "ai-framework/tensorflow/README.md",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "ai-framework/tensorflow/reference.md",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "ai-framework/transformer-engine/mnist/README.md",
    "chars": 18985,
    "preview": "\r\n\r\n```\r\ndocker rm -f transformer_engine\r\n\r\nnvidia-docker run -dti --name transformer_engine \\\r\n--restart=always --gpus "
  },
  {
    "path": "ai-framework/transformer-engine/mnist/main.py",
    "chars": 7707,
    "preview": "import argparse\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.optim as optim\nfrom torc"
  },
  {
    "path": "ai-framework/transformer-engine/mnist/main_stat.py",
    "chars": 8616,
    "preview": "import argparse\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.optim as optim\nfrom torc"
  },
  {
    "path": "ai-framework/unsloth-微调.md",
    "chars": 122,
    "preview": "\n\n\n- https://github.com/unslothai/unsloth\n\n\nUnsloth\n模型的微调，全部用 Triton Kernel 重写。从技术角度来看，这个项目非常有意思，它推到了 PyTorch 目前无法达到的优化极"
  },
  {
    "path": "ai-infra/ai-cluster/README.md",
    "chars": 82,
    "preview": "\r\n## AI硬件 \r\n\r\n\r\nA800只是在A100的基础上，将NVLink高速互连总线的带宽从600GB/s降低到400GB/s，仅此而已。\r\n\r\n\r\n\r\n\r\n"
  },
  {
    "path": "ai-infra/ai-hardware/AI芯片软件生态.md",
    "chars": 480,
    "preview": "\n\n\n\n## cuda\n\n\n```\nwget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_"
  },
  {
    "path": "ai-infra/ai-hardware/CUDA.md",
    "chars": 631,
    "preview": "\n\n\nCUDA 库主要包括以下几个部分：\n\nCUDA Runtime API：这是CUDA的核心库，提供了运行时的设备初始化、内存管理、内核执行等功能。\n\nCUDA Driver API：这是CUDA的底层驱动库，提供了与设备和操作系统底层"
  },
  {
    "path": "ai-infra/ai-hardware/GPU-network.md",
    "chars": 369,
    "preview": "\n\n\n\nnetwork\n- https://docs.nvidia.com/networking/display/mlnxofedv583070101/introduction\n\n\n\nnvme-of\n- https://docs.nvidi"
  },
  {
    "path": "ai-infra/ai-hardware/GPU相关环节变量.md",
    "chars": 172,
    "preview": "\n\n\n## CUDA\n\nCUDA_VISIBLE_DEVICES=1 \nexport CUDA_LAUNCH_BLOCKING=1\nexport CUDA_DEVICE_MAX_CONNECTIONS=1\n\n\n## NCCL\n\n\nexpor"
  },
  {
    "path": "ai-infra/ai-hardware/NIXL.md",
    "chars": 37,
    "preview": "\n\n\nhttps://github.com/ai-dynamo/nixl\n"
  },
  {
    "path": "ai-infra/ai-hardware/OEM-DGX.md",
    "chars": 477,
    "preview": "\n\n\n\n\n## H3C\n\nH3C UniServer R5500LC G5服务器---全新A800 GPU的人工智能液冷服务器，支持HGX A800 8-GPU模组，8块A800 GPU通过6个NVSWITCH实现400GB/s的全互联，A"
  },
  {
    "path": "ai-infra/ai-hardware/README.md",
    "chars": 563,
    "preview": "\r\n\r\n## Nvidia GPU\r\n\r\n\r\nNVIDIA A100 80GB PCIe GPU: https://www.edomtech.com.cn/product-detail/nvidia-a100-80gb-pcie-gpu/\r"
  },
  {
    "path": "ai-infra/ai-hardware/TSMC-台积电.md",
    "chars": 64,
    "preview": "\nTSMC N7（7纳米工艺）-DUV（深紫外线）光刻技术\n\n\nTSMC N4（4纳米工艺）-EUV（极紫外线）光刻技术\n\n\n\n"
  },
  {
    "path": "ai-infra/ai-hardware/cuda镜像.md",
    "chars": 291,
    "preview": "\n\n\n```\nhttps://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda\nhttps://gitlab.com/nvidia/container-images/cuda/-/tree"
  },
  {
    "path": "ai-infra/ai-hardware/gpudirect.md",
    "chars": 1890,
    "preview": "\n\n\n- https://docs.nvidia.com/gpudirect-storage/design-guide/index.html\n- https://docs.nvidia.com/gpudirect-storage/overv"
  },
  {
    "path": "ai-infra/ai-hardware/硬件对比.md",
    "chars": 241,
    "preview": "\n\n\nFPGA 相比同等面积和工艺 ASIC 的算力差着数量级\n\n\n大多数大模型 16-bit 权重真的可以量化到 8-bit 而不太损失精度。但要压缩到 4-bit，精度一般就会有比较大的损失了。\n\n\n\nNVIDIA 的 Tensor C"
  },
  {
    "path": "ai-infra/communication.md",
    "chars": 113,
    "preview": "\n\n\n- MoE 通信优化技术 COMET 开源: https://zhuanlan.zhihu.com/p/29264560896\n- https://github.com/bytedance/flux\n\n\n\n\n\n\n\n\n\n\n"
  },
  {
    "path": "ai-infra/存储/README.md",
    "chars": 128,
    "preview": "\n\n\n- [GDDR6 vs DDR4 vs HBM2?为什么CPU还不用GDDR？异构内存的未来在哪里？](https://www.zhihu.com/tardis/zm/art/83935084?source_id=1003)\n\n\n\n\n"
  },
  {
    "path": "ai-infra/存储/REF.md",
    "chars": 5,
    "preview": "\n\n\n\n\n"
  },
  {
    "path": "ai-infra/存储/nvme-ssd.md",
    "chars": 241,
    "preview": "\n\n\n- ChatGPT一路狂飙，NVMe SSD能否应对性能挑战？：https://blog.csdn.net/Memblaze_2011/article/details/129040963\n- NVMe 2.0 简介：https://b"
  },
  {
    "path": "ai-infra/存储/固态硬盘.md",
    "chars": 1429,
    "preview": "\n\n\n- NVME高端固态硬盘推荐（PCIE3.0篇）: https://zhuanlan.zhihu.com/p/455414014\n\n## 固态硬盘组成\n\n固态硬盘一般由四个部分组成：控制单元（主控）、存储单元（颗粒）、缓存单元、电路板"
  },
  {
    "path": "ai-infra/存储/存储.md",
    "chars": 327,
    "preview": "\n\n\n\n\n\n\n\n\n\n\n- [硬盘科普，M.2，PCI-E，NVMe 傻傻分不清](https://zhuanlan.zhihu.com/p/396745362)\n\n物理接口，通道，协议\n\n\n\n- [NVMe、AHCI、PCIe、SATA、N"
  },
  {
    "path": "ai-infra/算力/AI芯片.md",
    "chars": 3764,
    "preview": "\n\n\n## 摩尔线程\n\n\n2022年，摩尔线程就推出了GPU统一系统架构MUSA，发布并量产“苏堤”和“春晓”两颗全功能GPU芯片，这也是国内采用现代GPU架构\n\n\n\n\n\n## 主流 AI 芯片配置\n\n\n\n| 厂商  | 型号       "
  },
  {
    "path": "ai-infra/算力/GPU工作原理.md",
    "chars": 119,
    "preview": "\n\n\n- [GPU 工作原理解析](https://zhuanlan.zhihu.com/p/697694330)\n- [GPU 架构与 CUDA 关系](https://zhuanlan.zhihu.com/p/697746975)\n\n"
  },
  {
    "path": "ai-infra/算力/NVIDIA-GPU型号.md",
    "chars": 2120,
    "preview": "\n\n\nNvidia下游市场分为四类：游戏、专业可视化、数据中心、汽车，各市场重点产品如下：\n\n游戏：GeForce RTX/GTX系列GPU（PCs）、GeForce NOW（云游戏）、SHIELD（游戏主机）；\n\n专业可视化：Quadro"
  },
  {
    "path": "ai-infra/算力/推理芯片.md",
    "chars": 71,
    "preview": "\n\n如果说大模型「上半场」是技术的较量，那么「下半场」则是商业化的比拼。一旦大模型成熟，与之而来的便是落地应用，滋生对推理芯片的庞大需求。\n\n"
  },
  {
    "path": "ai-infra/算力/昇腾NPU.md",
    "chars": 24,
    "preview": "\n\n\n\nAtlas 800-9000A2\n\n\n\n"
  },
  {
    "path": "ai-infra/网络/HPC性能测试.md",
    "chars": 928,
    "preview": "\n\n\n\n- HPC-单机&多机点对点RDMA网络性能测试：https://www.volcengine.com/docs/6419/164863\n\n\n```\napt update && apt install -y infiniband-d"
  },
  {
    "path": "ai-infra/网络/IB-docker.md",
    "chars": 37,
    "preview": "\n\n\n\n\n\n\n```\nyum install libibverbs\n```"
  },
  {
    "path": "ai-infra/网络/IB流量监控.md",
    "chars": 92,
    "preview": "\n\n\nifstat,nload 这些工具都只能监控 TCP/IP 的流量，因此虽然其上面能显示出 IB 卡，但其实并不能监控到出入 IB 的流量数据，结果中对应部分一直都是 0。\n\n\n"
  },
  {
    "path": "ai-infra/网络/IB软件.md",
    "chars": 1312,
    "preview": "\n\n\n\n- centos.install.mellanox.gpudirect.md\n- Assueme NVIDIA Driver and CUDA already successfully installed.\n- https://gi"
  },
  {
    "path": "ai-infra/网络/InfiniBand.md",
    "chars": 635,
    "preview": "\n\nInfiniBand网络接口的一种分类方式，按照数据传输速率的的不同进行区分。具体如下：\n\nSDR（Single Data Rate）：单倍数据率，即8Gb/s 。\nDDR（Double Data Rate）：双倍数据率，即16Gb/s"
  },
  {
    "path": "ai-infra/网络/NCCL.md",
    "chars": 413,
    "preview": "\n\nNCCL 通信库仅针对 Nvidia Spectrum-X 和 Nvidia InfiniBand 进行了优化。\n\n博通 Tomahawk 5 以太网方案，客户需要有足够的工程能力来为 Tomahawk 5 适配及优化英伟达的 NCCL"
  },
  {
    "path": "ai-infra/网络/README.md",
    "chars": 614,
    "preview": "\n\n\n\n- 聊透 GPU 通信技术——GPU Direct、NVLink、RDMA: https://zhuanlan.zhihu.com/p/654417967\n- 腾讯机智团队分享--GPU数据传输概览: https://zhuanla"
  },
  {
    "path": "ai-infra/网络/REF.md",
    "chars": 98,
    "preview": "\n\n\n- [RoCE、IB和TCP等网络的基本知识及差异对比](https://support.huawei.com/enterprise/zh/doc/EDOC1100203347)\n\n\n\n\n\n"
  },
  {
    "path": "ai-infra/网络/Spine-Leaf和InfiniBand网络架构区别简述.md",
    "chars": 1695,
    "preview": "\nSpine-Leaf和InfiniBand是两种不同的网络架构和技术，它们在设计和应用上有一些区别。\n \n1. Spine-Leaf网络架构：\n   - Spine-Leaf是一种扁平化（flat）的网络架构，通常应用于数据中心网络。它由"
  },
  {
    "path": "ai-infra/网络/nccl-test-集合通讯的性能测试.md",
    "chars": 1008,
    "preview": "\n\n- https://github.com/NVIDIA/nccl-tests\n- https://cloud.baidu.com/doc/GPU/s/Yl3mr0ren\n- HPC-基于NCCL通信库的多机RDMA网络性能测试: htt"
  },
  {
    "path": "ai-infra/网络/nvbandwidth.md",
    "chars": 28787,
    "preview": "# nvbandwidth\n\n用于测量 NVIDIA GPU 带宽的工具。\n\n使用copy engine或kernel copy方法测量不同链路上各种 memcpy 模式的带宽。 \n\nnvbandwidth 报告系统上当前测量的带宽。 可能"
  },
  {
    "path": "ai-infra/网络/roce.md",
    "chars": 527,
    "preview": "\n\n\nAI场景下高性能网络技术RoCE v2介绍: https://mp.weixin.qq.com/s/XyMFst3w-d65u4fU7cgLPA\n\n\nRoCE是基于 Ethernet的RDMA，RoCEv1版本基于网络链路层，无法跨网"
  },
  {
    "path": "ai-infra/网络/网络硬件.md",
    "chars": 316,
    "preview": "\nLOC PIX PXB PHB SYS\n\n\nGPU间的通讯速度：\n\nNV# > PIX > PXB > PHB > NODE > SYS\n\n\n- SYS ： 穿越 PCIe 的连接以及 NUMA 节点之间的 SMP 互连（例如 QPI/U"
  },
  {
    "path": "ai-infra/网络/通信软件.md",
    "chars": 646,
    "preview": "\n\n\n\nOpen MPI / MPICH\n\n- https://github.com/pmodels/mpich\n- https://github.com/open-mpi/ompi\n\n\n\n\n\n\nMPI有多种实现方式，例如OpenMPI，M"
  },
  {
    "path": "ai-infra/网络/集合通信原语.md",
    "chars": 78,
    "preview": "\n\n\n集合通信总结和 mpi4py 实践\n\nhttps://www.armcvai.cn/2025-06-28/mpi4py-summary.html\n\n\n"
  },
  {
    "path": "blog/TODO.md",
    "chars": 639,
    "preview": "\n\n\nllm推理优化技术：\n- https://github.com/liguodongiot/llm-action/blob/main/docs/llm-inference/llm%E6%8E%A8%E7%90%86%E4%BC%98%E"
  },
  {
    "path": "blog/ai-infra/AI 集群基础设施 InfiniBand 详解.md",
    "chars": 30514,
    "preview": "\nGPU在高性能计算和深度学习加速中扮演着非常重要的角色， GPU的强大的并行计算能力，大大提升了运算性能。随着运算数据量的不断攀升，GPU间需要大量的交换数据，因此，GPU通信性能成为了非常重要的指标。\n\n在 AI 集群中进行分布式训练时"
  },
  {
    "path": "blog/ai-infra/AI 集群基础设施 NVMe SSD 详解.md",
    "chars": 38288,
    "preview": "\n随着 AI 和 HPC 数据集的大小不断增加，为给定应用程序加载数据所花费的时间开始对整个应用程序的性能造成压力。 在考虑端到端应用程序性能时，快速的 GPU 通过缓慢的 I/O 将显著降低GPU的利用率。\n\nI/O 是将数据从存储加载到"
  },
  {
    "path": "blog/distribution-parallelism/大模型分布式训练并行技术（一）-概述.md",
    "chars": 4961,
    "preview": "近年来，随着Transformer、MOE架构的提出，使得深度学习模型轻松突破上万亿规模参数，传统的单机单卡模式已经无法满足超大模型进行训练的要求。因此，我们需要基于单机多卡、甚至是多机多卡进行分布式大模型的训练。\n\n而利用AI集群，使深度"
  },
  {
    "path": "blog/distribution-parallelism/大模型分布式训练并行技术（九）-总结.md",
    "chars": 6117,
    "preview": "\n\n\n\n近年来，随着Transformer、MOE架构的提出，使得深度学习模型轻松突破上万亿规模参数，传统的单机单卡模式已经无法满足超大模型进行训练的要求。因此，我们需要基于单机多卡、甚至是多机多卡进行分布式大模型的训练。\n\n而利用AI集群"
  },
  {
    "path": "blog/distribution-parallelism/大模型分布式训练并行技术（六）-多维混合并行.md",
    "chars": 7391,
    "preview": "\n近年来，随着Transformer、MOE架构的提出，使得深度学习模型轻松突破上万亿规模参数，传统的单机单卡模式已经无法满足超大模型进行训练的要求。因此，我们需要基于单机多卡、甚至是多机多卡进行分布式大模型的训练。\n\n而利用AI集群，使深"
  },
  {
    "path": "blog/llm-algo/moe.md",
    "chars": 85,
    "preview": "\n\n\n\n将输入路由到不止一个专家，以便门控学会如何进行有效的路由选择，因此至少需要选择两个专家。Switch Transformers 就这点进行了更多的研究。\n\n\n\n\n"
  },
  {
    "path": "blog/llm-algo/大白话Transformer架构.md",
    "chars": 815,
    "preview": "\n\nAttention（注意力机制）： Attention机制允许模型为输入序列中的每个位置分配不同的权重，用以关注输入序列中不同位置的信息。它通过计算每个位置与其他所有位置之间的相似度（通过点积、缩放点积等方法），然后将这些相似度转换成权"
  },
  {
    "path": "blog/llm-compression/大模型量化技术原理-ZeroQuant系列.md",
    "chars": 1253,
    "preview": "\n近年来，随着Transformer、MOE架构的提出，使得深度学习模型轻松突破上万亿规模参数，从而导致模型变得越来越大，因此，我们需要一些大模型压缩技术来降低模型部署的成本，并提升模型的推理性能。\n模型压缩主要分为如下几类：\n\n-   剪"
  },
  {
    "path": "blog/llm-compression/大模型量化技术原理：QoQ量化及QServe推理服务系统.md",
    "chars": 21309,
    "preview": "近年来，随着Transformer、MOE架构的提出，使得深度学习模型轻松突破上万亿规模参数，从而导致模型变得越来越大，因此，我们需要一些大模型压缩技术来降低模型部署的成本，并提升模型的推理性能。\n模型压缩主要分为如下几类：\n\n-   剪枝"
  },
  {
    "path": "blog/llm-inference/大模型推理框架概述.md",
    "chars": 15051,
    "preview": "\n从 ChatGPT 面世以来，引领了大模型时代的变革，除了大模型遍地开花以外，承载大模型进行推理的框架也是层出不穷，大有百家争鸣的态势。本文主要针对业界知名度较高的一些大模型推理框架进行相应的概述。\n\n## vLLM\n\n- GitHub:"
  },
  {
    "path": "blog/llm-localization/大模型国产化适配1-华为昇腾AI全栈软硬件平台总结.md",
    "chars": 13778,
    "preview": "随着 ChatGPT 的现象级走红，引领了AI大模型时代的变革，从而导致 AI 算力日益紧缺。与此同时，中美贸易战，导致AI算力国产化适配势在必行。本文主要对最近使用昇腾芯片做一个简单总结。\n\n\n## 昇腾AI全栈软硬件平台简述\n\n昇腾芯片"
  },
  {
    "path": "blog/llm-localization/大模型国产化适配4-基于昇腾910使用LLaMA-13B进行多机多卡训练.md",
    "chars": 18437,
    "preview": "\n\n随着 ChatGPT 的现象级走红，引领了 AI 大模型时代的变革，从而导致 AI 算力日益紧缺。与此同时，中美贸易战以及美国对华进行AI芯片相关的制裁导致 AI 算力的国产化适配势在必行。之前讲述了**基于昇腾910使用ChatGLM"
  },
  {
    "path": "blog/llm-peft/大模型参数高效微调技术原理综述（一）-背景、参数高效微调简介.md",
    "chars": 4951,
    "preview": "随着，ChatGPT 迅速爆火，引发了大模型的时代变革。然而对于普通大众来说，进行大模型的预训练或者全量微调遥不可及。由此，催生了各种参数高效微调技术，让科研人员或者普通开发者有机会尝试微调大模型。\n\n因此，该技术值得我们进行深入分析其背后"
  },
  {
    "path": "blog/llm-peft/大模型参数高效微调技术原理综述（五）-LoRA、AdaLoRA、QLoRA.md",
    "chars": 6864,
    "preview": "随着，ChatGPT 迅速爆火，引发了大模型的时代变革。然而对于普通大众来说，进行大模型的预训练或者全量微调遥不可及。由此，催生了各种参数高效微调技术，让科研人员或者普通开发者有机会尝试微调大模型。\n\n因此，该技术值得我们进行深入分析其背后"
  },
  {
    "path": "blog/reference/高性能 LLM 推理框架的设计与实现.md",
    "chars": 64,
    "preview": "\n\n\n- 高性能 LLM 推理框架的设计与实现：https://zhuanlan.zhihu.com/p/682872971\n\n"
  },
  {
    "path": "docs/README.md",
    "chars": 351,
    "preview": "\n\n\n## [LLM 基础](https://github.com/liguodongiot/llm-action/tree/main/docs/llm-base)\n\n\n\n## [LLM 面试题](https://github.com/li"
  },
  {
    "path": "docs/conda.md",
    "chars": 89,
    "preview": "\n\n安装\n- https://docs.anaconda.com/free/miniconda/\n\n- https://repo.anaconda.com/miniconda/\n"
  },
  {
    "path": "docs/flash-attention/FlashAttention.md",
    "chars": 688,
    "preview": "\n\n\n\n\n- https://github.com/Dao-AILab/flash-attention\n\n- FlashAttention: Fast and Memory-Efficient Exact Attention with IO"
  },
  {
    "path": "docs/llm-base/FLOPS.md",
    "chars": 69,
    "preview": "\n\n\n## FLOPS\n\nFLOPS（Floating-point operations per second），每秒浮点运算次数\n\n\n\n"
  },
  {
    "path": "docs/llm-base/NVIDIA-Nsight-Systems性能分析.md",
    "chars": 429,
    "preview": "\n\n\n# NVIDIA Nsight Systems\n\n\nNVIDIA Nsight Systems是一款低开销性能分析工具，旨在为开发人员提供优化软件所需的洞察力。无偏差的活动数据可在工具中可视化，可帮助用户调查瓶颈，避免推断误报，并以更"
  },
  {
    "path": "docs/llm-base/README.md",
    "chars": 557,
    "preview": "\r\n\r\n\r\n## [AI 算法](https://github.com/liguodongiot/llm-action/blob/main/docs/llm-base/ai-algo.md)\r\n\r\n## [AI 集群](https://gi"
  },
  {
    "path": "docs/llm-base/a800-env-install.md",
    "chars": 9673,
    "preview": "\r\n## GCC 升级\r\n\r\n```\r\nyum update -y\r\nyum install -y centos-release-scl\r\nyum install -y devtoolset-9\r\n\r\n\r\nsource /opt/rh/de"
  },
  {
    "path": "docs/llm-base/ai-algo.md",
    "chars": 514,
    "preview": "\n\n\n\n## CodeGeeX\n\n- CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X\n- http"
  },
  {
    "path": "docs/llm-base/autoregressive-lm-decoding-methods.md",
    "chars": 10762,
    "preview": "近年来，以Transformers架构为基础的大模型正在席卷整个AI界。Transformer 开创了继 MLP 、CNN和 RNN之后的第四大类模型。而基于Transformer结构的模型又可以分为Encoder-only、Decoder"
  },
  {
    "path": "docs/llm-base/dcgmi.md",
    "chars": 88581,
    "preview": "\n\n\n下表列出了不同 GPU 产品上支持的功能。\n\n| Feature Group | Tesla | Titan | Quadro | GeForce |\n| --- | --- | --- | --- | --- |\n| Field V"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/README.md",
    "chars": 1731,
    "preview": "\r\n\r\n\r\n\r\n- One weird trick for parallelizing convolutional neural networks\r\n  - 不同的层适合用不同的并行方式，具体的，卷积层数据比参数大，适合数据并行，全连接层参"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/auto-parallel/Alpa.md",
    "chars": 7301,
    "preview": "\n\n- https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin\n\n\n\n- alpa: https://www.zhihu.com/question/414549"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/auto-parallel/Flexflow.md",
    "chars": 3596,
    "preview": "\n- 原paper\n\n## Beyond Data and Model Parallelism for Deep Neural Networks\n\n### 概要\n\n训练深度神经网络 (DNN) 的计算要求已经增长到现在并行训练已成为标准做法"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/auto-parallel/Galvatron.md",
    "chars": 2112,
    "preview": "\n\n北大河图大模型自动并行训练工具Galvatron：https://zhuanlan.zhihu.com/p/591924340\n\n\n\n\n系统特性\n为了解决上述问题，研究者们提出了一些系列工作来探索混合并行的自动搜索：一类工作主要讨论了同"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/auto-parallel/Mesh-Tensorflow.md",
    "chars": 681,
    "preview": "\n\n\n\n\n- Mesh-Tensorflow: 广义分布式: https://zhuanlan.zhihu.com/p/342223356\n\n\n在深度学习中，由于数据量和计算量的浩大，往往会使用到分布式计算。而最常用的分布式模式是SPMD("
  },
  {
    "path": "docs/llm-base/distribution-parallelism/auto-parallel/README.md",
    "chars": 2495,
    "preview": "\r\n\r\n- 分布式训练自动并行论文：https://zhuanlan.zhihu.com/p/642446009\r\n- 北大河图大模型自动并行训练工具Galvatron：https://zhuanlan.zhihu.com/p/591924"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/auto-parallel/Unity.md",
    "chars": 224,
    "preview": "\n\n\n- Unity：通过代数变换和并行化的联合优化加速 DNN 训练：https://www.victorlamp.com/article/7387511088\n- 【论文赏读】Unity: Accelerating DNN Traini"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/auto-parallel/auto-parallel.md",
    "chars": 218,
    "preview": "\n\n\n- Colossal-Auto\n- MindSpore\n-  [Tofu ](https://arxiv.org/abs/1807.08887),\n-  [Flexflow ](https://arxiv.org/abs/1807.0"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/auto-parallel/gspmd.md",
    "chars": 401,
    "preview": "\n\n\n\n- GSPMD\n\n- GSPMD:General and Scalable Parallelization for ML Computation Graphs: https://zhuanlan.zhihu.com/p/506026"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/auto-parallel/分布式训练自动并行概述.md",
    "chars": 86,
    "preview": "\n\n\n\n\nA Survey on Auto-Parallelism of Neural Networks Training\n\n\n\n\n\n\n\n## 2. 问题定义\n\n\n\n\n\n\n"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/auto-parallel/飞桨面向异构场景下的自动并行设计与实践.md",
    "chars": 7628,
    "preview": "\n\n\n- 飞桨面向异构场景下的自动并行设计与实践: https://www.51cto.com/article/753512.html\n\n## 一、背景介绍\n\n\n第一个维度是自动并行的程度，分为全自动和半自动；\n\n第二个维度是并行粒度，分别"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/data-parallelism/README.md",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "docs/llm-base/distribution-parallelism/moe-parallel/README.md",
    "chars": 1919,
    "preview": "\n\n- https://github.com/laekov/fastmoe\n- SmartMoE: https://github.com/zms1999/SmartMoE\n\n\n\n\n\n- 飞浆-MOE：https://www.paddlepa"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/moe-parallel/moe-framework.md",
    "chars": 660,
    "preview": "\n\n\n\n## colossalai\n\n- https://colossalai.org/zh-Hans/docs/advanced_tutorials/integrate_mixture_of_experts_into_your_model"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/moe-parallel/moe-parallel.md",
    "chars": 2,
    "preview": "\n\n"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/moe-parallel/paddle_moe.py",
    "chars": 1929,
    "preview": "\n# 导入需要的包\nimport paddle\nfrom paddle.nn import Layer, LayerList, Linear, Dropout\nfrom paddle.incubate.distributed.models."
  },
  {
    "path": "docs/llm-base/distribution-parallelism/multidimensional-hybrid-parallel/README.md",
    "chars": 7766,
    "preview": "- https://huggingface.co/docs/transformers/perf_train_gpu_many\r\n- https://huggingface.co/transformers/v4.12.5/parallelis"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/pipeline-parallelism/README.md",
    "chars": 211,
    "preview": "\n\n\n\nDP 将批次（global batch size）拆分为小批次（mini-batch）。PP 将一个小批次切分为多个块 (chunks)，因此，PP 引入了微批次(micro-batch，MBS) 的概念。\n\n计算 DP + PP "
  },
  {
    "path": "docs/llm-base/distribution-parallelism/tensor-parallel/README.md",
    "chars": 138,
    "preview": "\n\n\n\nMegatron-LM 的张量并行，通信量很大，同时，计算和通信没办法同时进行。\n\n\n\n需要特别考虑的是：由于前向和后向传播中每层都有两个 all reduce，因此 TP 需要设备间有非常快速的互联。因此，除非你有一个非常快的网络"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/tensor-parallel/tensor-parallel.md",
    "chars": 1045,
    "preview": "\n\n\n\n我们以一个线性层为例，它包括一个通用矩阵乘法(GEMM)：$Y=XA$。 给定2个处理器，我们把列 A 划分为 $[A1 A2]$, 并在每个处理器上计算 $Y_i=XA_i$ ， 然后，形成 $[Y_1 Y_2]=[XA_1 XA"
  },
  {
    "path": "docs/llm-base/distribution-parallelism/并行技术.drawio",
    "chars": 1496,
    "preview": "<mxfile host=\"Electron\" modified=\"2023-08-31T11:56:35.644Z\" agent=\"5.0 (Macintosh; Intel Mac OS X 12_3_1) AppleWebKit/53"
  },
  {
    "path": "docs/llm-base/distribution-training/Bloom-176B训练经验.md",
    "chars": 965,
    "preview": "\n\n\n- https://huggingface.co/blog/zh/bloom-megatron-deepspeed\n\n\n用 FP16 训练巨型 LLM 模型是一个禁忌。FP16 经常溢出！FP16 的最大数值范围为 64k，您只能进行"
  },
  {
    "path": "docs/llm-base/distribution-training/FP16-BF16.md",
    "chars": 482,
    "preview": "\n\n\n\n\n## FP16\n\n\n数值上溢和数值下溢的问题\n\n数值上溢：大量级的数被近似为正无穷或负无穷时发生上溢，进一步运算导致无限值变为非数字。\n\n数值下溢：接近零的数被四舍五入为0时发生下溢。被零除，取零的对数，进一步运算会变为非数字。\n"
  },
  {
    "path": "docs/llm-base/distribution-training/GLM-130B训练经验.md",
    "chars": 1211,
    "preview": "\n- https://github.com/THUDM/GLM-130B/blob/main/README_zh.md\n\n\n1. 浮点数格式：FP16 混合精度\n\nFP16混合精度已经成为主流大规模模型训练框架的默认选项，用于训练十亿到百亿"
  },
  {
    "path": "docs/llm-base/distribution-training/OPT-175B训练经验.md",
    "chars": 4470,
    "preview": "\n## OPT-175B是如何炼成的\n- https://zhuanlan.zhihu.com/p/622061951\n\n\n### 训练大模型的痛点\n\n我们都知道，训练大模型需要以月计的时间，比如这次OPT-175B就要在1000个80G "
  },
  {
    "path": "docs/llm-base/distribution-training/README.md",
    "chars": 54,
    "preview": "\n\n\n\n用 FP16 训练巨型 LLM 模型是一个禁忌，它将面临更多的稳定性挑战。\n\n\n\n\n\n\n\n\n\n\n\n\n"
  },
  {
    "path": "docs/llm-base/distribution-training/自动混合精度.md",
    "chars": 111,
    "preview": "\n\n\nTorch.cuda.amp vs Nvidia apex\n\n\n\n\n\npytorch从1.6版本开始，已经内置了torch.cuda.amp，采用自动混合精度训练就不需要加载第三方NVIDIA的apex库了。\n\n\n\n"
  },
  {
    "path": "docs/llm-base/gpu-env-var.md",
    "chars": 30,
    "preview": "\n\n\nCUDA_VISIBLE_DEVICES=1 \n\n\n\n"
  },
  {
    "path": "docs/llm-base/h800-env-install.md",
    "chars": 3859,
    "preview": "\n| Fermi **†** | Kepler **†** | Maxwell **‡** | Pascal | Volta | Turing | Ampere | Ada (Lovelace) | [Hopper](https://www"
  },
  {
    "path": "docs/llm-base/monitor.md",
    "chars": 2371,
    "preview": "\n\n# NVIDIA DCGM\n\n\n## Remove Older Installations\n\nTo remove the previous installation (if any), perform the following ste"
  },
  {
    "path": "docs/llm-base/multimodal/sora.md",
    "chars": 344,
    "preview": "\n\n\n方案：VAE Encoder（视频压缩） -> Transform Diffusion （从视频数据中学习分布，并根据条件生成新视频） -> VAE Decoder （视频解压缩）\n\n从博客出发，经过学术Survey，可以推断出全貌。"
  },
  {
    "path": "docs/llm-base/nvidia-smi-dmon.md",
    "chars": 4572,
    "preview": "\n\n设备监控命令，以滚动条形式显示GPU设备统计信息。\n\nGPU统计信息以一行的滚动格式显示，要监控的指标可以基于终端窗口的宽度进行调整。如果没有指定任何GPU，则默认监控所有GPU。\n\n\n## 指定刷新时间(-d )\n```\n> nvid"
  },
  {
    "path": "docs/llm-base/nvidia-smi.md",
    "chars": 25542,
    "preview": "\n# nvidia-smi\n\n## 基本概念\n- Tx是发送数据的意思，Rx是接收数据的意思。\n\n\n\n## 基本操作\n\n```\nnvidia-smi\n```\n\n### 查询GPU卡信息和统计GPU卡数\n\n\n```\nnvidia-smi -L"
  },
  {
    "path": "docs/llm-base/rlhf/README.md",
    "chars": 303,
    "preview": "\n\n\n\n\n\n## 百川2\n\n\nReward Model:\n\n\nPrompt多样性：构造了一个200+细分类目的数据体系，尽可能覆盖用户需求，同时提升每类prompt多样性，从而提升泛化能力\nResponse多样性：用不同尺寸和阶段的百川模型"
  },
  {
    "path": "docs/llm-base/scenes/README.md",
    "chars": 6563,
    "preview": "\n\n## 任务合集\n\n```\n句子嵌入（Sentence Embedding）：将句子映射到固定维度的向量表示形式。\n文本排序（Text Ranking）：对一组文本进行排序，以确定它们与给定查询的相关性。\n分词（Word Segmenta"
  },
  {
    "path": "docs/llm-base/scenes/cv/README.md",
    "chars": 305,
    "preview": "\n\n\n## CV算法\n\n\n\n- 图像分类\n- 图像语义分割（Semantic Segmentation）\n- 目标检测（Object Detection）\n- 视频分类（video classification）\n\n\n\n\n- 人脸关键点检测"
  },
  {
    "path": "docs/llm-base/scenes/cv/paddle/README.md",
    "chars": 325,
    "preview": "\n\n\n\n\n- https://www.paddlepaddle.org.cn/documentation/docs/zh/practices/cv/landmark_detection.html\n\n\n\nwget --no-check-cer"
  },
  {
    "path": "docs/llm-base/scenes/cv/pytorch/README.md",
    "chars": 131,
    "preview": "\n\n\n## pyav\n\n\n```\npip install av\n```\n\n\n\n\n\n\n\n- Pytorch搭建训练简单的图像分割模型:https://blog.csdn.net/qq_42032507/article/details/1030"
  },
  {
    "path": "docs/llm-base/scenes/cv/reference.md",
    "chars": 57,
    "preview": "\n\n\n- https://pytorch.org/vision/stable/models.html\n\n\n\n\n\n\n"
  },
  {
    "path": "docs/llm-base/scenes/multi-modal/README.md",
    "chars": 1957,
    "preview": "\n\n\n## 算法\n\nCLIP\n\nBLIP\n\n\nBLIP2 \n\nLLaVA \n\nminiGPT4\n\nInstructBLIP\n\n\nMDETR\n\n\n\n### Stable Diffusion  \n\n扩散模型 ， 多模态任务：文生图 图生图\n\n-"
  },
  {
    "path": "docs/llm-base/scenes/multi-modal/reference.md",
    "chars": 613,
    "preview": "\n\n\n\n- [多模态大模型 CLIP, BLIP, BLIP2, LLaVA, miniGPT4, InstructBLIP 系列解读](https://zhuanlan.zhihu.com/p/653902791)\n- [AIGC爆火的背"
  },
  {
    "path": "docs/llm-base/singularity命令.md",
    "chars": 96,
    "preview": "\n\n\n\n\n\n\n\n\n\n```\n# -cleanenv选项来禁用所有环境变量，确保容器的环境是独立的\nsingularity run --cleanenv my_container.sif\n```"
  },
  {
    "path": "docs/llm-base/slurm.md",
    "chars": 1932,
    "preview": "\n\n- Slurm简介: http://hmli.ustc.edu.cn/doc/linux/slurm-install/slurm-install.html\n\n\n\n## 简介\n\n所有需运行的作业，无论是用于程序调试还是业务计算，都可以通过"
  },
  {
    "path": "docs/llm-base/分布式训练加速技术.md",
    "chars": 1149,
    "preview": "\n\n\n\n\n分布式并行技术：\n\t- 数据并行\n\t- 张量并行\n\t- 流水线并行\n\t- MOE并行（稀疏化）\n\t- ZeRO\n\t- 序列并行（LayerNorm 和 Dropout 的计算被平摊到了各个设备上，减少了计算资源的浪费；LayerN"
  },
  {
    "path": "docs/llm-base/多机RDMA性能测试.txt",
    "chars": 1389,
    "preview": "\n\n\n\n\n\n\n```\nwget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.r"
  },
  {
    "path": "docs/llm-base/机器学习中常用的数据类型.md",
    "chars": 1702,
    "preview": "\n- https://en.wikipedia.org/wiki/Bfloat16_floating-point_format#bfloat16_floating-point_format\n\n\n\n模型的大小由其参数量及其精度决定，精度通常为"
  },
  {
    "path": "docs/llm-experience.md",
    "chars": 144,
    "preview": "\n\n\n\n\n\n微调：\n\n\nPEFT总是有局限性，基于低秩的微调可能并不always work，比如：finetune与pretrain的gap过大的时候，比如中英差异。\n\n\n微调的过程不是让模型适应另外的数据分布，而是让模型更好的激发出本身的"
  },
  {
    "path": "docs/llm-inference/DeepSpeed-Inference.md",
    "chars": 282,
    "preview": "\n\n\n\n目前业界基本都针对 Transformer layer 结构特点，手工实现了算子融合。以 DeepSpeed Inference 为例，算子融合主要分为如下四类：\n\n归一化层和 QKV 横向融合：将三次计算 Query/Key/Va"
  },
  {
    "path": "docs/llm-inference/KV-Cache.md",
    "chars": 207,
    "preview": "\n\n\n最后需要注意当sequence特别长的时候，KV Cache其实还是个Memory刺客。\n\n比如batch_size=32, head=32, layer=32, dim_size=4096, seq_length=2048, flo"
  },
  {
    "path": "docs/llm-inference/LLM服务框架对比.md",
    "chars": 881,
    "preview": "\n\n\n\n\n## FlexFlow Server\n\n- https://github.com/flexflow/FlexFlow/tree/inference\n\n\n指标：\n\n每秒生成token的延迟\n\n\n模型：\n\nLLaMA-30B\n\nLLa"
  },
  {
    "path": "docs/llm-inference/README.md",
    "chars": 551,
    "preview": "\n\n\n\n吞吐量  \n\n\n延迟\n\n\n\n\n\n投机采样：\n- https://github.com/feifeibear/LLMSpeculativeSampling\n\n美杜莎：\n- https://github.com/FasterDecodi"
  },
  {
    "path": "docs/llm-inference/blog.md",
    "chars": 4859,
    "preview": "\n\n\n- LLM推理优化技术综述：KVCache、PageAttention、FlashAttention、MQA、GQA：https://zhuanlan.zhihu.com/p/655325832\n\n## KVCache\n\n\n## Pa"
  },
  {
    "path": "docs/llm-inference/flexflow/投机采样.md",
    "chars": 1489,
    "preview": "\n\n\n\n- 大模型推理妙招—投机采样（Speculative Decoding）: https://zhuanlan.zhihu.com/p/651359908\n\n为了解决推理速度慢的问题，已经进行了许多针对推理的工程优化，例如改进的计算核"
  },
  {
    "path": "docs/llm-inference/llm推理优化技术.md",
    "chars": 1875,
    "preview": "\n\n\n\n- Mastering LLM Techniques: Inference Optimization: https://developer.nvidia.com/blog/mastering-llm-techniques-infer"
  },
  {
    "path": "docs/llm-inference/llm推理框架.md",
    "chars": 155,
    "preview": "\n## vLLM\n\n适用于大批量Prompt输入，并对推理速度要求高的场景；\n\n\n\n\n\n## Huggingface TGI\n\n\n依赖HuggingFace模型，并且不需要为核心模型增加多个adapter的场景；\n\n\n\n\n\n## DeepS"
  },
  {
    "path": "docs/llm-inference/vllm.md",
    "chars": 574,
    "preview": "\n\n\n- VLLM推理流程梳理（一）: https://zhuanlan.zhihu.com/p/649974825\n- VLLM推理流程梳理（二）: https://zhuanlan.zhihu.com/p/649977422\n- 大模型"
  },
  {
    "path": "docs/llm-peft/LoRA-FA.md",
    "chars": 1317,
    "preview": "\n\n\n低秩适应方法(LoRA)可以在很大程度上减少训练参数数量,以微调大型语言模型(LLM),然而,仍需要昂贵的激活记忆更新低秩权重。减少LoRA层数或使用激活重计算可能会损害微调性能或增加计算开销。\n\n\n在本文中,我们提出了LoRA-FA"
  },
  {
    "path": "docs/llm-peft/MAM_Adapter.md",
    "chars": 914,
    "preview": "\n\n\n\n\n\n\n# ----- MAM adapter -----\nattn_mode=\"prefix\"\nattn_option=\"concat\"\nattn_composition=\"add\"\nattn_bn=30  # attn bottl"
  },
  {
    "path": "docs/llm-peft/README.md",
    "chars": 69,
    "preview": "\n\n\n\n\n- https://github.com/OpenAccess-AI-Collective/axolotl\n\n\n\n- \n\n\n\n\n"
  },
  {
    "path": "docs/llm-peft/ReLoRA.md",
    "chars": 4,
    "preview": "\n\n\n\n"
  },
  {
    "path": "docs/llm-summarize/README.md",
    "chars": 500,
    "preview": "\n\n\n\n## LLM选择标准\n\n在选本地化的LLM之前，我们先根据实际情况定义一些选择标准：\n\n- 归纳优先：我们不需要LLM在各个方面都很优秀，不需要它们会很强的coding和复杂逻辑推理能力，RAG最重要的还是出色的归纳能力；\n- 体量"
  },
  {
    "path": "docs/llm-summarize/distribution_dl_roadmap.md",
    "chars": 200,
    "preview": "\n\n\n\n分布式并行技术：\n \n- 数据并行\n- 流水线并行\n- 张量并行\n- 序列并行\n- 多维混合并行\n- 自动并行\n- MOE 并行\n\n大模型算法结构：\n\n- Transformer\n- GPT2 (345M)\n- Bloom\n- LL"
  },
  {
    "path": "docs/llm-summarize/大模型实践总结-20230930.md",
    "chars": 17927,
    "preview": "随着ChatGPT的迅速出圈，加速了大模型时代的变革。对于以Transformer、MOE结构为代表的大模型来说，传统的单机单卡训练模式肯定不能满足上千（万）亿级参数的模型训练，这时候我们就需要解决内存墙、通信墙、性能墙、调优墙等一系列问题"
  },
  {
    "path": "docs/llm-summarize/大模型实践总结.md",
    "chars": 23030,
    "preview": "随着ChatGPT的迅速出圈，加速了大模型时代的变革。对于以Transformer、MOE结构为代表的大模型来说，传统的单机单卡训练模式肯定不能满足上千（万）亿级参数的模型训练，这时候我们就需要解决内存墙和通信墙等一系列问题，在单机多卡或者"
  },
  {
    "path": "docs/llm-summarize/文档大模型.md",
    "chars": 347,
    "preview": "\n处理流程：\n\n1. 对表格或者文章文档切分成chunk，将其存入DB\n2. 根据chunk文档内容，通过prompt生成问题（qwen）\n3. 通过sentencetransformer生成embbedding(Text embeddin"
  },
  {
    "path": "docs/llm-summarize/金融大模型.md",
    "chars": 799,
    "preview": "\n\n\n\n\n## FinGPT\n\n\n- 数据集：https://github.com/AI4Finance-Foundation/FinGPT/tree/master/fingpt/FinGPT-v3\n\n\n\nFinGPT v3 系列是在新闻和"
  },
  {
    "path": "docs/llm-summarize/领域大模型.md",
    "chars": 1242,
    "preview": "\n\n\n## 领域技术标准文档或领域相关数据是领域模型Continue PreTrain的关键。\n\n现有大模型在预训练过程中都会加入书籍、论文等数据，那么在领域预训练时这两种数据其实也是必不可少的，主要是因为这些数据的数据质量较高、领域强相关"
  },
  {
    "path": "docs/transformer内存估算.md",
    "chars": 101,
    "preview": "\n\n\n\n\n\nhttps://blog.eleuther.ai/transformer-math/\n\nhttps://kipp.ly/transformer-inference-arithmetic/\n\n"
  },
  {
    "path": "faq/FAQ.md",
    "chars": 4377,
    "preview": "\n\n## FAQ\n\n\n### baichuan2报错\n\n- 'BitsAndBytesConfig' object is not subscriptable\n\nhttps://huggingface.co/baichuan-inc/Baic"
  },
  {
    "path": "git-pull-push.sh",
    "chars": 207,
    "preview": "git pull origin main\ngit add .\n\n#time=`date -Iminutes`\n#time=`date +\"%Y-%m-%d_%H:%M:%S\"`\ntime=`date +\"%Y-%m-%d\"`\necho $t"
  },
  {
    "path": "llm-algo/FLOPs.md",
    "chars": 618,
    "preview": "\n\n\n\n\n\nhttps://epochai.org/blog/backward-forward-FLOP-ratio\n\n\n如何计算FLOPs\n\n有两种方式：\n\n根据计算公式和模型结构手动推算\n\n借助第三方工具：calflops、ptflop"
  },
  {
    "path": "llm-algo/InternLM-20B.md",
    "chars": 85,
    "preview": "\n\nInternLM预训练框架\n\n大模型微调工具箱XTuner\n\n\n\nLMDeploy推理工具链\n\n\nOpenCompas大模型评测平台\n\n\n\nLagent智能体框架\n\n"
  },
  {
    "path": "llm-algo/README.md",
    "chars": 5058,
    "preview": "- 可视化：https://bbycroft.net/llm\r\n- https://zhuanlan.zhihu.com/p/644815089\r\n\r\n## 模型对比\r\n\r\n| 模型                             "
  },
  {
    "path": "llm-algo/baichuan2/baichuan.md",
    "chars": 93,
    "preview": "\n\n- https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/modeling_baichuan.py\n\n\n\n\n"
  },
  {
    "path": "llm-algo/bert/模型架构.md",
    "chars": 13520,
    "preview": "\n\n\n- https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py\n\n\n## BertEmbedd"
  },
  {
    "path": "llm-algo/bert.md",
    "chars": 687,
    "preview": "\n\n\n\n```\nBertEmbeddings\n\nBertSelfAttention\nBertSelfOutput\nBertAttention\n\n\nBertIntermediate\nBertOutput\nBertLayer\n\nBertEnco"
  },
  {
    "path": "llm-algo/bloom/README.md",
    "chars": 75,
    "preview": "\r\n\r\n\r\n\r\n\r\n- [BLOOM模型结构详解](https://juejin.cn/post/7223305855923044409)\r\n- \r\n"
  }
]

// ... and 622 more files (download for full content)

About this extraction

This page contains the full source code of the liguodongiot/llm-action GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 822 files (6.2 MB), approximately 1.7M tokens, and a symbol index with 1065 extracted functions, classes, methods, constants, and types.
